http://www.perlmonks.org?node_id=1217518

nysus has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for a module that will intelligently handle the names of people in my database. I used the cleanNames function of the Text::Names module, but unfortunately it didn't handle people with hyphenated names very well.

For example, someone with a name of "Mary Burgess Stevens" gets translated into "Stevens, Mary Burgess" while someone named "Donna Roy-Mayweather" gets converted to "Roy-Mayweather, Donna."

I'm wondering if anyone has done any work with hyphenated/double last names and how you tackled this problem.

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest";
$nysus = $PM . ' ' . $MCF;

Replies are listed 'Best First'.
Re: Module to intelligently handle names with two last names and other best practices?
by hippo (Bishop) on Jun 27, 2018 at 16:16 UTC

    Names are illogical so forget trying to parse them arbitrarily. Treat each full name as an indivisible unit.

      True, but for 99.9% of the names in my database which are traditional American names, it would be useful to properly handle the names. I'm not too worried about the edge cases. I just want to be able to try to sort by the different types of last names. It would be useful for my purposes.


        True, but for 99.9% of the names in my database which are traditional American names

        Like Wernher Magnus Maximilian Freiherr von Braun, US citizen since April 15, 1955? Even the rules for processing his name in Germany, Austria, and Switzerland are anything but homogeneous. How would you process it? (Hint: wrong. No matter how you process it.)

        If Wernher were an Austrian citizen, his name would be illegal. No "von", no "Freiherr". Just "Wernher Magnus Maximilian Braun". So if some naive Austrian coder had written a database, he would perhaps automatically remove "von" and "Freiherr". But then again, he would be wrong. Wernher was a US citizen, and previously a German citizen, so he could legally use his full name, including "Freiherr von", in Austria.

        And how would you handle those people? Or this one, for whom Wikipedia lists 10 different names? Or "Ludwell Ebersole Gaines Sr." (random pick from http://politicalstrangenames.blogspot.com/), Charles Emerson Winchester III, Charles "Trip" Tucker III?

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        99.9% of the names in my database which are traditional American names

        You mean like Sitting Bull? Good luck calling him "Mr. Bull" to his face.

Re: Module to intelligently handle names with two last names and other best practices?
by davies (Prior) on Jun 27, 2018 at 17:15 UTC

    I would agree with AnomalousMonk that "Roy-Mayweather, Donna" looks correct. Where I have a serious problem with such modules is where spaces appear in the surname, such as Valéry Giscard d'Estaing, which should be rendered as "Giscard d'Estaing, Valéry". The best way I can think of to solve such problems is to replace spaces in surnames with a non-breaking space (&nbsp;), but I don't know whether other software would treat that as an ordinary space.

    Regards,

    John Davies
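
The non-breaking-space idea in the post above could be sketched roughly as follows. This is only an illustration under stated assumptions: `protect_surname` and `last_first` are hypothetical helpers, not part of Text::Names, and the caller is assumed to already know which words form the surname. Whether other software treats U+00A0 as whitespace varies (Perl's own `\s` may match it under Unicode rules), so the sketch splits on a literal space only.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helpers for illustration; not part of Text::Names.
my $NBSP = "\x{00A0}";    # NO-BREAK SPACE

# Glue a multi-word surname together with no-break spaces.
sub protect_surname {
    my ($given, $surname) = @_;
    (my $glued = $surname) =~ s/ /$NBSP/g;
    return "$given $glued";
}

# Split on literal spaces only, treat the last chunk as the surname,
# then turn its no-break spaces back into ordinary spaces.
sub last_first {
    my ($full) = @_;
    my @parts = split / /, $full;
    return $full if @parts < 2;
    my $surname = pop @parts;
    $surname =~ s/$NBSP/ /g;
    return "$surname, @parts";
}

binmode STDOUT, ':encoding(UTF-8)';
my $stored = protect_surname('Valéry', "Giscard d'Estaing");
print last_first($stored), "\n";
```

Note that `split / /` is deliberate: the special form `split ' '` splits on any whitespace and so could undo the protection.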

      ... spaces appear in the surname ...

      And then there's the Hispanic custom of having both paternal and maternal surnames (with spaces!). Ah, humans and their wacky ways. Go figure.


      Give a man a fish:  <%-{-{-{-<

Re: Module to intelligently handle names with two last names and other best practices?
by AnomalousMonk (Archbishop) on Jun 27, 2018 at 16:28 UTC
    ... "Donna Roy-Mayweather" gets converted to "Roy-Mayweather, Donna."

    I don't understand. I would have said that "Roy-Mayweather" was the entire and correct surname of this person. What result do you want to get?

    Update: Oh, or did you mean that "Donna Roy-Mayweather" should be converted to "Roy-Mayweather, Donna", but the Text::Names module does not do this correctly? (Maybe a code example would help, maybe something along the lines of How to ask better questions using Test::More and sample data?)



      No, I was thinking "Mary Stevens Burgess" should probably be converted to "Mary Stevens-Burgess". But the answer is no, it shouldn't. See my comment below.


Re: Module to intelligently handle names with two last names and other best practices?
by Paladin (Vicar) on Jun 27, 2018 at 16:26 UTC

    If your assumption is that everyone in your data has only 1 first name, and everything else is a multi-part last name, then stick hyphens in place of all the spaces except the first, and use cleanNames to handle the result, as it appears from your example to handle hyphenated names fine.

    Of course, if you can't assume there's only 1 first name for everyone, you need to make a set of rules on how to decide what's part of a first name, what's part of a last name. Then hyphenate appropriately, and again, use cleanNames as appropriate.
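
A minimal sketch of the pre-processing step described above, assuming exactly one first name per person. `hyphenate_surname` is a hypothetical helper name; the call into Text::Names is left out because its exact interface is not shown in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first word as the given name and join
# every remaining word with hyphens to form a single surname token.
sub hyphenate_surname {
    my ($full) = @_;
    my ($first, @rest) = split ' ', $full;    # split on any whitespace
    return $full unless defined $first && @rest;
    return join ' ', $first, join('-', @rest);
}

print hyphenate_surname($_), "\n"
    for 'Mary Burgess Stevens', 'Donna Roy-Mayweather', 'Juan Garcia Lopez Ruiz';
```

The output of this step would then be passed to the name-cleaning routine. Names with more than one given name need the extra rules mentioned above before hyphenating.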

Re: Module to intelligently handle names with two last names and other best practices?
by BrowserUk (Patriarch) on Jun 27, 2018 at 18:09 UTC

    I concur with the advice to store names as indivisible units -- there are several cultures in which the family name comes first and the individual name is second; and either or both can consist of two unhyphenated words.

    One tip, though, is to index the full names by *all* the words they contain, including initials, for matching purposes.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
      ... index the full names by *all* the words ...

      I don't understand that. Wouldn't that just result in an index choked with a zillion useless John/Jan/Jacob/Juan/Joan/Ioan/Ian/Ivan/... entries (and that's just a few of the European variations of the male version of one simple given name)? I agree it's a very frustrating problem. In the end, aren't we stuck with coming up with some massive AI to figure it all out and keep it straight and usable?



        Yes, but it can allow you to match "J L Smith" against "Mr John L. Smith" or even "Jim Smithe" and similar variations. Usually that alone won't be enough, but combined with other data and perhaps human oversight it can allow you to detect/eliminate duplicates. (You do have to be careful of John Smith Senior and John Smith Junior who live at the same address.)

        I implemented this years ago on a 6 million name dataset, and it helped eliminate over 20,000 misreads, data entry errors and fraudulent attempts. The indexes were implemented as bitfields -- 1 bit per full name -- which allowed any entry to be rapidly compared against all the names and reduced to a tiny subset of the 6 million for further manual investigation.
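
A rough sketch of what such a word-and-initial index with one bit per full name might look like. This is not the original implementation, only an illustration of the idea using Perl's built-in `vec`; the helper names (`tokens`, `build_index`, `candidates`) and the sample data are assumptions, and fuzzy matches such as "Smithe" are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-cased words of a name plus each word's initial.
sub tokens {
    my ($name) = @_;
    my %seen;
    for my $word (map { lc } $name =~ /(\w+)/g) {
        $seen{$word} = 1;
        $seen{ substr $word, 0, 1 } = 1;
    }
    return sort keys %seen;
}

# One bit string per token; bit $i is set if name number $i has that token.
sub build_index {
    my @names = @_;
    my %index;
    for my $i (0 .. $#names) {
        for my $tok (tokens($names[$i])) {
            $index{$tok} //= '';
            vec($index{$tok}, $i, 1) = 1;
        }
    }
    return \%index;
}

# AND together the bit strings of every query word; the surviving bits
# are the candidate records to hand over for closer (manual) inspection.
sub candidates {
    my ($index, $count, $query) = @_;
    my @words = map { lc } $query =~ /(\w+)/g;
    return () unless @words;
    my $bits;
    for my $w (@words) {
        my $v = $index->{$w} // '';
        $bits = defined $bits ? ($bits & $v) : $v;    # string bitwise AND
    }
    return grep { vec($bits, $_, 1) } 0 .. $count - 1;
}

my @names = ('Mr John L. Smith', 'Jane Smith', 'John Lee Jones');
my $index = build_index(@names);
print "$names[$_]\n" for candidates($index, scalar @names, 'J L Smith');
```

Because the vectors are plain strings, `&` here is Perl's string bitwise AND, which compares all records for a token pair in one operation rather than looping over names.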


Re: Module to intelligently handle names with two last names and other best practices?
by nysus (Parson) on Jun 27, 2018 at 18:13 UTC

    OK, so according to the Chicago Manual of Style, double names without hyphens should be sorted on the very last part of the name:

    "In the absence of a hyphen, alphabetize by the final name. Since it’s usually not possible to know for certain the origin of the name in the middle, it is treated as a middle name (not a surname) by default. Not observing this simple rule would lead to chaos: Chantelle Rutherford Smith would be listed in some directories under Rutherford and others under Smith, even if Rutherford is a middle name she was given at birth. (Note that Spanish names have their own rules; please see CMOS 16.84.)"
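
The rule quoted above could be sketched as a sort key along these lines. It is a naive illustration only: `sort_form` is a hypothetical helper, the last whitespace-separated word (hyphenated or not) is taken as the surname, and suffixes such as "III" or "Sr." as well as Spanish names are deliberately not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: "Given Middle Last" -> "Last, Given Middle".
# A hyphenated surname is a single word, so it stays in one piece.
sub sort_form {
    my ($full) = @_;
    my @parts = split ' ', $full;
    return $full if @parts < 2;
    my $last = pop @parts;
    return "$last, @parts";
}

my @people = (
    'Mary Burgess Stevens',
    'Chantelle Rutherford Smith',
    'Donna Roy-Mayweather',
);

# Sort case-insensitively on the inverted form.
my @sorted = sort { lc(sort_form($a)) cmp lc(sort_form($b)) } @people;
print sort_form($_), "\n" for @sorted;
```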


Re: Module to intelligently handle names with two last names and other best practices?
by Anonymous Monk on Jun 27, 2018 at 21:31 UTC
    Huh ... sure seems to me that Ms. Donna's name was handled exactly as I would have expected!