Re: Module to intelligently handle names with two last names and other best practices?
by hippo (Bishop) on Jun 27, 2018 at 16:16 UTC
|
Names are illogical so forget trying to parse them arbitrarily. Treat each full name as an indivisible unit.
| [reply] |
|
True, but 99.9% of the names in my database are traditional American names, so it would be useful to handle them properly. I'm not too worried about the edge cases. I just want to be able to sort by the different types of last names.
| [reply] |
|
True, but for 99.9% of the names in my database which are traditional American names
Like Wernher Magnus Maximilian Freiherr von Braun, US citizen since April 15, 1955? Even the rules for processing his name in Germany, Austria, and Switzerland are anything but homogeneous. How would you process it? (Hint: wrong. No matter how you process it.)
If Wernher were an Austrian citizen, his name would be illegal. No "von", no "Freiherr". Just "Wernher Magnus Maximilian Braun". So if some naive Austrian coder had written a database, he would perhaps automatically remove "von" and "Freiherr". But then again, he would be wrong. Wernher was a US citizen, and previously a German citizen, so he could legally use his full name, including "Freiherr von", in Austria.
And how would you handle those people? Or this one, for whom Wikipedia lists 10 different names? Or "Ludwell Ebersole Gaines Sr." (random pick from http://politicalstrangenames.blogspot.com/), Charles Emerson Winchester III, Charles "Trip" Tucker III?
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
| [reply] |
|
Re: Module to intelligently handle names with two last names and other best practices?
by davies (Prior) on Jun 27, 2018 at 17:15 UTC
|
I would agree with AnomalousMonk that "Roy-Mayweather, Donna" looks correct. Where I have a serious problem with such modules is where spaces appear in the surname, such as Valéry Giscard d'Estaing, which should be rendered as "Giscard d'Estaing, Valéry". The best way I can think of to solve such problems is to replace spaces in surnames with a character like the non-breaking space (U+00A0), but I don't know if that would be treated as a space by other software.
Regards,
John Davies
| [reply] [d/l] |
Re: Module to intelligently handle names with two last names and other best practices?
by AnomalousMonk (Archbishop) on Jun 27, 2018 at 16:28 UTC
|
... "Donna Roy-Mayweather" gets converted to "Roy-Mayweather, Donna."
I don't understand. I would have said that "Roy-Mayweather" was the entire and correct surname of this person. What result do you want to get?
Update: Oh, or did you mean that "Donna Roy-Mayweather" should be converted to "Roy-Mayweather, Donna", but the Text::Names module does not do this correctly? (Maybe a code example would help, maybe something along the lines of How to ask better questions using Test::More and sample data?)
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] |
|
No, I was thinking "Mary Stevens Burgess" should probably be converted to "Mary Stevens-Burgess". But the answer is no, it shouldn't. See my comment below.
| [reply] |
Re: Module to intelligently handle names with two last names and other best practices?
by Paladin (Vicar) on Jun 27, 2018 at 16:26 UTC
|
If your assumption is that everyone in your data has only one first name, and everything else is a multi-part last name, then stick hyphens in place of all the spaces except the first, and use cleanNames to handle the result, as it appears to handle hyphenated names fine in your example.
Of course, if you can't assume there's only 1 first name for everyone, you need to make a set of rules on how to decide what's part of a first name, what's part of a last name. Then hyphenate appropriately, and again, use cleanNames as appropriate.
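A minimal sketch of that hyphenation step, assuming exactly one given name per person. The helper name hyphenate_surname is made up for illustration and is not part of Text::Names; its output would then be handed to cleanNames.

```perl
use strict;
use warnings;

# Hypothetical helper: treat the first word as the given name and join
# every remaining word with hyphens to form a single surname token.
sub hyphenate_surname {
    my ($full) = @_;
    my ($first, @rest) = split ' ', $full;
    return $full unless @rest;    # single-word (or empty) names pass through
    return join ' ', $first, join('-', @rest);
}

print hyphenate_surname('Mary Stevens Burgess'), "\n";    # Mary Stevens-Burgess
```

Names that are already hyphenated come back unchanged, so running this over mixed data should be harmless under the one-given-name assumption.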
| [reply] [d/l] [select] |
Re: Module to intelligently handle names with two last names and other best practices?
by BrowserUk (Patriarch) on Jun 27, 2018 at 18:09 UTC
|
I concur with the advice to store names as indivisible units -- there are several cultures in which the family name comes first and the individual name is second; and either or both can consist of two unhyphenated words.
One tip though is to index the full names by *all* the words including initials -- they are all useful for matching purposes.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the absence of evidence, opinion is indistinguishable from prejudice.
Suck that fhit
| [reply] |
|
... index the full names by *all* the words ...
I don't understand that. Wouldn't that just result in an index choked with a zillion useless John/Jan/Jacob/Juan/Joan/Ioan/Ian/Ivan/... entries (and that's just a few of the European variations of the male version of one simple given name)? I agree it's a very frustrating problem. In the end, aren't we stuck with coming up with some massive AI to figure it all out and keep it straight and usable?
| [reply] [d/l] |
|
Yes, but it can allow you to match "J L Smith" against "Mr John L. Smith" or even "Jim Smithe" and similar variations. Usually that alone won't be enough, but combined with other data and perhaps human oversight it can allow you to detect/eliminate duplicates. (You do have to be careful of John Smith Senior and John Smith Junior who live at the same address.)
I implemented this years ago on a 6 million name dataset and it helped eliminate over 20,000 misreads, data entry errors and fraudulent attempts. The indexes were implemented as bitfields -- 1 bit per full name -- which allowed any entry to be rapidly compared against all the names and reduced to a tiny subset of the 6 million for further manual investigation.
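A rough sketch of the bitfield idea, not the original implementation: each word (and, as an assumption here, each word's initial) gets a bit string built with vec(), with bit N set when full name N contains it. The sample names and the helper names keys_of and candidates are invented for illustration, and the lookup is a strict AND of all query keys, so it does not cover fuzzy matches such as "Smithe".

```perl
use strict;
use warnings;

my @names = ('Mr John L. Smith', 'Jane Smith', 'John Jones');

# Lower-cased words of a name plus the initial of each word, de-duplicated.
sub keys_of {
    my ($name) = @_;
    my @words = grep { length } split /[^a-z]+/, lc $name;
    my %seen;
    return grep { !$seen{$_}++ } @words, map { substr($_, 0, 1) } @words;
}

# One bit string per key; bit $i is set when name $i contains that key.
my %index;
for my $i (0 .. $#names) {
    for my $key (keys_of($names[$i])) {
        $index{$key} //= '';
        vec($index{$key}, $i, 1) = 1;
    }
}

# AND the bit strings of every query key; return indexes of surviving names.
sub candidates {
    my ($query) = @_;
    my $bits;
    for my $key (keys_of($query)) {
        my $set = $index{$key} // '';
        $bits = defined $bits ? ($bits & $set) : $set;    # string-wise AND
    }
    return () unless defined $bits;
    return grep { vec($bits, $_, 1) } 0 .. $#names;
}

print "$names[$_]\n" for candidates('J L Smith');    # Mr John L. Smith
```

The survivors are only candidates; as noted above, they would still need other data or a human to confirm a real duplicate.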
| [reply] |
Re: Module to intelligently handle names with two last names and other best practices?
by nysus (Parson) on Jun 27, 2018 at 18:13 UTC
|
OK, so according to the Chicago Manual of Style, double names without hyphens should be sorted on the very last part of the name:
"In the absence of a hyphen, alphabetize by the final name. Since it’s usually not possible to know for certain the origin of the name in the middle, it is treated as a middle name (not a surname) by default. Not observing this simple rule would lead to chaos: Chantelle Rutherford Smith would be listed in some directories under Rutherford and others under Smith, even if Rutherford is a middle name she was given at birth. (Note that Spanish names have their own rules; please see CMOS 16.84.)"
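Applied naively, that rule could look something like the sketch below. The function name sort_form is made up, and this deliberately ignores the cases raised elsewhere in the thread (Spanish names, particles like "von" or "d'", suffixes like "III" or "Sr.").

```perl
use strict;
use warnings;

# Sort form per the quoted rule: the final word is the surname (a hyphenated
# final word stays whole); everything before it is treated as given names.
sub sort_form {
    my ($full) = @_;
    my @parts = split ' ', $full;
    return $full if @parts < 2;
    my $surname = pop @parts;
    return "$surname, @parts";
}

my @sorted = sort { lc($a) cmp lc($b) }
             map  { sort_form($_) }
             ('Chantelle Rutherford Smith', 'Donna Roy-Mayweather', 'Mary Stevens Burgess');

print "$_\n" for @sorted;
# Burgess, Mary Stevens
# Roy-Mayweather, Donna
# Smith, Chantelle Rutherford
```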
| [reply] |
Re: Module to intelligently handle names with two last names and other best practices?
by Anonymous Monk on Jun 27, 2018 at 21:31 UTC
|
Huh ... sure seems to me that Ms. Donna's name was handled exactly as I would have expected!
| [reply] |