PerlMonks  

Handling caps for surnames with capitals in the middle (was: Irish Surnames)

by Baz (Friar)
on May 06, 2002 at 11:22 UTC ( #164280=perlquestion )
Baz has asked for the wisdom of the Perl Monks concerning the following question:

Here are some examples of Irish surnames -

McGinley
MacGee
DeVelera

With most surnames you can convert the lower-case string to its correct form by calling ucfirst, but for the cases above, things are a bit more complicated. Anyway, does anyone have any suggestions on how to handle the above names?

Edit kudra, 2002-05-06 Changed title

Re: Handling caps for surnames with capitals in the middle (was: Irish Surnames)
by ariels (Curate) on May 06, 2002 at 11:48 UTC

    Doesn't this depend on the person? I believe you can find Macdonalds and MacDonalds (not to mention McDonald's) coexisting in the phone books...

    It looks like you're short of luck: you have a social problem, so a technical solution won't do.

      Hmmm... well, all the surnames I'm accessing are stored in a database as lower-case strings. When I print them out I'd like to display them in the form I've described above, i.e. mcgee to McGee and develera to DeVelera.

        If it were my surname you were mauling, I'd be more than a bit annoyed. I get enough of my surname's Anglicised (actually Brazilinated, but it's close enough) form being auto-"corrected" into the original Russian. I am actually capable of spelling it correctly, and I do just that.

        That said, you could

        $surname =~ s/^((?:Mc|Mac|De|Da|Du)?)(.*)$/\u$1\u$2/i;
        (assumes the surname starts off lower-cased).

        Other problems: it's not smart enough. E.g. 'mack' becomes 'MacK', which is wrong; you might want .{3,} instead of .* in the regexp.
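
A minimal sketch of that .{3,} adjustment, assuming ASCII surnames that start off lower-cased. The helper name fix_prefix is made up for illustration. When fewer than three characters follow a would-be prefix, the optional group backtracks to the empty string, so only the first letter of the name is capitalised.

```perl
use strict;
use warnings;

# Hypothetical helper: capitalise a lower-cased surname, treating a leading
# mc/mac/de/da/du as a prefix only when at least three characters follow it.
sub fix_prefix {
    my ($surname) = @_;
    $surname =~ s/^((?:mc|mac|de|da|du)?)(.{3,})$/\u$1\u$2/;
    return $surname;
}

# Prints McGinley, MacGee, DeVelera and Mack, one per line.
print fix_prefix($_), "\n" for qw(mcginley macgee develera mack);
```

This still guesses: a name such as 'macken' would come out as 'MacKen' whether or not its owner writes it that way.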

        But I really don't think you should be doing this...

        Well, depending on whether you're validating (i.e. someone submits their name and you want to check whether it's in the database) or doing something for each entry in a query (give me a list of every person in the database who likes cheeseburgers), you could just canonicalize in your query. Observe:
        $sth = $dbh->prepare(qq(select * from table where lower(last_name)=?))
            or die $dbh->errstr;
        # lower-case a copy, leaving $last_name as the user entered it
        (my $new_last_name = $last_name) =~ tr/A-Z/a-z/;
        $sth->execute($new_last_name) or die $dbh->errstr;
        Now you can store the names in the database however you want, do your query, and return what the user actually entered as their name (many have made the point that they know how to spell their own name). By translating both your string and what's in the database to lowercase, you are going to find the match regardless of what case it is stored in. Be warned, though, that this basically destroys any indexing you may have had on that field, because the database doesn't know what the result of the lower() function will be until it actually calls it. So it must call it on every record in the table.
      But you might make an educated guess when you have no (reasonable) capitalisation. I think that's how humans treat this problem.

      i.e.

      "MacDonalds" eq handle_caps("irish","macdonalds");
      "Macdonalds" eq handle_caps("irish","Macdonalds");
      "MacDonalds" eq handle_caps("irish","MaCDOnaLDS");

      Assuming that "MacDonalds" is the 'preferred' spelling in Irish...
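
One way the behaviour above might be sketched. The test for "reasonable" capitalisation (looks_deliberate) and the Mc/Mac rule are assumptions made up for illustration, and the sketch handles ASCII letters only: a name that already looks deliberately capitalised is returned untouched, and anything else is rebuilt from lower case.

```perl
use strict;
use warnings;

# Hypothetical test for "reasonable" capitalisation (ASCII letters only):
# a capital followed by a lower-case letter, and no two capitals in a row.
sub looks_deliberate {
    my ($name) = @_;
    return ($name =~ /^[A-Z][a-z]/ && $name !~ /[A-Z]{2}/) ? 1 : 0;
}

sub handle_caps {
    my ($language, $name) = @_;
    return $name if looks_deliberate($name);   # trust what we were given

    $name = ucfirst lc $name;                  # otherwise start from scratch
    if ($language eq 'irish') {
        # Assumed rule: capitalise the letter after a leading Mc or Mac.
        $name =~ s/^(Mc|Mac)([a-z])/$1\u$2/;
    }
    return $name;
}

# Prints MacDonalds, Macdonalds and MacDonalds, one per line.
print handle_caps('irish', $_), "\n" for qw(macdonalds Macdonalds MaCDOnaLDS);
```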

      - Joost.

How about Dutch Surnames?
by Joost (Canon) on May 06, 2002 at 11:53 UTC
    Maybe even more exciting are the following Dutch surnames:

    de Vries
    van de Kamp
    van Limburg Stierum
    van Holthe tot Echten
    de Vos tot Nederveen Cappel
    or even
    Olde Reuver of Briel
    Rutten bij- of meergenaamd Verbeek (!)

    I guess you'd have to know the nationality of the persons involved, and then try some heuristics to detect stuff like 'Mac', 'Mc', 'van', 'de' (in Dutch, with a space), 'De' (in Irish, without a space) etc.

    I don't know of any CPAN module that does this, but it sounds like an interesting project :-)

      Anyway, here's a quick hack to show the heuristic:

      #!/usr/bin/perl -w
      use strict;

      for (qw(mcginley macgee develera)) {
          print "$_ => ".handle_caps($_)."\n";
      }

      sub handle_caps {
          # this assumes irish capitalisation!
          my $name = ucfirst(shift);
          for (qw(Mc Mac)) {
              # always ok
              $name =~ s/^$_(.*)/$_\u$1/;
          }
          for (qw(De)) {
              # may not be followed by [aoeiu]
              # ?? don't know enough irish
              # for this rule ;-)
              $name =~ s/^$_([^aoeiu].*)/$_\u$1/;
          }
          return $name;
      }
Re: Handling caps for surnames with capitals in the middle (was: Irish Surnames)
by Sidhekin (Priest) on May 06, 2002 at 12:12 UTC

    Anyway, does anyone have any suggestions on how to handle the above names?

    The best way to handle this problem is to not throw away information, but rather store the name with correct case.

    Just like the Y2k-problem, it is trivial to store correctly in the first place and non-trivial to restore the correct data after information (century or case) has been thrown away. And if you should ever need the simpler form (for presentation as two-digit year or for case-insensitive sorting of names), it is easier by far to produce from the correct form than the other way around.

    I guess everybody knew it, but I felt somebody ought to say it :-\

    The Sidhekin
    print "Just another Perl ${\(trickster and hacker)},"

      This is the way I would do it also. Store the data exactly as the user enters it (or supplies it). Then either use functional indexes, if your back-end supports them, or add another field that holds the canonicalized version (all lower, all upper, stripped of spaces, etc).
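
A rough in-memory sketch of the "extra canonicalized field" idea. The function name canonical_key and the choice of canonical form (lower-cased, whitespace removed) are assumptions for illustration, and a hash stands in for the database column.

```perl
use strict;
use warnings;

# Hypothetical canonical form: lower-cased with whitespace removed.
sub canonical_key {
    my ($name) = @_;
    (my $key = lc $name) =~ s/\s+//g;
    return $key;
}

# Each name is kept exactly as supplied; the hash key plays the role of
# the extra canonicalized column.
my %by_key;
for my $entered ('MacGee', 'van de Kamp', 'DeVelera') {
    $by_key{ canonical_key($entered) } = $entered;
}

# A lookup in any case finds the record, and the display form is untouched.
print $by_key{ canonical_key('VAN DE KAMP') }, "\n";   # prints "van de Kamp"
```

In a real table the canonical column would not be unique, since different spellings can share one canonical form; the hash here simply keeps the last one seen.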

      -derby

Re: Handling caps for surnames with capitals in the middle (was: Irish Surnames)
by Anonymous Monk on May 06, 2002 at 12:22 UTC
    First build a list (i ain't smart), then use it
    my @List = qw/ Mac Mc De /;
    my @NAMES = map ucfirst, qw/ mcginley macgee develera/;

    for my $name (@NAMES) {
        for my $match( grep {$name =~/^$_/} @List ) {
            warn $name;
            $name =~ s/^\Q$match\E(.*)$/$match.ucfirst($1)/e;
            warn $name;
        }
    }
    The output from the above code I get is
    Mcginley at - line 6.
    McGinley at - line 8.
    Macgee at - line 6.
    MacGee at - line 8.
    Develera at - line 6.
    DeVelera at - line 8.
Re: Handling caps for surnames with capitals in the middle (was: Irish Surnames)
by mrbbking (Hermit) on May 06, 2002 at 14:49 UTC
    One of the Best Things About Perl is the CPAN.
    One of the Best Things About helping CPAN-Testers is that you learn of modules you otherwise wouldn't...

    This looks like a job for Lingua::EN::NameCase.

    It misses on 'DeVelera', but gets quite a few others. Perhaps you could patch it to handle some of the names it currently misses, and submit your patch to the author?
    (or maybe the 'EN' is being narrowly interpreted...)

    use Lingua::EN::NameCase qw( nc );

    foreach my $name (qw( mcginley macgee develera )) {
        print nc( $name ), "\n";
    }
    See my homenode for more about testing for CPAN.
Re: Handling caps for surnames with capitals in the middle (was: Irish Surnames)
by dws (Chancellor) on May 06, 2002 at 16:17 UTC
    This is a human problem, not a technical problem.

    The best you can do is to gradually reduce the number of people you're pissing off by mangling their names. And unfortunately, the cost of that gradual reduction is steep.

    Consider your example of "DeVelera". A simple algorithm might get that right, but will mangle "DeMming". A more sophisticated (and expensive) algorithm might still mangle "DeVerioux". The next step is to ramp up expense by adding a name dictionary, but that isn't going to be sufficient for all cases, either.
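
A minimal sketch of the dictionary step, assuming a hand-maintained table of known spellings (the %known hash and display_name are hypothetical names). The dictionary is consulted first, and anything it does not list falls back to plain ucfirst, which is exactly where the remaining mistakes will come from.

```perl
use strict;
use warnings;

# Hypothetical dictionary of known spellings, keyed by the lower-cased name.
my %known = (
    develera => 'DeVelera',
    demming  => 'Demming',
);

sub display_name {
    my ($stored) = @_;
    my $key = lc $stored;
    return $known{$key} if exists $known{$key};   # dictionary wins
    return ucfirst $key;                          # fallback: plain ucfirst
}

# Prints DeVelera, Demming and Smith, one per line.
print display_name($_), "\n" for qw(develera demming smith);
```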

Re: Handling caps for surnames with capitals in the middle (was: Irish Surnames)
by Super Monkey (Beadle) on May 06, 2002 at 19:08 UTC
    Obviously, there is not a set of rules for how people spell their names, first, last, or otherwise. Therefore, there is not going to be a simple solution you can implement using a computer. You have to be able to define a set of rules for a system in order to describe that system. When new names are added to the db, you will have to define new rules for how to handle the names. This will lead to a one:one relationship between rule and names, meaning that the name itself is actually the rule! The simplest solution is to store them case-sensitively in the database.
Re: Handling caps for surnames with capitals in the middle (was: Irish Surnames)
by mr_mischief (Monsignor) on May 06, 2002 at 20:49 UTC
    My name is consistently anti-corrected in spelling. It is very, very annoying and actually quite offensive and somewhat angering to have someone try to change not only my name but also the country and time of my family's roots. I know how to spell my name. If your receptionist, your spell checker, your billing department, your court clerk, or anyone else decides to tell you to contact Christopher Smith, I am quite likely to give you the response that I am not Christopher Smith. If I'm in a friendly mood that day, I may point out that my name is Stith.

    There's a big difference between a modern English name from the late 1400's and a Welsh name from well before that which was taken by someone emigrating from the Black Sea area. There is similarly a big difference among MacGwire, MacGuire, Maguire, etc. The same thing I'm sure is true for names from places besides western Europe.

    If you don't know how to format a name, how to spell it, or whether or not it is mistyped, it should only be stored as provided by the user. Anything else is a raw insult to someone's family and heritage.

    Christopher E. Stith
    Do not try to debug the program, for that is impossible. Instead, try only to realize the truth. There is no bug. Then you will find that it is not the program that bends, but only the documentation.

Re: Handling caps for surnames with capitals in the middle (was: Irish Surnames)
by tmiklas (Hermit) on May 06, 2002 at 21:40 UTC
    Maybe try building an array of prefixes and then check whether the surname begins with one of them. Yes - upcase the next letter (hmmm - strip prefix, ucfirst, add prefix back at the beginning or something); No - ucfirst (if needed).

    Greetz, Tom.
Re: Handling caps for surnames with capitals in the middle (was: Irish Surnames)
by Maclir (Curate) on May 07, 2002 at 19:26 UTC
    And any text munging routine will need to handle embedded punctuation. For example:
    • O'Rourke
    • Hackforth-Jones
    I guess the best rule is - an individual knows how they want their name spelt. Use exactly what they give you.
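
If munging cannot be avoided, the embedded-punctuation cases above could be handled along these lines. This is only a sketch: cap_parts is a made-up name, it assumes lower-cased ASCII input, and it capitalises after every apostrophe or hyphen whether or not the owner of the name would.

```perl
use strict;
use warnings;

# Capitalise the first letter of the name and any letter that directly
# follows an apostrophe or hyphen (ASCII only; an assumption of this sketch).
sub cap_parts {
    my ($name) = @_;
    $name =~ s/(^|['-])([a-z])/$1\u$2/g;
    return $name;
}

# Prints O'Rourke and Hackforth-Jones, one per line.
print cap_parts($_), "\n" for ("o'rourke", 'hackforth-jones');
```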

Node Type: perlquestion [id://164280]
Approved by tye
Front-paged by astaines