PerlMonks  

running riot with an regx on surnames.

by maderman (Beadle)
on Nov 27, 2001 at 11:03 UTC ( [id://127749] : perlquestion )

maderman has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have the following regex to pull the surname out of names such as Joe Bloggs, Joe De Blogg and Joe Bloggs-Night.
    use strict;

    if ($reporter !~ m/(\w+)\s*(.*)/) {
        die "Name should be in wrong format";
    }
    else {
        $first = $1;
        $last  = $2;

        # remove leading/trailing whitespace
        foreach ($first, $last) {
            s/^\s+//;
            s/\s+$//;
        }

        if ($first !~ m/[A-Za-z]/) {
            die "First name should only contain letters!";
        }

        if ($last =~ m/\d/g) {
            die "Surname should only contain letters, hyphens and apostrophes!";
        }
        elsif ($last =~ m/([A-Za-z\'\-]+)/) {
            $surname = $1;
        }
        elsif ($last =~ m/(\w+)\s*(\w+)/) {
            $surname = join(' ', $1, $3);
        }
        else {
            die "Wrong format in surname!";
        }
The problem with the above is that for names like Joe de Bloggs, I get "J De" returned and not "J De Bloggs". A little help here? Many thanks in advance. Stacy.

Replies are listed 'Best First'.
Parsing english names
by boo_radley (Parson) on Nov 27, 2001 at 11:15 UTC
Re: running riot with an regx on surnames.
by trantor (Chaplain) on Nov 27, 2001 at 14:46 UTC

    This is one of those situations where regular expressions can actually complicate things. Why not use a simple split instead?

    Also keep in mind that sometimes first names can be double or contain non-alphabetic characters, like Anne Marie or Anne-Marie (I've seen them both). In that case the first name is actually Anne Marie; Marie is not a middle name.

    Besides the middle name problem, which is objectively hard to solve programmatically, there are several inconsistencies in your script:

    $reporter !~ m/(\w+)\s*(.*)/

    \s* matches zero or more blanks, which is (I presume) not what you want: you want to check that you have at least a first name and a surname, with the first name made of alphabetic characters only (check the meaning of \w in perlre! It includes characters you don't want, such as digits and the underscore). For example, it happily accepts "name" when it should complain, since you are looking for something like "name surname". Also, you may want to look at Death to Dot Star! by the excellent Ovid.
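
    A minimal sketch of the two points above. The $strict pattern and the sample inputs are assumptions made up for illustration, not something from the original script:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from the question: \w also matches digits and underscores,
# and \s* lets a single word through with an empty surname.
my $loose = qr/(\w+)\s*(.*)/;

# A stricter, hypothetical alternative: letters only for the first name,
# at least one blank, and a remainder that starts with a non-blank.
my $strict = qr/^([A-Za-z]+)\s+(\S.*)$/;

for my $input ('Joe Bloggs', 'name', 'J0e_99 Bloggs') {
    my $l = $input =~ $loose  ? 'accepted' : 'rejected';
    my $s = $input =~ $strict ? 'accepted' : 'rejected';
    printf "%-16s loose: %s, strict: %s\n", "'$input'", $l, $s;
}
```

    The loose pattern accepts all three inputs; the stricter one only accepts 'Joe Bloggs'.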

    foreach ($first,$last) { s/^\s+//; s/\s+$//; }

    Unnecessary for the first name, because of the regexp you've used to capture it (\w+ cannot contain whitespace). Again, using split, this would be totally unnecessary for the last name as well, even if it is a double one.

    if ($last =~ m/\d/g) { die "Surname should only contain letters, hyphens and apostrophes!"; }

    The regexp and the error message state two different things. What you say in the error message (which is correct, I assume) would be a pattern like [^A-Za-z'-], which rejects anything outside the allowed set rather than just digits. Listing what's bad instead of what's good is a very common mistake, and unfortunately it leads to many security holes in programs. Please note that this pattern is by no means complete: for example, it doesn't consider names with accented letters in them, like Björn.
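
    As a rough sketch of that allow-list idea: surname_ok is a hypothetical helper name, and the blank inside the character class is my own assumption so that double surnames like "De Bloggs" pass.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if the surname consists entirely of
# ASCII letters, apostrophes, blanks and hyphens.
sub surname_ok {
    my ($surname) = @_;
    return 0 if $surname eq '';
    # Reject as soon as any character outside the allowed set shows up.
    return $surname =~ /[^A-Za-z' -]/ ? 0 : 1;
}

for my $s ("Bloggs-Night", "O'Brien", "De Bloggs", "Bl0ggs", "Bloggs;rm") {
    printf "%-14s %s\n", $s, surname_ok($s) ? 'ok' : 'rejected';
}
```

    Accented names would still be rejected here. On properly decoded strings, a Unicode property class such as \p{L} in place of A-Za-z is one way to widen the set.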

    elsif ($last =~ m/(\w+)\s*(\w+)/) { $surname = join(' ', $1, $3); }

    Same note for \s* as above. Also note that you're capturing in $2 and not $3, so $3 is undefined here.

    A starting point with split would be something like:

    #!/usr/bin/perl -w
    use strict;

    my $n = 'Joe De Blogg';
    my ($name, $surname) = split /\s+/, $n, 2;
    print "name: $name surname: $surname\n";

    -- TMTOWTDI

Re: running riot with an regx on surnames.
by Dogma (Pilgrim) on Nov 27, 2001 at 12:16 UTC

    Since you're going to use the CPAN module for this anyway (right?), I'll try to explain what I suspect is happening here. Your first regex is matching part of the string and returning successfully. This is because you didn't specify any anchors in your re.

    try replacing...

    m/([A-Za-z\'\-]+)/

    with...

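    m/^([A-Za-z\'\-]+)$/

    Now the re won't be able to return successfully on "De Bloggs", because it has to match the whole string from start to end and the blank isn't in the character class. (With only the trailing $, it would still match just "Bloggs".) If you still don't understand what's going on here, please read perldoc perlre.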

    I didn't test this so I hope it works for you.
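
    A quick way to check how anchors change the result is to run the same character class with and without them. This is only a sketch; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $last = 'De Bloggs';

# No anchors: the match stops at the first run of allowed characters.
my ($unanchored) = $last =~ /([A-Za-z'\-]+)/;

# End anchor only: the engine can still start later in the string.
my ($end_only) = $last =~ /([A-Za-z'\-]+)$/;

# Both anchors: the whole string must consist of allowed characters,
# so the embedded blank makes the match fail and $both stays undef.
my ($both) = $last =~ /^([A-Za-z'\-]+)$/;

for my $pair (['unanchored', $unanchored],
              ['end only',   $end_only],
              ['both',       $both]) {
    my ($label, $value) = @$pair;
    printf "%-11s %s\n", "$label:", defined $value ? $value : 'no match';
}
```

    On "De Bloggs" this gives "De", "Bloggs" and no match respectively, so only the fully anchored pattern lets a later elsif branch handle the two-word case.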

    Edited by footpad, ~Wed Nov 28 05:30:18 2001

Re: running riot with an regx on surnames.
by jarich (Curate) on Nov 27, 2001 at 15:55 UTC
    There are a lot of good suggestions above this. However, if you want to know why this specific approach to solving the problem doesn't work, the problem is in this line:
    elsif ($last =~ m/([A-Za-z\'\-]+)/) { $surname = $1; }
    If you're looking at the name "Joe De Bloggs", then $last by this point equals "De Bloggs". Once you do this match (and it will match, as you do have one or more of those characters in $last), $surname gets set to "De", because the match stops at the blank and the later elsif is never tried.

    If you swapped the two conditions so that you had:

    elsif ($last =~ m/(\w+)\s+(\w+)/) {      # note the + here.
        $surname = join(' ', $1, $2);        # note change to $2
    }
    elsif ($last =~ /([A-Za-z\'\-]+)/) {
        $surname = $1;
    }
    Then $surname will end up with "De Bloggs" as you desire.
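
    To try the reordered branches on their own, they can be wrapped in a small sub. This is a sketch only; surname_from is a hypothetical name and the sample inputs are my own.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the reordered branches.
# Returns undef when neither branch matches.
sub surname_from {
    my ($last) = @_;
    if ($last =~ m/(\w+)\s+(\w+)/) {        # two words first
        return join ' ', $1, $2;
    }
    elsif ($last =~ m/([A-Za-z'\-]+)/) {    # then the single-word case
        return $1;
    }
    return;
}

for my $last ('De Bloggs', 'Bloggs', 'Bloggs-Night', 'De Bloggs-Night') {
    my $surname = surname_from($last);
    print "$last => ", (defined $surname ? $surname : 'no match'), "\n";
}
```

    Note that "De Bloggs-Night" still comes back as "De Bloggs", because \w does not match the hyphen, so the ordering fix alone does not cover every case from the question.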

    Good luck.

(joealba) Devil's advocate time...
by joealba (Hermit) on Nov 27, 2001 at 19:41 UTC
    What about Sting? The artist once again known as Prince? Cher? (okay, forget about Cher.)

    What about William Jones PhD. and his wife Mrs. Judy Jones?

    Your best bet is to go with Lingua::EN::NameParse, as boo_radley says. Funny, I could have sworn that this module was written by TheDamian, but it's another Aussie. Probably lives down the street. :)