PerlMonks  

How to get the required names

by Anonymous Monk
on Dec 25, 2007 at 16:18 UTC (#658970=perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

We get a file containing names, and the task is to extract those names that already exist in the names table. The problem is that a name (e.g. Maggie Smith) can appear in the file in different formats, such as the ones below (among others):

Maggie Smith
Maggie-Smith
Maggie.Smith
Maggie_Smith
Maggie (more than one Blank Space) Smith

Thanks for your help.

Regards

Re: How to get the required names
by Limbic~Region (Chancellor) on Dec 25, 2007 at 16:50 UTC
    Anonymous Monk,
    It is quite simple with two assumptions:
    • The names never come in like MaggieSmith
    • There are no compound first names like "Bubba-Joe Smith"
    All you need to do is split the name string on non-alpha chars and check each name part:
    my @name_part = grep {defined && /[a-zA-Z]/} split /[^a-zA-Z]/, $name;

    This of course is probably unrealistic. In that case, you will need to do some actual parsing not just regex matching. If the above suggestion doesn't work, come back with specifics and we will see if we can't be more helpful.
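The split-and-check idea above could be sketched roughly as follows. This is only an illustration under the two stated assumptions: `%names_table` and `name_key` are hypothetical names, and the hash merely stands in for whatever the real names table is.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the names table: keys are "first last" in lowercase.
my %names_table = map { $_ => 1 } ('maggie smith', 'john doe');

# Split on non-alpha characters, drop the empty fields that runs of
# separators produce, and rejoin the parts with a single space.
sub name_key {
    my ($name) = @_;
    my @name_part = grep { defined && /[a-zA-Z]/ } split /[^a-zA-Z]/, $name;
    return lc( join ' ', @name_part );
}

for my $name ('Maggie Smith', 'Maggie-Smith', 'Maggie.Smith',
              'Maggie_Smith', 'Maggie    Smith', 'Bob Jones') {
    my $key = name_key($name);
    printf "%-16s => %s\n", $name,
        exists $names_table{$key} ? 'found' : 'not found';
}
```

An input such as MaggieSmith yields the single part "maggiesmith" and so would not be found, which is exactly the first assumption in the list above.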

    Cheers - L~R

      Assuming the OP just needs a name for lookup (rather than separate firstname and lastname), I would go the opposite direction:
      my $normalized_name = lc $name; $normalized_name =~ tr/a-z//cd; # or =~ s/[^a-z]//g
      All of the examples from the initial post would then become the identical string maggiesmith. This would also work for the inputs MaggieSmith and Bubba-Joe Smith (producing bubbajoesmith) and for names that differ only in cApiTaLiZatiOn.

      But, like you said, the specifics are a little sparse, so this solution may also be unsuitable.
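As a rough sketch of how that normalized string might serve as a lookup key: the `normalize` helper, the `@table_names` list and the `%by_key` hash are all made up for illustration, and a real names table would presumably be loaded from a database or file instead.

```perl
use strict;
use warnings;

# Lowercase the name and delete every character that is not a-z.
sub normalize {
    my ($name) = @_;
    my $normalized_name = lc $name;
    $normalized_name =~ tr/a-z//cd;
    return $normalized_name;
}

# Hypothetical names table, indexed by its normalized key.
my @table_names = ('Maggie Smith', 'Bubba-Joe Smith');
my %by_key = map { ( normalize($_), $_ ) } @table_names;

for my $input ('Maggie_Smith', 'MAGGIE.smith', 'MaggieSmith',
               'bubba joe smith', 'Marge Smith') {
    my $hit = $by_key{ normalize($input) };
    print "$input: ", ( defined($hit) ? "matches '$hit'" : 'no match' ), "\n";
}
```

One trade-off of this design is that stripping all separators also merges names that differ only in where the word break falls, so it is best suited to a plain "is this name known?" check.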

Re: How to get the required names
by hawtin (Prior) on Dec 26, 2007 at 11:53 UTC

    As long as the "spacer" characters are not alphabetic the task is, surely, simple:

    # \W does not match "_", so add it to the class to cover Maggie_Smith
    $input_str =~ s/[\W_]+/ /g;
    $db_str    =~ s/[\W_]+/ /g;
    if (lc($input_str) eq lc($db_str)) ...

    Of course in the real case you will also want to remove spaces at the start and end of the string etc.

    In the production case this is a much more difficult problem since you will also want to find:

    MaggieSmith
    Magie Smythe
    M Smith
    Marge Smith
    Maggie J Smith
    etc

    In a real case good name matching is not trivial and almost always needs fuzzy matching with some combination of Soundex, known aliases and clever processing. There is stuff about this in other nodes.
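To give a feel for the Soundex part of that, here is a minimal hand-rolled sketch of the classic American Soundex code, applied to each name part. The `soundex` and `fuzzy_key` subs are written here for illustration only and are not from any post in this thread; in practice the Text::Soundex module from CPAN is probably the better choice.

```perl
use strict;
use warnings;

# Digit for each coded consonant; vowels, Y, H and W have no entry.
my %code;
@code{ split //, 'BFPV' }     = (1) x 4;
@code{ split //, 'CGJKQSXZ' } = (2) x 8;
@code{ split //, 'DT' }       = (3) x 2;
@code{ split //, 'MN' }       = (5) x 2;
$code{L} = 4;
$code{R} = 6;

# Classic Soundex: first letter plus up to three digits, zero padded.
sub soundex {
    my ($word) = @_;
    ( my $letters = uc $word ) =~ tr/A-Z//cd;
    return '' unless length $letters;

    my @chars  = split //, $letters;
    my $result = shift @chars;
    my $prev   = $code{$result} || 0;
    for my $ch (@chars) {
        next if $ch eq 'H' || $ch eq 'W';    # H and W do not separate equal codes
        my $digit = $code{$ch} || 0;         # vowels and Y count as 0
        $result .= $digit if $digit && $digit != $prev;
        $prev = $digit;
        last if length($result) == 4;
    }
    return substr( $result . '000', 0, 4 );
}

# Hypothetical fuzzy key: the Soundex code of every name part.
sub fuzzy_key {
    my ($name) = @_;
    return join ' ', map { soundex($_) } grep { length } split /[^a-zA-Z]+/, $name;
}

print "$_ => ", fuzzy_key($_), "\n"
    for 'Maggie-Smith', 'Magie Smythe', 'Marge Smith';
```

With this sketch Magie Smythe gets the same key as Maggie-Smith, while Marge Smith does not, which is where a list of known aliases would have to come in. MaggieSmith and M Smith are not handled either.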

      Thanks a lot for all your help..
