
How to get the required names

by Anonymous Monk
on Dec 25, 2007 at 16:18 UTC ( #658970=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

We get a file containing names, and the task is to extract those names that already exist in the names table. The issue is that the names in the file (e.g. Maggie Smith) come in different formats, such as the ones below (and others):

Maggie Smith
Maggie (more than one Blank Space) Smith

Thanks for your help.


Replies are listed 'Best First'.
Re: How to get the required names
by Limbic~Region (Chancellor) on Dec 25, 2007 at 16:50 UTC
    Anonymous Monk,
    It is quite simple with two assumptions:
    • The names never come in like MaggieSmith
    • There are no compound first names like "Bubba-Joe Smith"
    All you need to do is split the name string on non-alpha chars and check each name part:
    my @name_part = grep {defined && /[a-zA-Z]/} split /[^a-zA-Z]/, $name;

    This of course is probably unrealistic. In that case, you will need to do some actual parsing not just regex matching. If the above suggestion doesn't work, come back with specifics and we will see if we can't be more helpful.
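A minimal sketch of how the split approach might feed a lookup, under the same two assumptions. The `%names_table` hash and the `name_key` helper are hypothetical stand-ins for illustration; the real names table would presumably come from a database.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the names table, keyed by lowercased "first last".
my %names_table = (
    'maggie smith' => 1,
    'john doe'     => 1,
);

# Split on runs of non-letters, drop empty pieces, rejoin with a single space.
sub name_key {
    my ($name) = @_;
    my @name_part = grep { length } split /[^a-zA-Z]+/, $name;
    return lc( join ' ', @name_part );
}

for my $name ('Maggie Smith', 'Maggie    Smith', ' Maggie.Smith ', 'Bob Jones') {
    my $status = $names_table{ name_key($name) } ? 'found' : 'not found';
    print "'$name': $status\n";
}
```

Keeping the parts separated by one space preserves the first-name/last-name boundary, which is what makes this break on inputs like MaggieSmith.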

    Cheers - L~R

      Assuming the OP just needs a name for lookup (rather than separate firstname and lastname), I would go the opposite direction:
      my $normalized_name = lc $name;
      $normalized_name =~ tr/a-z//cd;    # or =~ s/[^a-z]//g
      All of the examples from the initial post would then normalize to the same string, maggiesmith. This also works for the inputs MaggieSmith and Bubba-Joe Smith (the latter producing bubbajoesmith) and for names that differ only in cApiTaLiZatiOn.

      But, like you said, the specifics are a little sparse, so this solution may also be unsuitable.
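For illustration, here is a sketch of using the normalized string as a lookup key. The `normalize_name` helper, the `@table_names` list, and the `%by_key` hash are assumptions made for the example, not anything from the original post.

```perl
use strict;
use warnings;

# Lowercase the name, then delete every character that is not a-z.
sub normalize_name {
    my ($name) = @_;
    my $normalized_name = lc $name;
    $normalized_name =~ tr/a-z//cd;
    return $normalized_name;
}

# Hypothetical contents of the names table.
my @table_names = ('Maggie Smith', 'Bubba-Joe Smith');

# Index the table by normalized key so each input needs one hash lookup.
my %by_key;
$by_key{ normalize_name($_) } = $_ for @table_names;

for my $input ('MAGGIE   smith', 'MaggieSmith', 'bubba joe smith', 'Marge Smith') {
    my $match = $by_key{ normalize_name($input) };
    printf "%-16s => %s\n", $input, defined $match ? $match : '(no match)';
}
```

One trade-off worth keeping in mind: because all separators are discarded, two genuinely different names that spell out the same letters would share a key.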

Re: How to get the required names
by hawtin (Prior) on Dec 26, 2007 at 11:53 UTC

    As long as the "spacer" characters are not alphabetic the task is, surely, simple:

    $input_str =~ s/\W+/ /g;
    $db_str    =~ s/\W+/ /g;
    if (lc($input_str) eq lc($db_str)) ...

    Of course in the real case you will also want to remove spaces at the start and end of the string etc.
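A small sketch of the same comparison with the trimming step included. The `canonical` helper and the sample strings are hypothetical. Note that `\W` treats digits and the underscore as word characters, so those are not collapsed.

```perl
use strict;
use warnings;

# Collapse runs of non-word characters to one space, trim both ends, lowercase.
sub canonical {
    my ($str) = @_;
    $str =~ s/\W+/ /g;       # digits and "_" count as word characters here
    $str =~ s/^ +| +$//g;    # remove spaces at the start and end
    return lc $str;
}

my $input_str = "  Maggie \t  Smith ";
my $db_str    = 'maggie smith';

if ( canonical($input_str) eq canonical($db_str) ) {
    print "match\n";
}
```

Running both the file value and the table value through one function keeps the two sides from drifting apart.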

    In the production case this is a much more difficult problem since you will also want to find:

    MaggieSmith
    Magie Smythe
    M Smith
    Marge Smith
    Maggie J Smith
    etc

    In a real case, good name matching is not trivial and almost always needs fuzzy matching with some combination of Soundex, known aliases and clever processing. There is stuff about this in other nodes.
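To give a feel for the Soundex part, here is a hand-rolled sketch of the classic American Soundex rules (keep the first letter, code the following consonants as digits, skip repeats, pad to four characters). The name `soundex_sketch` is made up for this example; in practice the Text::Soundex module from CPAN would be the more likely choice.

```perl
use strict;
use warnings;

# Soundex digit for each coded consonant; vowels, Y, H and W have no digit.
my %code;
@code{qw(B F P V)}         = (1) x 4;
@code{qw(C G J K Q S X Z)} = (2) x 8;
@code{qw(D T)}             = (3) x 2;
$code{L}                   = 4;
@code{qw(M N)}             = (5) x 2;
$code{R}                   = 6;

sub soundex_sketch {
    my ($word) = @_;
    my $w = uc $word;
    $w =~ tr/A-Z//cd;                 # keep letters only
    return '' unless length $w;

    my @chars = split //, $w;
    my $first = shift @chars;
    my $prev  = exists $code{$first} ? $code{$first} : 0;
    my $out   = $first;

    for my $c (@chars) {
        next if $c eq 'H' || $c eq 'W';    # H and W do not separate equal codes
        my $digit = exists $code{$c} ? $code{$c} : 0;
        $out .= $digit if $digit && $digit != $prev;
        $prev = $digit;                    # a vowel resets the previous code
    }
    return substr( $out . '000', 0, 4 );
}

print "$_ => ", soundex_sketch($_), "\n" for qw(Smith Smythe Maggie Magie Marge);
```

With these rules Smith and Smythe both code to S530 and Maggie and Magie both code to M200, while Marge codes to M620, so Soundex alone would not connect Marge to Maggie. That is where a table of known aliases would come in.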

      Thanks a lot for all your help..

Node Type: perlquestion [id://658970]
Approved by Corion