Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Methodology for matching in hashes

by rsiedl (Friar)
on Jul 13, 2006 at 00:56 UTC ( [id://560850]=perlquestion: print w/replies, xml ) Need Help??

rsiedl has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I'm after a bit of advice/suggestions on how to best achieve the following:

I have a list of names + initials that I pull from a database and want to check to see if there could be similar names in the db...
smith j could match smith jn
mayer-handel c could match mayer handel c
jones t jnr could match jones t
etc...
I can only think of two methods for doing this:

1. do a select using mysql - my problem here is i dont know how to do such a complex select (or even if it could be done)
2. select all names, store them in a hash and then foreach author scroll through the hash running a "match sub" on each name to see if it matches

# Do some matching foreach my $auth_id (keys %authors) { foreach my $mid (keys %authors) { print "$authors{$auth_id}{'author'} \t matches \t $authors{$mi +d}{'author'}\n" if ( &matchme( $authors{ $auth_id}, $authors{$mid} ) +); } # end-foreach } # end-foreach sub matchme { # return true for a match otherwise false. } # end-sub
2 doesnt seem like the most efficient way of doing things, (10 000 records = 100 000 loops) so I was after any other ideas about the path to take?
Any help would be appreciated.

Cheers,
Reagen

Replies are listed 'Best First'.
Re: Methodology for matching in hashes
by rodion (Chaplain) on Jul 13, 2006 at 01:59 UTC
    I suggest going canonical. You decide (or decide, try and decide again) what things are equivalent, and generate a function to transform all names into a canonical form. If two names colapse onto the same canonical form, then they are at least probably equivalent. If needed, you can then do whatever further tests on the probable matches to find the ones that are definite matches.

    In your example, the canonical() sub would maybe replace hyphens in last names by spaces, and eliminate the last initial. The sub doesn't need to return just one cannonical form either, it can return an array of them for different types of matches.

    Once you've done that the rest is fairly strightforward.

    while ($name = shift @inp_names) { for $canon ((canonical($name))) { #extra parens force list # add $name to an anon array, stored under the $canon key push @{$possible_matches{$canon}}, $name; } } for $canon (keys %possible_matches) { print "matches under $canon= ", join(',', @{$possible_matches{$canon}}) "\n"; } # for illustration. untested
    Hope this serves to get you started. Happy matching.
Re: Methodology for matching in hashes
by Dervish (Friar) on Jul 13, 2006 at 05:06 UTC
    There is also the soundex algorithm, which allows you to search for matches with similar sounds ("Fry" vs "Frye", for example). I don't know if that's something you want to check for, but someone has to have implemented it in Perl by now.
Re: Methodology for matching in hashes
by NiJo (Friar) on Jul 13, 2006 at 19:22 UTC
    String::Approx would be a perfect match for your job.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://560850]
Approved by McDarren
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others learning in the Monastery: (5)
As of 2024-04-19 23:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found