Methodology for matching in hashes

rsiedl has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I'm after a bit of advice/suggestions on how to best achieve the following:

I have a list of names + initials that I pull from a database and want to check to see if there could be similar names in the db...

smith j could match smith jn
mayer-handel c could match mayer handel c
jones t jnr could match jones t
etc...

I can only think of two methods for doing this:

1. do a select using mysql - my problem here is i dont know how to do such a complex select (or even if it could be done)
2. select all names, store them in a hash and then foreach author scroll through the hash running a "match sub" on each name to see if it matches

# Do some matching
foreach my $auth_id (keys %authors) {
  foreach my $mid (keys %authors) {
        print "$authors{$auth_id}{'author'} \t matches \t $authors{$mi
+d}{'author'}\n" if ( &matchme( $authors{ $auth_id}, $authors{$mid} ) 
+);
  } # end-foreach
} # end-foreach

sub matchme {
  # return true for a match otherwise false.
} # end-sub
[download]

2 doesnt seem like the most efficient way of doing things, (10 000 records = 100 000 loops) so I was after any other ideas about the path to take?
Any help would be appreciated.

Cheers,
Reagen

Comment on Methodology for matching in hashes Download Code

Replies are listed 'Best First'.
Re: Methodology for matching in hashes by rodion (Chaplain) on Jul 13, 2006 at 01:59 UTC
I suggest going canonical. You decide (or decide, try and decide again) what things are equivalent, and generate a function to transform all names into a canonical form. If two names colapse onto the same canonical form, then they are at least probably equivalent. If needed, you can then do whatever further tests on the probable matches to find the ones that are definite matches. In your example, the canonical() sub would maybe replace hyphens in last names by spaces, and eliminate the last initial. The sub doesn't need to return just one cannonical form either, it can return an array of them for different types of matches. Once you've done that the rest is fairly strightforward. `while ($name = shift @inp_names) { for $canon ((canonical($name))) { #extra parens force list # add $name to an anon array, stored under the $canon key push @{$possible_matches{$canon}}, $name; } } for $canon (keys %possible_matches) { print "matches under $canon= ", join(',', @{$possible_matches{$canon}}) "\n"; } # for illustration. untested` [download] Hope this serves to get you started. Happy matching.	[reply] [d/l]
Re: Methodology for matching in hashes by Dervish (Friar) on Jul 13, 2006 at 05:06 UTC
There is also the soundex algorithm, which allows you to search for matches with similar sounds ("Fry" vs "Frye", for example). I don't know if that's something you want to check for, but someone has to have implemented it in Perl by now.	[reply]
Re: Methodology for matching in hashes by NiJo (Friar) on Jul 13, 2006 at 19:22 UTC
String::Approx would be a perfect match for your job.	[reply]


Keep It Simple, Stupid
	PerlMonks