PerlMonks
Re^2: Fuzzy matching of text strings
by buttroast (Scribe) on Dec 14, 2005 at 19:22 UTC [id://516740]
Soundex is a great tool, but in this case it is not doing anything useful. The first four descriptions in your sample return the same Soundex code because only the "Promess" portion of each record was processed.
Basically:

1. Grab the first letter:
   String: Promessa H...   Soundex: P
2. Remove all vowels in the remaining string:
   String: rmssH   Soundex: P
3. Condense duplicate letters:
   String: rmsH   Soundex: P
4. Assign 3 digits from left to right based on the following key:
   1. b,p,f,v
   2. c,s,k,g,j,q,x,z
   3. d,t
   4. l
   5. m,n
   6. r

   String: rmsH   Soundex: P6 (6 is for r)
   String: msH   Soundex: P65 (5 is for m)
   String: sH   Soundex: P652 (2 is for s)

DONE AT 3 DIGITS!!! GO NO FURTHER.

If there are consecutive characters from the same group, such as in the name "Duck" (c and k are both in group 2), they are coded only once, so the resulting Soundex would be D200 (zeros are added to pad on the right if we run out of letters to change to numbers).

In summary, Soundex is not appropriate for comparing longer strings. If you use it, the following would all be grouped as P652:

Promessa National Bank
Promessing Fertilizer Company
Promessa High Spirits
Promessing With Me

Hope this clears up Soundex for everyone.
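For anyone who wants to see the steps above run, here is a rough sketch in Python. It is not the code from the original sample, and the names `soundex` and `CODES` are my own. It assumes group 2 contains j, as in standard Soundex. It also follows one standard rule not spelled out above: h and w do not separate two letters of the same group.

```python
# Letter groups from the key above (group 2 assumed to contain j).
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(text):
    """Return a 4-character Soundex code (letter + 3 digits) for text."""
    # Keep letters only, so spaces and punctuation are ignored.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0].upper()
    digits = []
    prev = CODES.get(letters[0], "")  # group of the previous coded letter
    for ch in letters[1:]:
        code = CODES.get(ch, "")  # vowels, h, w, y have no code
        if code and code != prev:
            digits.append(code)
        # Assumption from standard Soundex: h and w do not break a run
        # of same-group letters, but vowels do.
        if ch not in "hw":
            prev = code
        if len(digits) == 3:
            break  # done at 3 digits, go no further
    # Pad on the right with zeros if we ran out of letters.
    return (first + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    samples = [
        "Promessa National Bank",
        "Promessing Fertilizer Company",
        "Promessa High Spirits",
        "Promessing With Me",
        "Duck",
    ]
    for s in samples:
        print(s, "->", soundex(s))
```

Everything after the third digit is ignored, so the four "Promess..." descriptions cannot be told apart no matter what follows.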
Thanks
buttroast