PerlMonks  

abbreviation checking

by shemp (Deacon)
on Dec 02, 2002 at 18:09 UTC ( #216998=perlquestion )

shemp has asked for the wisdom of the Perl Monks concerning the following question:

I am working on an app that tries to parse text that may be very poorly formatted. The text is mainly public record data, and most of it was entered into various systems many years ago by people who never thought about algorithmic parsing. They just entered the text into the field with all sorts of goofy abbreviations.
My algorithm is trying to clean up some of the inconsistencies. For instance, the word "Trust" may be abbreviated as "Trst", "Tst", "Tru", etc.
So, I'm thinking of using String::Approx to match potentially abbreviated words against a pre-determined list. Would this be wise, or does anyone know of a module more specific to my needs?

thanks

Replies are listed 'Best First'.
Re: abbreviation checking
by gjb (Vicar) on Dec 02, 2002 at 19:05 UTC

    Frankly, I don't think that Text::Soundex is the way to go; it is most probably too coarse a measure for what you want to achieve.

    Apart from String::Approx there are also Text::WagnerFischer and Text::Levenshtein; both are string distance measures. I'd use gradual refinement: start with a large threshold value to collect candidate abbreviations, then take smaller values until the noise level is acceptable. If the lists of words are not too large, I'd validate them manually so that each word corresponds to the right set of abbreviations. After that, it's a simple matter of substituting strings.

    Just my 2 cents, -gjb-

    Update: You might also want to have a look at Text::KeyboardDistance to catch typos.

Re: abbreviation checking
by dree (Monsignor) on Dec 02, 2002 at 20:34 UTC
      While making an MP3-renaming script, which attacks a problem similar to yours, I used a combination of Metaphone and "distance" modules. My approach:
      1. Get a list of "known-good" words. I use already-verified MP3 filenames as a source of these.
      2. Calculate their Metaphones.
      3. Calculate the Metaphone of any new words and look for matches. If none, see if there are any matches with a distance of 1 or 2. Distances larger than 2 produce too many matches.
      4. Have the user confirm the 'corrections'.
      It's not an exact science, and human intervention is unavoidable if correctness matters.
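
The steps above might look something like this in outline. It is a sketch under several assumptions: it is written in Python, `phonetic_key` is a deliberately crude stand-in and not real Metaphone, the distance is taken between keys (the post does not say whether keys or words are compared), and the word list and helper names are invented for the example.

```python
import re


def phonetic_key(word: str) -> str:
    """Crude stand-in for a Metaphone code (NOT real Metaphone):
    lower-case, keep the first letter, drop later vowels, collapse repeats."""
    w = re.sub(r"[^a-z]", "", word.lower())
    if not w:
        return ""
    key = w[0] + re.sub(r"[aeiouy]", "", w[1:])
    return re.sub(r"(.)\1+", r"\1", key)


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (two-row dynamic programme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def build_index(known_good):
    """Steps 1 and 2: map each key to the known-good words that produce it."""
    index = {}
    for w in known_good:
        index.setdefault(phonetic_key(w), []).append(w)
    return index


def suggest(word, index, max_distance=2):
    """Step 3: exact key match first, else keys within `max_distance` edits."""
    key = phonetic_key(word)
    if key in index:
        return list(index[key])
    scored = []
    for k, words in index.items():
        d = levenshtein(key, k)
        if d <= max_distance:
            scored.extend((d, w) for w in words)
    return [w for _, w in sorted(scored)]


# Hypothetical known-good list (assumption for illustration).
index = build_index(["Trust", "Company", "Revocable"])

for raw in ("Trst", "Tst", "Tru", "Xyz"):
    # Step 4 is left to the caller: show the suggestions and let a person confirm.
    print(raw, "->", suggest(raw, index))
```

Keeping step 4 outside `suggest` means the function only proposes corrections; nothing is rewritten until someone has confirmed the match.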
Re: abbreviation checking
by BrowserUk (Pope) on Dec 02, 2002 at 18:16 UTC

    Text::Soundex might help you.


    Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
    Pick up your cloud down the end and "Yes", if you get allocated a grey one, they are a bit damp under foot, but someone has to get them.
    Get used to the wings fast 'cos it's an 8 hour day...unless the Governor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
    Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

Re: abbreviation checking
by Anonymous Monk on Dec 03, 2002 at 00:41 UTC
    I have a similar problem, except it deals with names and addresses.
    You know
    John Smith 123 North East First St. Doofus Ville, GA 31314
    Should match
    J. Smith 123 NE 1 St. Doofus Ville, Geogria 31314
    Anybody got any good packages for that?
