Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

text string approxiamtions (concept for review)

by shemp (Deacon)
on Dec 04, 2002 at 23:48 UTC ( #217628=perlquestion: print w/ replies, xml ) Need Help??
shemp has asked for the wisdom of the Perl Monks concerning the following question:

A lot of background info here, the real question is, what do people think of the algorithm i'm about to describe, (including the hash of arrayrefs).
I'm working on a name parser, where the name information is really ownership information. This is not very clean, in the sense that "people names" and other name-related info, such as business names, and ownership information are mixed together. For simplicity, lets just say im trying to identify if a string is "people name(s)" or other info
So, i have a list of keywords that deonte non-people names, such as "Associates, Corp, Trust, Rental", etc. Now, the problem is that the name info im trying to parse is pretty ugly, in that there are often (non-standard) abbreviations, particularly in the business words. So, I've tried Lingua::Stem, with limited success, as well as some home-cooked algorithms. I want to try applying some Levenshtein like logic also, but this needs to be fast, as i process millions of names, and what is being described here is only a (logically) small part of the process.
So, i'm thinking of storing my keyword list as follows. Since each keyword will be at least 3 chars, a hash of arrayrefs seems good, where the top-level keys are the first 2 letters of each keyword, and the arrayref associated contains the remaining letters of the words. for example:
my %list = ( "ass" => ["ociation", "iates", "embly"], "res" => ["ource", "tatement", "taurant"], etc );
With this, i can substring the beginning of each token in the name string, and use %list to find possible keywords matches, then use a Levenshtein distance measure (and other algorithms) to determine the likelihood of what the information is supposed to represent.

A cool thing that hashes do not do, would be to allow a regex for a key. This way, one wouldnt have to cycle through the keys, and apply the regex to each. The Perl internal hashing algorithm could efficiently apply the regexing, so we could do something like:
my @matches = $hash_var{/^res/};

Comment on text string approxiamtions (concept for review)
Select or Download Code
Re: text string approxiamtions (concept for review)
by BrowserUk (Pope) on Dec 05, 2002 at 00:18 UTC

    Minor nit. As your driving for efficiency you'd be better off not copying the array each time, just get and use the reference.

    my $matches = $hash_var{/^res/};

    You'd still need to listify the array in order to pass it to Text::Levenshtein::distance(), but better to do it once rather than twice.

    You also might need to store the whole words in the list instead of just the suffixes as that algorthm needs that. Unless you intend re-implementing your own version, in which case you could gain/save by using passing the stem and the ref to the array of suffixes.


    Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
    Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
    Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
    Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

Re: text string approxiamtions (concept for review)
by gjb (Vicar) on Dec 05, 2002 at 00:25 UTC

    You may want to have a look at Tie::Hash::Approx and Tie::Hash::Regex. I'm not sure whether the latter is faster than a manual loop, but it might be worth a try.

    Hope this helps, -gjb-

Re: text string approxiamtions (concept for review)
by seattlejohn (Deacon) on Dec 05, 2002 at 03:48 UTC
    Be sure to look at String::Approx too.

    A cool thing that hashes do not do, would be to allow a regex for a key.

    I can't speak for earlier versions, but you can definitely use regexes as hash keys in 5.6.1:

    use strict; my %tests = ( qr/^[0-9]+$/ => 'positive integer', qr/co/i => 'company/corporation', ); foreach my $term qw(12345 Company Jones) { foreach my $key (keys %tests) { print "$term is a $tests{$key}\n" if $term =~ $key; } }

            $perlmonks{seattlejohn} = 'John Clyman';

      I poorly stated my intentions for regexs with hashes. What i want is to be able to use a regex as a lookup of values in the hash. For instance:
      my %names = ( "sean" => 29, "lisa" => 14, "bob" => 25, "matt" => 23, "tom" => 52 ); my @matches = %names{/a/};
      Here, the regex /a/ would match the keys "sean", "lisa", and "matt".
      so matches would contain the cooresponding values 29, 14, 23

      It would also be cool to be able to get the matched keys, so something like
      my @keys = keys %names{/a/};
      Then @keys would contain the array of matched keys: "sean", "lisa", "matt"

      I'm making up the syntax, but i think this better describes what i would like to be able to do.

        grep and map can probably help. First, to get a subset of hash keys:
        my @a_keys = grep {/a/} (keys %names);

        Then to get their values:
        my @a_values = map {$names{$_}} grep {/a/} (keys %names);

        You could also get the job done using hash slices, like so:
        my @a_values = @names{grep {/a/} keys %names};

        That's more like the syntax you proposed, though I find it a bit disconcerting (maybe because I don't find the @names{...} hash-slice notation particularly natural in the first place).

                $perlmonks{seattlejohn} = 'John Clyman';

        Update: only answered half the question in my original reply. Fixed here.

Re: text string approxiamtions (concept for review)
by hawtin (Prior) on Dec 05, 2002 at 23:48 UTC

    Have you looked at Soundex? It is the way that genealogists "approximately match" surnames (a few hundred years ago surname spellings were more variable than they are today). There is even a Text::Soundex CPAN module and a thread about it here

    Thought it might be of interest

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://217628]
Approved by BrowserUk
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (7)
As of 2014-07-26 06:36 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (175 votes), past polls