comment on

A lot of background info here, the real question is, what do people think of the algorithm i'm about to describe, (including the hash of arrayrefs).
I'm working on a name parser, where the name information is really ownership information. This is not very clean, in the sense that "people names" and other name-related info, such as business names, and ownership information are mixed together. For simplicity, lets just say im trying to identify if a string is "people name(s)" or other info
So, i have a list of keywords that deonte non-people names, such as "Associates, Corp, Trust, Rental", etc. Now, the problem is that the name info im trying to parse is pretty ugly, in that there are often (non-standard) abbreviations, particularly in the business words. So, I've tried Lingua::Stem, with limited success, as well as some home-cooked algorithms. I want to try applying some Levenshtein like logic also, but this needs to be fast, as i process millions of names, and what is being described here is only a (logically) small part of the process.
So, i'm thinking of storing my keyword list as follows. Since each keyword will be at least 3 chars, a hash of arrayrefs seems good, where the top-level keys are the first 2 letters of each keyword, and the arrayref associated contains the remaining letters of the words. for example:

my %list = (
   "ass"   =>   ["ociation", "iates", "embly"],
   "res"   =>   ["ource", "tatement", "taurant"],
   etc
);
[download]

With this, i can substring the beginning of each token in the name string, and use %list to find possible keywords matches, then use a Levenshtein distance measure (and other algorithms) to determine the likelihood of what the information is supposed to represent.

A cool thing that hashes do not do, would be to allow a regex for a key. This way, one wouldnt have to cycle through the keys, and apply the regex to each. The Perl internal hashing algorithm could efficiently apply the regexing, so we could do something like:

my @matches = $hash_var{/^res/};
[download]

In reply to text string approxiamtions (concept for review) by shemp

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


good chemistry is complicated, and a little bit messy -LW
	PerlMonks