PerlMonks  

Re: Approximate matching of company names

by the_0ne (Pilgrim)
on Oct 19, 2003 at 18:36 UTC (#300413)


in reply to Approximate matching of company names
in thread Some kind of fuzzy logic.

A different approach, which worked better for me, was to make lists of all the substrings of length n in the source string. I called these n-tuples. I compared the percentage overlap between the n-tuple sets for each name in one list to the n-tuples for each word in the other list. The best value for the length n of the tuples was three or four.

Toma, could you give me an example of what you mean by this paragraph? I don't want you to go to the trouble of code examples, I mean an example using text so I can better understand what you mean.

Thanks...


Re: Re: Approximate matching of company names
by toma (Vicar) on Oct 30, 2003 at 04:26 UTC
    Here is the requested example that shows how n-tuples work:

    In this example, take n=3. The company names to compare are "tomacorp" and "tomarcorp". The 3-tuples of tomacorp are:

    tom oma mac aco cor orp
    The 3-tuples of tomarcorp are:
    tom oma mar arc rco cor orp
    The tuples in common are:
    tom oma cor orp
    Four of the six 3-tuples in tomacorp appear in tomarcorp. This is a 67% match (4/6).
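    The worked example above can be sketched in a few lines of Perl. This is only an illustration, not the code used in the original comparison: the function names ntuples and overlap are made up here, and the score is assumed to be the fraction of the first name's distinct n-tuples that also occur in the second name.

```perl
use strict;
use warnings;

# Return the set of distinct n-tuples (substrings of length $n) of $s,
# as a hash reference whose keys are the tuples.
# (Hypothetical helper, named for this sketch only.)
sub ntuples {
    my ($s, $n) = @_;
    my %set;
    # The range is empty when the string is shorter than $n.
    for my $i (0 .. length($s) - $n) {
        $set{ substr($s, $i, $n) } = 1;
    }
    return \%set;
}

# Fraction of the n-tuples of $source that also occur in $target.
# Assumption: the score is measured relative to $source, so it is
# not symmetric in its two arguments.
sub overlap {
    my ($source, $target, $n) = @_;
    my $src = ntuples($source, $n);
    my $tgt = ntuples($target, $n);
    return 0 unless %$src;    # no tuples at all, so no match
    my $common = grep { exists $tgt->{$_} } keys %$src;
    return $common / scalar(keys %$src);
}

printf "%.0f%%\n", 100 * overlap('tomacorp', 'tomarcorp', 3);
```

    For tomacorp against tomarcorp with n=3, four of the six tuples match, so the script prints 67%. Measured the other way round, four of the seven tuples of tomarcorp match, which is about 57%, so the order of the two names matters with this definition.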

    It should work perfectly the first time! - toma
      Is there a Perl module or modules that implement this approach?
