http://www.perlmonks.org?node_id=389603

bilbo800 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

A classic HTML scraping problem, but I could not find a spot-on solution.

I have a list of products for which I collect prices from different online sites. The problem is that on some sites the product is spelled ‘sony dhc15E’, on some it’s just ‘dhc15-E’, and on some it’s ‘Sony camcoder 15-e’.

Assuming I have a list of 200 such products, how do I approach the problem of taking the product name from the merchant site and finding the best match in my array of products?

Thx

Oron

Replies are listed 'Best First'.
Re: Deciding which word in an array is the closest match to a given word
by eserte (Deacon) on Sep 09, 2004 at 09:51 UTC
    Try either String::Approx or String::Similarity. The two modules differ in their approach: the former returns all matches within an "error" or "fuzziness" parameter, while the latter returns a similarity factor for two strings. Both can be tailored to your needs.
      No, that won't work because the modules are too general. "dhc15-E" and "Sony camcoder 15-e" should match, but something like "dhc16-E" should not, as that would be a different type of camcorder. The modules you mention have no knowledge of what the strings mean, and will consider "dhc15-E" and "dhc16-E" quite similar, as they differ by only one character.
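
      To make the objection concrete, here is a minimal sketch using a plain Levenshtein edit distance written in core Perl. It is a stand-in for the kind of character-level score a generic fuzzy matcher works from, not the internals of either module; the name edit_distance is made up for this example.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Plain Levenshtein edit distance (core Perl only). This is an
# illustrative stand-in, not the algorithm of any particular module.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));          # distances from the empty prefix
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,             # deletion
                $cur[$j - 1] + 1,          # insertion
                $prev[$j - 1] + $cost,     # substitution (or match)
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

my $near = edit_distance('dhc15-E', 'dhc16-E');             # a different model
my $far  = edit_distance('dhc15-E', 'Sony camcoder 15-e');  # the same product
print "different model: $near, same product: $far\n";
```

      The wrong model scores as much closer than the right product, which is exactly the problem with relying on character-level similarity alone.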
        It looks like you need to add some custom logic. From your example, it seems the substring "dhc" might stand in for "Sony camcorder". Maybe you can use a number of such mappings to get a canonical form. I can also imagine that dashes and spacing may differ, so strip all non-alphanumeric characters before comparing.
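
        A rough sketch of that idea, assuming a small hand-maintained alias table. The table contents and the name canonical are hypothetical, chosen only to cover the three spellings from the question.

```perl
use strict;
use warnings;

# Hypothetical alias table: these phrases and replacements are
# assumptions for illustration, not data from a real catalogue.
my %alias = (
    'sony camcorder' => 'dhc',
    'sony camcoder'  => 'dhc',   # misspelling quoted in the question
    'sony'           => '',      # a bare brand prefix adds no model info
);

sub canonical {
    my ($name) = @_;
    $name = lc $name;
    # Try longer phrases first so 'sony camcorder' wins over plain 'sony'.
    for my $phrase (sort { length($b) <=> length($a) } keys %alias) {
        $name =~ s/\Q$phrase\E/$alias{$phrase}/g;
    }
    $name =~ s/[^a-z0-9]//g;     # drop dashes, spaces and punctuation
    return $name;
}

print "$_ => ", canonical($_), "\n"
    for 'sony dhc15E', 'dhc15-E', 'Sony camcoder 15-e', 'dhc16-E';
```

        With this table the three spellings from the question collapse to the same key, while dhc16-E stays distinct, so an exact comparison of canonical forms is enough for these cases.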
Re: Deciding which word in an array is the closest match to a given word
by Random_Walk (Prior) on Sep 09, 2004 at 10:52 UTC
    Perhaps you need a mixture of the two suggested approaches. When one of the string approximation packages finds the best possible match against all the matches already known, you assign it as a best-guess match. You also add this match, together with the canonical product name, to a list of matched strings for human review. Once a human reviewer agrees that a match is good, it goes into the hash of known matches.

    You will never get 100%, as some very different products may be given the same name (e.g. an F15 could be an aircraft or a sunscreen).
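
    The workflow could be sketched roughly as follows. The catalogue entries, the sub names and the scoring function (a Dice coefficient over character bigrams) are all assumptions made for this example; in practice the score would come from whichever approximation module you pick.

```perl
use strict;
use warnings;

my @products = ('sony dhc15e', 'sony dhc16e', 'canon mv500');  # made-up catalogue
my %known;    # merchant string => product, confirmed by a human
my @review;   # [merchant string, best guess] pairs awaiting review

sub bigrams {
    my ($s) = @_;
    return map { substr($s, $_, 2) } 0 .. length($s) - 2;
}

# Stand-in score in the range 0 .. 1: Dice coefficient over bigrams.
sub score {
    my ($x, $y) = (lc $_[0], lc $_[1]);
    my @bx = bigrams($x);
    my @by = bigrams($y);
    return 0 unless @bx && @by;
    my %left;
    $left{$_}++ for @bx;
    my $shared = grep { $left{$_} && $left{$_}-- } @by;   # count common bigrams
    return 2 * $shared / (@bx + @by);
}

sub match {
    my ($seen) = @_;
    return $known{$seen} if exists $known{$seen};          # already confirmed
    my ($best) = sort { score($seen, $b) <=> score($seen, $a) } @products;
    push @review, [ $seen, $best ];                        # queue for a human
    return $best;                                          # provisional answer
}

sub approve {   # call once a reviewer accepts a queued pair
    my ($seen, $product) = @_;
    $known{$seen} = $product;
}

my $guess = match('Sony DHC-15E');
print "best guess: $guess\n";
approve(@{ $review[0] });            # pretend the reviewer said yes
my $again = match('Sony DHC-15E');   # now answered from %known, nothing queued
```

    The point of the design is that the fuzzy matcher is only consulted for strings nobody has reviewed yet, so its mistakes are made once and then overridden by the confirmed table.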

    Cheers,
    R.

Re: Deciding which word in an array is the closest match to a given word
by wfsp (Abbot) on Sep 09, 2004 at 09:57 UTC
    Do you know all the variations in advance? If you do, I would suggest a lookup table (hash). The keys would be all the variations you expect, and each value would be the corresponding entry in your array.
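
    A minimal sketch of such a table, using the spellings from the question. The canonical name 'Sony DHC15E' and the names %product_for and lookup are assumed for illustration.

```perl
use strict;
use warnings;

# Every spelling you expect to see, mapped to the entry in your own list.
# 'Sony DHC15E' is an assumed canonical name, not taken from real data.
my %product_for = (
    'sony dhc15e'        => 'Sony DHC15E',
    'dhc15-e'            => 'Sony DHC15E',
    'sony camcoder 15-e' => 'Sony DHC15E',
);

sub lookup {
    my ($name) = @_;
    return $product_for{ lc $name };   # undef when the spelling is unknown
}

for my $seen ('Sony camcoder 15-e', 'dhc16-E') {
    my $hit = lookup($seen);
    print "$seen => ", (defined($hit) ? $hit : '(no match)'), "\n";
}
```

    Lower-casing the key before the lookup means the table only has to list each spelling once, regardless of capitalisation on the merchant site.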