Re: Comparing inputs before inserting into MySQL for "sounds-like" similarity

by jhourcle (Prior)
on Apr 27, 2008 at 17:14 UTC (#683160)

in reply to Comparing inputs before inserting into MySQL for "sounds-like" similarity

Soundex is a standard for 'sound alike' matching, of the type that the name 'joseph' sounds like 'josef' -- I've typically seen it used for telephone book type applications, so that you can match the various Americanized spellings of names from other languages. One caveat: classic Soundex keeps the first letter of the word, so 'chris' and 'kris' come out as C620 and K620, which agree only in the digits.

The advantage of Soundex is that it essentially provides a hash of the value: when you get a new input, you compute its hash and look to see if anything in the system already has a matching hash. The other items you mentioned require comparing the string against all of the other existing strings in the system, which just doesn't scale well.
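To make the "compute the hash, then look it up" idea concrete, here is a minimal sketch of classic American Soundex with an in-memory index. The names `soundex`, `add_name` and `sound_alikes` are made up for illustration, and the dictionary stands in for whatever storage you actually use. (MySQL also ships a built-in `SOUNDEX()` function, whose output may not be identical to this sketch.)

```python
from collections import defaultdict

# Letter groups used by classic American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code for a single word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":                 # H and W do not separate equal codes
            continue
        code = CODES.get(ch, "")       # vowels (and Y) have no code
        if code and code != prev:
            out.append(code)
        prev = code                    # a vowel resets the previous code
    return ("".join(out) + "000")[:4]  # pad or truncate to 4 characters


# Hypothetical index: Soundex code -> names already stored.
index = defaultdict(list)


def add_name(name):
    index[soundex(name)].append(name)


def sound_alikes(name):
    """Look up stored names by code, without scanning every stored name."""
    return list(index.get(soundex(name), []))


add_name("joseph")
print(sound_alikes("josef"))
```

The lookup is a single key fetch, which is the scaling advantage over pairwise string comparison.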

If you were doing this with just single words, I'd probably look at the concept of stemming, where for each word, you compute the root of the word, so you can then check to ensure you're not storing 'boat' and 'boats' and 'boating', etc. (all stemming routines are different, and are language dependent ... some just handle plurals, others do more). Even with multi-word input, it's worth looking into how stemmers process words, as you may wish to make use of the idea in your solution.
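As a rough illustration of what a stemmer does, here is a deliberately naive suffix stripper. It is a toy written for this example, not any published stemming algorithm, and `naive_stem` is a made-up name; a real stemmer handles far more cases.

```python
def naive_stem(word):
    """Strip a couple of common English suffixes (toy example only)."""
    w = word.lower()
    if w.endswith("ing") and len(w) > 5:
        return w[:-3]                  # 'boating' -> 'boat'
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                  # 'boats' -> 'boat'
    return w


stored = set()
for word in ("boat", "boats", "boating"):
    stored.add(naive_stem(word))

print(stored)
```

All three inputs collapse to a single stored root, which is the duplicate check described above.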

For the examples you gave, the second one could be solved by just stemming each word -- you might be able to strip some adjectives, adverbs or other less significant modifiers, but you're going to start getting into issues of word order when computing your hashes -- in your first example, someone might've entered 'Houston New Home Builders' ... but there might be a valid reason for duplication, as language is imprecise ... is this an article about builders who make new homes (which is kinda redundant, I would think), or about home builders who didn't exist previously? If the latter, then you might have the article recur every year or two with different information.
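One way to sidestep the word-order issue is to build the hash key from the sorted set of stemmed words. The sketch below assumes that approach; `title_key`, `strip_plural` and the `ignore` word list are hypothetical names for this example, and the plural stripping is a stand-in for a real stemmer.

```python
import re


def strip_plural(word):
    """Toy stand-in for a real stemmer: drop a trailing 's'."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def title_key(title, ignore=frozenset()):
    """Build an order-insensitive key from the stemmed words of a title."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    stems = {strip_plural(w) for w in words if w not in ignore}
    return " ".join(sorted(stems))


print(title_key("Houston New Home Builders"))
print(title_key("New Home Builder Houston"))
```

Both titles produce the same key, so a single lookup would flag them as candidates. As noted above, a match only says the two are similar; whether they are true duplicates is still a judgment call.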

If you're just going for an attempt to identify plagiarism, you might compute some form of statistics on individual sentences / paragraphs (e.g., the number of times each word appears), and then see if anything already matches that item ... but you'll have to balance the rate of false positives against false negatives. And I haven't even considered your request to not add / modify tables -- I just don't think it's realistic for what you're trying to do.
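For the word-count statistics, one possible sketch is to compare word-frequency vectors with cosine similarity. That particular measure is my assumption, not something specified above, and the function names are made up for the example. Where you set the cutoff is exactly the false positive / false negative trade-off.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count how many times each word appears in a piece of text."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Similarity of two word-count vectors: 0.0 (disjoint) to 1.0 (same mix)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def looks_copied(new_text, old_text, threshold=0.9):
    # The threshold is an arbitrary illustrative value: lower it and you
    # get more false positives, raise it and you get more false negatives.
    score = cosine_similarity(word_counts(new_text), word_counts(old_text))
    return score >= threshold
```

Unlike the hash lookups, this compares one pair of texts at a time, so it would need some pre-filtering before being run against everything already stored.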
