
Soundex is a standard for 'sound alike' matching, of the type that the name 'smith' sounds like 'smyth' or 'joseph' sounds like 'josef' (classic Soundex keeps the first letter of the word, so 'chris' and 'kris' would not get the same code) -- I've typically seen it used for telephone book type applications, so that you can match the various Americanized spellings of names from other languages.

The advantage of soundex is that it essentially provides a hash of the value, and when you get a new input, you compute the hash and look to see if anything in the system already had a hash that matches ... the other items you mentioned require comparing the string against all of the other existing strings in the system, which just doesn't scale well.
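To make the hash-and-lookup idea concrete, here's a rough sketch of classic American Soundex in Python, plus an in-memory index keyed on the code. The names `soundex`, `index` and `candidates` are mine, purely for illustration; in practice you'd more likely store the code in an indexed column, or use a library or database routine, than roll your own.

```python
from collections import defaultdict

# Letter -> digit table for classic American Soundex.
_CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def soundex(name):
    """Return the 4-character Soundex code for a single word."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        # Emit a digit only when it differs from the previous coded letter.
        if code and code != prev:
            result += code
        # 'h' and 'w' do not separate duplicates; vowels do.
        if ch not in "hw":
            prev = code
    return (result + "000")[:4]


# Index existing values by code, then look up a new input by its code
# instead of comparing it against every stored string.
index = defaultdict(list)
for existing in ("joseph", "robert"):
    index[soundex(existing)].append(existing)

candidates = index.get(soundex("josef"), [])  # -> ['joseph']
```

Note that because the first letter is kept as-is, 'chris' (C620) and 'kris' (K620) land in different buckets, so Soundex on its own won't catch every sound-alike.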

If you were doing this with just single words, I'd probably look at the concept of stemming, where for each word, you compute the root of the word, so you can then check to ensure you're not storing 'boat' and 'boats' and 'boating', etc. (all stemming routines are different, and are language dependent ... some just handle plurals, others do more). It still might be a useful concept to look into to see what sort of processes are done, as you may wish to make use of it in your solution.
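Just to show the shape of the idea, here's a deliberately naive suffix-stripping stemmer in Python. It is not the Porter algorithm or any other real stemmer; the function name `naive_stem` and its two rules are assumptions for illustration only, and it will get plenty of English words wrong.

```python
def naive_stem(word):
    """Crude English-only stemmer: strips a trailing 'ing' or plural 's'."""
    word = word.lower()
    # Strip 'ing' only from longer words, so 'bring' or 'thing' survive.
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    # Strip a plural 's', but leave words like 'glass' alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


# All three forms collapse to the same root, so they can share one key.
roots = {naive_stem(w) for w in ("boat", "boats", "boating")}  # -> {'boat'}
```

A real stemming routine handles far more cases (and is language dependent), but the usage is the same: compute the root once on insert, and compare roots rather than raw words.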

For the examples you gave, the second one could be solved by just stemming each word -- you might be able to strip some adjectives, adverbs or other less significant modifiers, but you're going to start getting into issues of word order when computing your hashes -- in your first example, someone might've entered 'Houston New Home Builders' ... but there might be a valid reason for duplication, as language is imprecise ... is this an article about builders who make new homes (which is kinda redundant, I would think), or about home builders who didn't exist previously? If the latter, then you might have the article reoccur every year or two with different information.
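One way around the word-order issue is to normalize each title into an order-independent key before hashing it. Here's a small sketch in Python; the stopword list, the plural-stripping rule, the name `title_key` and the second sample title are all made up for illustration.

```python
import re

# Hypothetical list of words considered too insignificant to keep.
STOPWORDS = {"a", "an", "the", "in", "of", "for"}


def title_key(title):
    """Build an order-independent key: lowercase, crude stem, drop stopwords, sort."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    # Very crude plural stripping; a real stemmer would go here.
    stems = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
    return " ".join(sorted(set(stems) - STOPWORDS))


key_a = title_key("Houston New Home Builders")
key_b = title_key("New Home Builders in Houston")
# Both titles reduce to the key 'builder home houston new'.
```

The catch is exactly the ambiguity described above: once order is thrown away, 'new home builders' and 'new-home builders' are indistinguishable, so a key match should probably flag a possible duplicate for review rather than reject the insert outright.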

If you're just going for an attempt to identify plagiarism, you might compute some form of statistics on individual sentences / paragraphs (e.g., the number of times each word appears), and then see if anything already matches that item ... but you'll have to balance the rate of false positives / false negatives. ... and I haven't even considered your request to not add / modify tables -- I just don't think it's realistic for what you're trying to do.
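For the word-count statistics idea, one common approach (my choice here, not the only option) is to compare word-frequency vectors with cosine similarity. A minimal sketch in Python, where the function names and the 0.8 threshold are assumptions you'd need to tune against your own data:

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count how many times each word appears in a sentence or paragraph."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical cutoff: raising it cuts false positives but misses more
# reworded copies; lowering it does the opposite.
THRESHOLD = 0.8


def looks_like_duplicate(new_text, existing_text, threshold=THRESHOLD):
    return cosine_similarity(word_counts(new_text), word_counts(existing_text)) >= threshold
```

This still compares the new text against each stored item, so it has the same scaling problem as the other pairwise approaches; it's better suited to an occasional check than to a lookup on every insert.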

In reply to Re: Comparing inputs before inserting into MySQL for "sounds-like" similarity by jhourcle
in thread Comparing inputs before inserting into MySQL for "sounds-like" similarity by hacker

	PerlMonks