I've a question about finding and mangling text across the web. The solution will involve perl, LWP, etc. Before diving into the perl, I'm seeking advice on strategy.
I work for a company with a very content-rich website that contains detailed product information. Many fly-by-night small operators steal this text to describe the same products they're selling on Yahoo stores. This intellectual property theft is so egregious that Yahoo quickly shuts down these small sites when presented with evidence. (Usually they aren't even sophisticated enough to remove our brand name from their copy.) But they quickly pop back up again with new names.
Can anyone suggest good methods to mangle our text via HTML tags, entities, CSS, etc. so that it looks normal to a human browser, but foils the spiders and robots that steal it verbatim? The mangling would have to have some randomness to it, so that a simple script on their end couldn't unmangle it. (And if such mangling existed, would it stop a person from manually cutting and pasting from the browser? I know we can't stop the cut-and-paste, but would the mangled stuff then require laborious hand editing to clean up? That'd be disincentive enough...)
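To make that concrete, here's a minimal sketch of the kind of mangling I have in mind: randomly encode letters as numeric HTML entities and scatter empty <span> tags through the text, so the raw HTML differs on every request but renders identically in a browser. (The one-in-four entity rate and the empty-span trick are just placeholders, not a considered design.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch only: randomly entity-encode letters and sprinkle in empty
    # <span> tags so the served HTML varies per request but looks the
    # same on screen.
    sub mangle_text {
        my ($text) = @_;
        my $out = '';
        for my $char (split //, $text) {
            # Roughly one letter in four becomes a numeric entity
            if ($char =~ /[A-Za-z]/ && rand() < 0.25) {
                $out .= sprintf '&#%d;', ord $char;
            }
            else {
                $out .= $char;
            }
            # Occasionally insert an empty span that renders as nothing
            $out .= '<span></span>' if rand() < 0.05;
        }
        return $out;
    }

    print mangle_text("Our detailed product description goes here."), "\n";

Of course, a scraper that decodes entities and drops empty tags would undo all of this, which is partly why I'm asking whether there are better tricks.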
Can anyone suggest a good algorithm to automate our detection of stolen content? Our current method: we run certain phrases against Yahoo or Google to pick up candidate sites, then we look at each one to see whether its content is sufficiently close to ours. We can automate the search and the scan; what I'm looking for is a means to take two pages (with the HTML tags stripped) and say statistically that they contain paragraphs or bulleted lists that are essentially the same (i.e. the chance of two pages on different sites matching a paragraph that closely by coincidence is effectively zero).
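For the comparison step, here's a rough sketch of the sort of thing I imagine could work: strip the tags, break each page into overlapping 8-word shingles, and count how many shingles the two pages share. The 8-word window, the crude regex stripping, and LWP::Simple are just assumptions for illustration; the idea is that long shared shingles essentially never occur between unrelated sites by chance.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    # Sketch only: strip markup, build overlapping word n-grams
    # ("shingles"), and report how many shingles two pages share.
    sub shingles {
        my ($html, $size) = @_;
        $html =~ s/<script.*?<\/script>//gis;   # drop scripts
        $html =~ s/<[^>]+>/ /g;                 # strip remaining tags
        $html =~ s/&\w+;/ /g;                   # drop simple named entities
        my @words = grep { length } split /\W+/, lc $html;
        my %seen;
        for my $i (0 .. $#words - $size + 1) {
            $seen{ join ' ', @words[ $i .. $i + $size - 1 ] } = 1;
        }
        return \%seen;
    }

    my ($url_ours, $url_theirs) = @ARGV;
    die "usage: $0 our_url their_url\n" unless $url_ours && $url_theirs;

    my $ours_html   = get($url_ours)   or die "could not fetch $url_ours\n";
    my $theirs_html = get($url_theirs) or die "could not fetch $url_theirs\n";

    my $ours   = shingles($ours_html,   8);
    my $theirs = shingles($theirs_html, 8);

    my $shared = grep { $theirs->{$_} } keys %$ours;
    my $total  = keys %$ours;
    printf "%d of %d shingles shared (%.1f%%)\n",
        $shared, $total, $total ? 100 * $shared / $total : 0;

CPAN modules like HTML::Strip or Text::Similarity might handle parts of this better than my regexes; pointers either way are welcome.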
All suggestions most welcome --