|Think about Loose Coupling|
Re: how to generate confidence / weight score for a match?by roboticus (Chancellor)
|on Oct 11, 2012 at 00:03 UTC||Need Help??|
I don't really know the magnitude of your problem, but when I face similar problems, what I do is add a column or two to the database for working on the values. For this case, I might add clean_hostname, and confidence.
Then, I'd figure out how many different hostnames I have to deal with:
Next, I'd histogram the hostname frequencies, like:
At this point, I'll recognize some "obviously correct" values, and fix them:
Similarly, I'll recognize some "obviously useless" values, and fix them, too:
Then, I'd loop through again and find out how many records are left to go, and look at the new histogram. By inspection, you should be able to pick out a few more that you're pretty confident in, and fix them accordingly. You'll probably see a few patterns that you can use to correct the others, as well. For example, you might see a few unique mid-level domains that you can assign. For example, perhaps all hostnames ending like '.q9ix' are most likely part of '.q9ix.houston.com'.
After a few passes, I sometimes find that most of the records are complete, and the data is now good enough that I can call the task 'done'. Other times, I find that there aren't enough patterns to take advantage of, so the data set doesn't reduce in size quickly enough. That's when I pull out perl and start thinking of coding something up to help out.
By looking at the most common items left over (I'll generally look over the top 50 or 100) and recognizing some patterns, you can tear through pretty quickly. As long as the most common item remaining is a significant percentage of the remaining records, I might persue it manually. If the most common item is too small a percentage, though, it's time to consider other things.
When your only tool is a hammer, all problems look like your thumb.