Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

how to generate confidence / weight score for a match?

by chexmix (Hermit)
on Oct 10, 2012 at 22:48 UTC ( #998337=perlquestion: print w/ replies, xml ) Need Help??
chexmix has asked for the wisdom of the Perl Monks concerning the following question:

Good evening Monks --

I am trying to partly 'backfill' or 'backrepair' a database of information about transactions wherein a user has downloaded a piece (or pieces) of data from one of our servers. More specifically, I am trying to repair some partial information, if possible.

The problem: the information in this database has been generated from a motley variety of logfiles, etc. over the past dozen years or so -- the hostname of the download-er is usually captured, to some degree, but it is sometimes only a partial hostname.

Ultimately we'd like to gather geographical information about where users are from, etc., for generating usage statistics, and so on.

What I am struggling to do currently is to match these partial hostnames (I'm working on identifying them _as_ partials, too) against a large text file of hostnames I have gathered from all the available relevant logfiles -- specifically, ones where the hostnames _don't_ have the 'partial' problem.

So, for example, if I have an entry where the host is recorded as 'blahdee.etcet' I'd like to match this to 'blahdee.etcetera.rome.it' or whatever, from my list of complete hostnames, and be able to assign a measure of 'confidence' that it is correct.

I'd guess this would be at least partly based on the number of characters in the partial hostname vs the number in the candidate full hostname ... but my maths aren't really up to* designing such a 'confidence algorithm.'

I can't show code for this part because I'm trying to plan it first -- the rest of the application this is part of is rather vanilla and dull. :) If anyone knows what sort of algorithm I should be looking for here, or whether what I'm trying to do is even feasible, I'd be grateful for a nudge in the right direction.

Thanks,

* I should say they aren't bad, but I have never taken a stats course, for example. I guess I am looking for formulae I might be able to grok without too awful much trouble.

Comment on how to generate confidence / weight score for a match?
Re: how to generate confidence / weight score for a match?
by roboticus (Canon) on Oct 11, 2012 at 00:03 UTC

    chexmix:

    I don't really know the magnitude of your problem, but when I face similar problems, what I do is add a column or two to the database for working on the values. For this case, I might add clean_hostname, and confidence.

    Then, I'd figure out how many different hostnames I have to deal with:

    select count(distinct hostname) as unique_hostnames, count(*) as record_count from my_table where clean_hostname is null

    Next, I'd histogram the hostname frequencies, like:

    select hostname, count(*) cnt from the_table where clean_hostname is null group by hostname order by count(*) desc

    At this point, I'll recognize some "obviously correct" values, and fix them:

    update my_table set clean_hostname = hostname, confidence=100.0 where hostname in ('first.correct.host', 'second.correct.host', <etc>) or hostname like '%.mybiz.com' or hostname like '%.joescompany.net'

    Similarly, I'll recognize some "obviously useless" values, and fix them, too:

    update my_table set clean_hostname = '*GARBAGE*', confidence=100.0 where hostname in ('first.correct.host', 'second.correct.host', <etc>) or hostname like '%.spamzalot.foo'

    Then, I'd loop through again and find out how many records are left to go, and look at the new histogram. By inspection, you should be able to pick out a few more that you're pretty confident in, and fix them accordingly. You'll probably see a few patterns that you can use to correct the others, as well. For example, you might see a few unique mid-level domains that you can assign. For example, perhaps all hostnames ending like '.q9ix' are most likely part of '.q9ix.houston.com'.

    update my_table set clean_hostname = hostname||'houston.com', confiden +ce=95.0 where hostname like '%.q9ix'

    After a few passes, I sometimes find that most of the records are complete, and the data is now good enough that I can call the task 'done'. Other times, I find that there aren't enough patterns to take advantage of, so the data set doesn't reduce in size quickly enough. That's when I pull out perl and start thinking of coding something up to help out.

    By looking at the most common items left over (I'll generally look over the top 50 or 100) and recognizing some patterns, you can tear through pretty quickly. As long as the most common item remaining is a significant percentage of the remaining records, I might persue it manually. If the most common item is too small a percentage, though, it's time to consider other things.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: how to generate confidence / weight score for a match?
by ELISHEVA (Prior) on Oct 11, 2012 at 04:33 UTC

    Assuming you believe your list of good hostnames is relatively complete, then the key issue for confidence is how many other believable matches are there in your list of good host names. The actual number of characters matched won't tell you much because you could have only one match possibility for a host name beginning with "www.z", but 30 hostnames for a match beginning with "www.all". So even though there are more letters in "www.all" than "z", you should have a lot less confidence in any one match on "www.all" than on a match with "www.z".

    In terms of assigning a numeric value to the match, a simple algorithm would be 1/possible_clean_names. This works if all possibilities are equally weighted as 1 and all non-possibilities get a weight of 0. If you want to get fancy and weight the possible clean matches your number would be (weight of selected match)/(sum of  weights for all possible matches).

    The design problem then is to decide what constitutes a possible match, i.e. what it is that makes you or Roboticus intuitively say that there is no other thing that could be a likely match. Note that this rule might vary from log file to log file.

    It would help if you know something about the algorithms used by these older files to truncate hostnames. For example did they use a strict 25 letter cutoff? If so, then for those log files you can assume that any hostname less than 25 characters is an exact match. For bad hostnames that are 25 characters, you could count anything in your clean name file as possible match if it matches on the first 25 characters. If there are two possible matches your confidence would be 50%. If there are four your confidence would be 25%.

    What about top level domain names? Did the log file strip off the top level domain name? If so, for that log file you should ignore failed matches on the top level of the domain name. You would still have an exact match if your clean name file had only one match on everything but the top level domain name. If there were two host names that matched but with different top level domain names then your confidence would drop to 1 in 2 or 50%

    What about typos? Where are these log files getting their truncated host names? If from HTTP headers, then typos are unlikely, but from browser request files or email headers where the sender information can be manipulated by a person and not auto-generated by a computer, typos might be an issue.

    Hopefully, human input is not an issue. But if it is for some of the log files, you might be able to find some statistical data on the web for common typos in host names. If not, you'd need to know something about the likely language and keyboard of the user making the mistake. Typos tend to be language dependent because they rely on things like adjacent keys on the keyboard, phonetic rules of the language, and the degree to which the language's orthography matches its pronunciation.

    Regardless of how you determine your list of potential typos, you could use that to add any hostname that matched except for typos to your list of possible matches. Since some typos are more probable than others, this might be a case where you would want to weight possibilities and use a weighted confidence algorithm. Mismatches due to inevitable truncation would get the highest weight since the bad name can't have ever been right. Mismatches that could have been entered in correctly, i.e. typos, but maybe weren't would get lower weights.

    For instance you could weight a possibility that mismatched because of a log file truncation algorithm as 1.00 and mismatches due to common typos as 0.50 and mismatches due to very rare typos as 0.10. (Note: weights for typos arbitrary and unscientific - I just chose numbers less than one)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://998337]
Approved by ELISHEVA
Front-paged by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (8)
As of 2014-09-30 22:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (385 votes), past polls