
Text Analysis Tools to compare Slinker and Stinker?

by Cody Pendant (Prior)
on Jan 21, 2003 at 23:40 UTC (#228890=perlquestion)
Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm a member of an online community.

This community has a member who was so annoying, he was banned. Let's call him Stinker.

Shortly after he was banned, another member joined, who has the same ISP and whose posting style and mannerisms are remarkably similar to the banned member. Let's call him Slinker.

He's better behaved, but a lot of people are convinced that it's just the same old guy, and he's laughing at us for having banned him and let him right back on the next day.

It's causing bad feeling.

So, are there textual analysis tools, linguistic or brute-force pattern-matching tools I can use in Perl? I'm hoping I can feed a module 1000 lines of Slinker and 1000 lines of Stinker and have it say something like "These two files have a Herzenberger-Foogenboogen Written English Similarity Rating of 97%".

I would of course have a control, and hopefully compare it with a third member who's above suspicion to show that they had a H.F.W.E.S.R. of like 30%, to make it above board and judicious?
--
“Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D

Re: Text Analysis Tools to compare Slinker and Stinker?
by mojotoad (Monsignor) on Jan 21, 2003 at 23:54 UTC
    Yes, my Precious. It wants linguistic analysis, does it?
    ;)

    I don't know if there's ever going to be a foolproof way to do what you ask, since to match the level of discernment we humans are capable of you would, in the end, need a module capable of true language comprehension.

    Probably a compromise between a full-blown human brain and some dirty matching would be some method (or several, cross-correlated) of fingerprinting the linguistic patterns. Lingua::EN::Fathom might be a good place to start, along with some Bayesian filtering scheme such as Mail::SpamTest::Bayesian. Toss a well-designed neural net in there and you might have something.

    I've considered playing with this sort of thing myself -- let me know if you run with it or find something relevant that I have neglected to mention.

    Matt

Re: (nrd) Text Analysis Tools to compare Slinker and Stinker?
by newrisedesigns (Curate) on Jan 22, 2003 at 00:05 UTC

    Well, you could (depending on the aptitude of this person) set a unique ID cookie and cross-check it with the ISP (netblock or static IP) and the User-Agent to track the fellow, but that doesn't help you right now.

    Sure, you could follow mojotoad's lead and compare the text of his posts. That would make for an interesting follow-up on how you implemented the software.

    I hate to give a non-Perlish answer, but did forgiveness ever cross your mind? I think if this "Slinker" slinked his or her way back to your forum, perhaps they're deserving of another chance. Even on Perl Monks, there's reincarnation.

    Of course, if they do it again, give 'em hell.

    John J Reiser
    newrisedesigns.com

Re: Text Analysis Tools to compare Slinker and Stinker?
by kvale (Monsignor) on Jan 22, 2003 at 00:12 UTC
    I don't know of any such automatic comparison software. It is very hard to parse English, much less attach a semantics to the text. So most folks doing author comparisons stick to simple statistics, such as average length of words, average number of words in a sentence, and relative word frequencies for the two authors.

    The statistical measures still won't prove Slinker is Stinker, however.
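A minimal sketch of the simple statistics mentioned above (average word length, average words per sentence, relative word frequencies). It is written in Python for brevity; the same counting maps directly onto Perl hashes. The function name `style_stats` and the crude sentence-splitting rule are assumptions made for illustration, not anything taken from an existing module.

```python
import re
from collections import Counter

def style_stats(text):
    """Return simple stylometric statistics for a block of text."""
    # Crude sentence split: break on runs of ., ! or ? (an assumption;
    # abbreviations such as "e.g." will be split incorrectly).
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    # Words are runs of letters, optionally with one inner apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        raise ValueError("text contains no words")
    total = len(words)
    freq = Counter(words)
    return {
        "avg_word_len": sum(len(w) for w in words) / total,
        "avg_sentence_len": total / len(sentences),
        "rel_freq": {w: c / total for w, c in freq.items()},
    }

if __name__ == "__main__":
    stats = style_stats("I am here. You are not here.")
    print(stats["avg_word_len"], stats["avg_sentence_len"])
```

Comparing these numbers for Stinker, Slinker and a few control posters shows how far apart the authors are on each measure, but as noted, closeness alone proves nothing.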

    -Mark

Re: Text Analysis Tools to compare Slinker and Stinker?
by Marza (Vicar) on Jan 22, 2003 at 00:41 UTC

    I'll give a mod answer instead of a Perl answer. As a mod of a gaming site, I would say that it is not worth the effort. Trolls have a tendency to get bored and move on. Writing a script to compare posts? Well, some of the trolls I have had were smart enough to change the way they spoke. You don't want to ban an innocent, do you? In every case they have either "grown up" or moved on.

    So for the amount of time you would spend designing this, you might find it easier to simply watch Slinker, wait for the rules to be violated and then ban him. After a while he will get bored and move on.

    Then again, he may change his attitude. I have seen some of the best members start out as annoying pests. It is all part of the maturity thing.

Re: Text Analysis Tools to compare Slinker and Stinker?
by BigLug (Chaplain) on Jan 22, 2003 at 00:43 UTC
    Excellent question!
    Unfortunately, as far as linguistic analysis goes, I can't offer any better solution than those above. However, it would be possible to give his ISP two IP numbers along with timestamps and ask if they're the same person. You could explain the reason and point out that you're not asking who the person is, only if they're the same person. They might answer; they might not.

    In order to stop it happening again (without going through the ISP), you could try the cookie idea .. but I'm not too sure it will work. If he really wanted to, he could find the cookie.

    Instead, you might consider that if he is so desperate to be a part of the community, he should be willing to reform. Thus I'd suggest emailing him and letting him think you've done some sort of analysis and thus you "know" it's him. Tell him why he was banned in the first place and that he's on a very short leash. If this annoys him to the point where he leaves, then is it any loss? If he re-offends then consider just banning his ISP. If he switches ISP just to get into your community, then let me know what your community is: coz I figure if it's that good, then I want to be in it too!

Re: Text Analysis Tools to compare Slinker and Stinker?
by zengargoyle (Deacon) on Jan 22, 2003 at 00:58 UTC

    i don't know if there is a module that does what you want, but it is relatively easy to check. i've read that even just checking the letter-pair distributions will give a good indication of whether a group of texts was written by the same person.

    I am not a troll.

    'I<space>' => 1
    '<space>a' => 2
    'am' => 1
    'm<space>' => 1
    '<space>n' => 1
    'no' => 1
    ...

    count the letter-pair distribution for a bunch of Stinker's and Slinker's posts, and a bunch of other users' posts. if you're lucky it will be obvious if they are truly the same person.
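The letter-pair count above could be sketched roughly as follows (in Python for brevity; in Perl it would be a hash keyed on two-character substrings). Comparing two count tables with cosine similarity is an assumption added here for illustration; the post does not say how the distributions should be compared.

```python
from collections import Counter
from math import sqrt

def letter_pairs(text):
    """Count overlapping two-character sequences, spaces included."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def pair_similarity(a, b):
    """Cosine similarity of two count tables (near 1.0 = same proportions)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    pairs = letter_pairs("I am not a troll.")
    print(pairs["I "], pairs[" a"], pairs["am"])
```

With real data you would build one table per author from many posts, then check whether the Stinker/Slinker score stands out from the scores between unrelated users.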

    let us know what you find out...;)

Re: Text Analysis Tools to compare Slinker and Stinker?
by Cody Pendant (Prior) on Jan 22, 2003 at 01:07 UTC
    Thanks for your comments, everyone.

    In terms of forgiveness, banning, etc., we didn't ban him for ever, just for a few weeks. That's our policy, like a cooling-off thing. We can't set custom cookies, it's too late for that, and the ISP he uses is huge. Believe me, we've considered those tech solutions.

    I'll check out the modules mentioned by mojotoad, but in the meantime I found Text::Document -- it will compute a JaccardSimilarity and a CosineSimilarity between two strings, plus giving word frequencies, but I don't know anything about what those terms mean. Can anyone enlighten me? If you compare two strings which are identical, they come out as 1. If there's no match at all, it's zero. But in the middle..?
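On what the two terms mean: in the textbook definitions, Jaccard similarity is the number of distinct words the two texts share divided by the number of distinct words in either, and cosine similarity treats each text as a vector of word counts and measures the angle between them. The sketch below (Python, hypothetical function names) follows those definitions; whether Text::Document computes exactly these formulas or weights terms differently is an assumption.

```python
from collections import Counter
from math import sqrt

def tokens(text):
    """Lower-case the text and split on whitespace."""
    return text.lower().split()

def jaccard_similarity(a, b):
    """Shared distinct words divided by all distinct words."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    # Two empty texts have no defined score; 0.0 is returned here by choice.
    return len(sa & sb) / len(union) if union else 0.0

def cosine_similarity(a, b):
    """Dot product of the word-count vectors over the product of their lengths."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    print(jaccard_similarity("the cat sat", "the cat ran"))
    print(cosine_similarity("the cat sat", "the cat ran"))
```

As for the middle of the range: a Jaccard score of 0.5 means half of all the distinct words seen appear in both texts. Cosine also takes word counts into account, so two texts that use the same words in the same proportions score close to 1 even if one is much longer.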

        Another way to uniquely ID him or her is to put a GUID in the query string that is passed on every request. May or may not be easy to set up. Can be obscured (no tampering) and not as easily removed (there's a button in most browsers to "remove cookies").

        This just occurred to me after seeing that zengargoyle uses Opera (see link above).

        John J Reiser
        newrisedesigns.com

Re: Text Analysis Tools to compare Slinker and Stinker?
by Anonymous Monk on Jan 22, 2003 at 03:06 UTC
    (Not really a Perl solution!) There was a publication somewhere recently which showed that by using Winzip or some similar compression package one could deduce which language a document was written in. In short, you compress a set of "reference" documents into an archive, and you repeat this for a number of different languages. You then add the unknown document to each language archive in turn and the one which increased in size the least should be the language of the unknown document (because most/all of the words in it will already be in the archive's dictionary table). It was suggested that this method could also be used to identify authorship -- I haven't used the method so don't know how well it works in practice. (Sorry, don't have a link for the original article). B
Re: Text Analysis Tools to compare Slinker and Stinker?
by dws (Chancellor) on Jan 22, 2003 at 03:23 UTC
    I'm hoping I can feed a module 1000 lines of Slinker and 1000 lines of Stinker and have it say something like "These two files have a Herzenberger-Foogenboogen Written English Similarity Rating of 97%".

    Two academics came up with a clever use of zip-based compression for doing this type of analysis. Their scheme, which they first developed to do automatic language detection, but which is also useful for determining authorship, is glossed over here.

    Basically, they noted that if you had a chunk of text from some author who was unknown, but who was a member of a known set, and if you had sample texts from each author in the set, you could concatenate the unknown text with text from each author, looking for the concatenation that compressed best.

    It's a clever approach, and easily implemented in Perl.
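A rough sketch of the idea, in Python with the standard `zlib` module rather than a zip tool: compress each author's sample alone, compress it again with the unknown text appended, and pick the author whose sample grows the least. The function names are hypothetical, and this is only an illustration of the scheme as described in this thread, not the academics' actual method.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the data after maximum-level deflate compression."""
    return len(zlib.compress(data, 9))

def extra_bytes(reference: str, unknown: str) -> int:
    """How much the compressed reference grows when the unknown text is appended."""
    ref = reference.encode("utf-8")
    both = ref + unknown.encode("utf-8")
    return compressed_size(both) - compressed_size(ref)

def closest_author(samples: dict, unknown: str) -> str:
    """Name of the author whose sample absorbs the unknown text most cheaply."""
    return min(samples, key=lambda name: extra_bytes(samples[name], unknown))

if __name__ == "__main__":
    samples = {
        "stinker": "the quick brown fox jumps over the lazy dog " * 20,
        "control": "lorem ipsum dolor sit amet consectetur adipiscing elit " * 20,
    }
    print(closest_author(samples, "the quick brown fox jumps over the lazy dog"))
```

One caveat: deflate only looks back 32 KB, so with long reference samples only the tail of each sample influences the result. Samples of similar length make the comparison fairer.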

      Yeah, that was the one. B.
Re: Text Analysis Tools to compare Slinker and Stinker?
by perrin (Chancellor) on Jan 22, 2003 at 03:34 UTC
Re: Text Analysis Tools to compare Slinker and Stinker?
by pg (Canon) on Jan 22, 2003 at 04:15 UTC
    To be frank, no linguistic analysis solution would be MEANINGFUL and HELPFUL in this case, regardless of whether we have a good linguistic analysis solution.

    Think about this at a higher level, and don't sink into technical details too quickly. This is actually a good example where TECHNOLOGY does not help with SOCIAL issues.

    Think about this: however perfect the analysis tool is, it would require a large amount of input to yield any MEANINGFUL result. The reality is that, if Slinker behaves in the same way as Stinker, whether or not they are one person, most likely you will have banned Slinker long before your tool gives you any MEANINGFUL result to JUSTIFY your decision.

    On the other hand, if Slinker behaves better, then even if your nice linguistic analysis tool figures out that Slinker is Stinker, there is still no JUSTIFIED reason for you to ban him. In this case, the only thing a technically capable tool does is create negative social feeling.

    Summary:

    Analysis tool says S == S:
    • Slinker behaves badly: it takes a large amount of data to analyze; most likely your emotions would help you make a decision much more quickly.
    • Slinker behaves well (I cannot use the word "better" here, as that is logically wrong unless Stinker is in fact Slinker): yes, he is the same person, but you don't have a reason to ban him; the revelation only affects everyone's feelings in a negative way.

    Analysis tool says S != S:
    • Slinker behaves badly: Slinker would still be banned, regardless of whether the result from your tool is correct; your bad feeling would take care of this.
    • Slinker behaves well: too obvious; the analysis is a total waste of time.

      It's not that I don't appreciate the effort, but I'm going to have to ask people to stop trying to help me with the social and administrative aspects of my problem, really.

      I won't explain the rules of the community involved, that would be silly. But if we were convinced that the two people were the same, action would be taken, that's all you need to know.

      If a text-analysis tool proved that the two had very similar writing styles, on a level where it was 1000-to-one that it was coincidental, then that would be considered proof.

      But, having used the Fathom module, see above, I've got nothing conclusive, I'm afraid. It's a very useful tool but hasn't proven or disproven anything. There are fewer differences between two randomly-chosen posters than between Slinker and Stinker, it turns out.

      Another angle of attack on this problem, which I hadn't thought of before, is mis-spellings -- Slinker has spelt "happening" as "happenning" twice, but Stinker gets it right every time...

        This is a great idea for a problem like yours. Combining readability, tuples, Fathom, etc. with misspellings (or is it mispellings or missspellings or ...), I wonder how successful a module for comparing two texts could be. I might take a look at that sometime in the next few weeks. I really think that misspellings might be a great key to comparing two texts. Judging from the above information, I'd have to guess that Stinker != Slinker. It would be unusually difficult to fix spellings just to get back into a web community. (IMHO)

        I am not fighting against you (I am 100% sincere), but did you realize that you are not actually trying to find a "good" tool, but a tool to "conclusively" confirm your guess, and to convince your community members and yourself to "believe" something you have already decided?

        No tool that goes against your guess would count as a good tool in this situation.

        I am just telling the truth, although it might be difficult to ... ;-)
        But, having used the Fathom module, see above, I've got nothing conclusive, I'm afraid. It's a very useful tool but hasn't proven or disproven anything. There are fewer differences between two randomly-chosen posters than between Slinker and Stinker, it turns out.

        Another angle of attack on this problem, which I hadn't thought of before, is mis-spellings -- Slinker has spelt "happening" as "happenning" twice, but Stinker gets it right every time...

        Leaving alone the issue of whether it is really worth it to spend a lot of time on this mystery, testing services have dealt with some aspects of your problem. Especially the personality tests where they ask you the same question in many slightly different ways and perform some kind of analysis to determine whether you are trying to spoof the test by appearing to be someone you are not.

        Your mention of a spelling discrepancy brought to mind a scene from The Princess Bride where Westley was to add poison to one of the drinks, and his adversary was to choose, after Westley had shifted (or not) the position of the glasses. The bad guy goes through a series of questions and answers trying to figure out Westley's thoughts -- "You placed the poisoned glass closer to me so I'd choose it. But I'm too smart for that, so it must be the one closest to you... But you knew I'd anticipate that move, so it must be the one closest to me after all." And so on for a few minutes of pretty funny dialogue. (I'm sure I got the details turned around, but you get the gist.)

        Is this guy deliberately misspelling a word or two just to throw you off? Does it really matter? It still boils down to a guess, doesn't it?

        Even after centuries of linguistic analysis, and lately with some fairly sophisticated computer analysis, scholars are still arguing whether Marlowe wrote the works attributed to Shakespeare, or whether Shakespeare was, indeed, Shakespeare.

        -----
        "Computeri non cogitant, ergo non sunt"

Re: Text Analysis Tools to compare Slinker and Stinker?
by Tommy (Chaplain) on Jan 22, 2003 at 06:29 UTC

    As an alternative to that (and I think the compression alternative is very cool too) I'd consider a mathematical analysis of compiled information on visitor behavior between two suspected identities. Variables to analyze could include:

    Technical
    • Click trails
    • Referring documents
    • IP (of course)
    • HTTP User Agent
    • Times of visits
    • email activity
    • account preference settings
    • passwords (you are the admin, are you not?)
    Psychological
    • topics of interest, forums/discussions of interest
    • social interaction with similar cliques, without regard to positive or negative tone. (criminal returns to the scene of the crime...)
    • takes the same kind of "flame bait"
    • takes the same stance on clear cut political / moral issues (abortion, death penalty, trees, the trinity, birth control, etc.)

    Just a few ideas among many. In a situation such as yours, I couldn't justify throwing so much effort away on garbage like this troll, but I'd try to K.I.S.S. as much as possible. It's more fun to foil with less force. That's why Perl is so cool, and so endearing to the lazy punmeister who'd like to brag (hubris) about how easy it was to triumphantly trounce the troll.

    --
    Tommy Butler, a.k.a. TOMMY
    
