http://www.perlmonks.org?node_id=710406

Are you addicted to the Chatterbox? Do you seek not-so-professional advice on how to avoid seeking professional advice for your addiction? Do you need continual validation of your addiction through meaningless stats? Do you seek mastery of your addiction by being the strongest, fastest, best? Are you hoping that your addiction is going to become an Olympic event? Are you going through withdrawal because mojotoad's CB stats haven't been updated in three weeks now? (Join the club.)

Well, do I have an answer for you! It's the beta program of a NEW and IMPROVED (how can something be both new AND improved?) CB stats! It still has some stability issues, but those who just need a fix for their craving can find one here! It's similar to the CB stats you're used to, but with some minor improvements (and deprovements) (which is fitting for depravity).

Well, you really shouldn't have clicked on that "readmore" button, but since you have (maybe implicitly by having it always displayed), you are obviously in the target audience. So let's get down to some details. First off, the warnings:

  • This will probably make your addiction worse.
  • The URL is not finalised. I'm hoping to move it to http://tanktalus.perlmonk.org, but Petruchio is currently having some fun trying to work on some background issues before that will work. So, in the meantime, I'm putting it on my ISP's web service. Update: the URL is now finalised. Thanks, Petruchio!
  • The data gathering is fragile. Right now, when you type something into your CB client, it has to get to the perlmonks database (a path which we presume is working), then ambrus' cbstream needs to pick it up, send it to whatever FreeNode IRC server it's connected to, bounce across the servers to whichever IRC server I am connected to, and then my Xchat script can log it. The fragile parts include IRC (netsplits--) and Xchat (if I'm restarting X, that has to come down, and I need to restart X at some point soon to take advantage of a new X server and new version of KDE). Once ambrus' cbstream server is up and running, I'll be able to rewrite the data gathering to be backgroundable, and then it will only be affected by a full kernel upgrade instead of just fiddling with X. Update: though it's backgrounded now via IRC, I'm still investigating other approaches that are likely to be less fragile than relying on FreeNode to never have any netsplits.

Next, some definitions. As I wrote this thing from scratch basically, I had full control over what constituted what. So I thought I'd share them so that you can fully exploit your addiction. The following is a rough approximation of the code I'm using for parsing right now, subject to change.

    # some regex's I use in multiple places.
    our $user = qr{
        \[([^\]\s][^\]]+)\]
        |
        \[\s\Qhttp://(?:www.)perlmonks.(?:org|com)/?node(?:_id)=\E([^\s;&=]+)\s\|
    }x;
    my $aggress_user = qr{(?:
        ([^[]\S+)
        |
        $user
    )}x;
    my $aggress = qr{
        /me\s+(?:slaps?|hits?|strikes?|kicks?|throws?\b.*?\bat)\s+$aggress_user
    }x;
    #...
    for my $test (
        [ question => qr/\?(?:\s|$)/ ],
        [ yell => qr/\!(?:\s|$)/ ],
        [ aggressor => sub {
              if (/$aggress/) {
                  require URI::Escape;
                  my $user = URI::Escape::uri_unescape($+);
                  # make sure the user exists...
                  $user = CBStats::UserR::fetch($user);
                  $user && $user->nodeid() > 0;
              } else {
                  return 0;
              }
          } ],
        [ happy => qr/(?:^|\s|\b)[:;B8]-?[)D}P>]+|[(]-?[:;](?:$|\s|\b)/ ],
        [ sad => qr/(?:^|\s|\b):['`]?-?\(+|[)]-?['`]?[:](?:$|\s|\b)/ ],
        [ thought => qr/\.oO\s*\(.*\)/ ],
        [ action => qr/\/me/ ],
        [ aggressee => sub {
              if (/$aggress/) {
                  require URI::Escape;
                  my $user = URI::Escape::uri_unescape($+);
                  # make sure the user exists...
                  $user = CBStats::UserR::fetch($user);
                  $user && $user->nodeid() > 0 ? $user->nick() : "";
              } else {
                  ""
              }
          } ],
        [ words => sub {
              #require Text::ParseWords;
              my @x = split ' ', $_;
              scalar @x;
          } ],
        [ soliliquay => sub {
              my $prev = $self->find_where('MSGID IN (SELECT MAX(MSGID) FROM LOGS WHERE MSGID < ?)', $self->msgid());
              if ($prev) {
                  $prev->from() eq $self->from() ? ($prev->soliliquay() || 0) + 1 : 1;
              } else {
                  undef
              }
          } ],
    ) {
        my ($action, $check) = @$test;
        if (not defined $self->$action()) {
            $self->$action(
                ref $check eq 'Regexp' ? (/$check/ ? 1 : 0)
                : ref $check eq 'CODE' ? $check->()
                : $check
            );
        }
    }

    # Also ... karma is found via:
    qr/ ^$user(\+\+|\-\-) \s* \#?\s*(.*\S) /x

This is here as an explanation of the "big numbers" shown in the stats. The output is trivially derived from the above true/false designations (well, mostly true/false). Well, trivial for a human, but some of these got to be some very complex subqueries in SQL. Of course, if someone wants a change to the above, please let me know.
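
To see how the designations fall out for a given line, here is a minimal standalone sketch. It reuses a few of the patterns above but makes some assumptions of its own: classify() is a hypothetical helper, only the bracketed form of $user is kept, and the database lookup that confirms the target is a real monk is skipped, so any nick counts.

```perl
use strict;
use warnings;

# Simplified copies of the post's patterns. Assumption: no database lookup,
# so every target nick is treated as an existing user.
my $user = qr{ \[ ([^\]\s][^\]]+) \] }x;
my $aggress = qr{
    /me \s+ (?: slaps? | hits? | strikes? | kicks? | throws?\b.*?\bat )
    \s+ (?: ([^\[]\S+) | $user )
}x;

# Test table in the same shape as the post: [ name => regex-or-coderef ].
my @tests = (
    [ question  => qr/\?(?:\s|$)/ ],
    [ yell      => qr/\!(?:\s|$)/ ],
    [ happy     => qr/(?:^|\s|\b)[:;B8]-?[)D}P>]+|[(]-?[:;](?:$|\s|\b)/ ],
    [ action    => qr/\/me/ ],
    # $1 is the bare-word target, $2 the [bracketed] one.
    [ aggressee => sub { /$aggress/ ? (defined $1 ? $1 : $2) : "" } ],
    [ words     => sub { my @w = split ' ', $_; scalar @w } ],
);

# classify() is a hypothetical helper, not part of the original code.
sub classify {
    my ($line) = @_;
    my %flags;
    for ($line) {    # alias the line to $_ so the tests can match against it
        for my $test (@tests) {
            my ($name, $check) = @$test;
            $flags{$name} = ref $check eq 'Regexp' ? (/$check/ ? 1 : 0)
                          : $check->();
        }
    }
    return \%flags;
}

for my $line ('Is anyone here?', '/me throws kisses at [anybody]') {
    my $flags = classify($line);
    print "$line\n";
    print "  $_ => '$flags->{$_}'\n" for sort keys %$flags;
}
```

Note that a friendly "/me throws kisses at [anybody]" still yields an aggressee of "anybody", because the pattern only looks at the verb and the target.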

The process model is moderately convoluted:

  1. The data gathering is done "live." This just records the raw data from its source (currently IRC).
  2. Data transformation. There is a cron job running every 6 minutes to see what has transpired recently, and run the above code. Note that the user fetch function includes querying PM if I've never seen that nick before (I'm presuming that nicks never change node ids here). Once I have a cache of a user, I can also set the user to be hidden so they don't show up in the stats (this convolutes the SQL like I would never have thought possible). This is just perl, and should only need updating if I want new data.
  3. Output. There is a cron job running every hour to run a couple dozen SQL statements in a Template Toolkit plugin (via ttree), and then upload the resulting html file to the destination. This job also will prune anything over a week old (I hope - I don't have a week's worth of data yet, but that's why the "earliest" and "latest" timestamps are showing on the stats page), and run the data transformation again (just in case there's data from between the last transformation cron run and now).
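
The weekly pruning in the output step can be sketched in plain Perl. This is only an illustration: the real job does it in SQL against DB2, and prune_old(), the row layout, and the fixed "now" value are all assumptions made for the example.

```perl
use strict;
use warnings;
use List::Util qw(min max);

use constant WEEK => 7 * 24 * 60 * 60;    # seconds in a week

# prune_old() is a hypothetical helper; it keeps rows newer than one week.
sub prune_old {
    my ($rows, $now) = @_;
    return [ grep { $_->{epoch} >= $now - WEEK } @$rows ];
}

my $now  = 1_221_069_600;    # arbitrary fixed "now" so the example is repeatable
my @rows = (
    { msgid => 1, epoch => $now - 8 * 24 * 3600 },    # eight days old: pruned
    { msgid => 2, epoch => $now - 3 * 24 * 3600 },
    { msgid => 3, epoch => $now - 60 },
);

my $kept     = prune_old(\@rows, $now);
my $earliest = min map { $_->{epoch} } @$kept;    # the "earliest" timestamp
my $latest   = max map { $_->{epoch} } @$kept;    # the "latest" timestamp

printf "kept %d rows, earliest %d, latest %d\n", scalar @$kept, $earliest, $latest;
```
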
A little expansion on a point: if you don't want your nick to show up on the stats, despite this all being for fun, simply let me know. I have an entry in a table for this express purpose, and already have 3 people on the list. Your lines will STILL be counted in the aggregate, but your name won't show up. That means that if you wrote 200 lines in the week, it'll still affect the "most active times" and any monk or user references you make will still be added into the total numbers. However, if you were the last user to mention "[bart]", as an example, the name that will show up in the stats will actually be the PREVIOUS person to mention "[bart]". Well, the previous non-hidden user anyway. Which, of course, means that the stats are invalid. But that just means that we already know they're invalid instead of pretending otherwise ;-)
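
A rough in-memory sketch of that hidden-user rule follows. The nicks, the log layout, and last_visible_mention() are made up for illustration; the real thing is a set of SQL subqueries.

```perl
use strict;
use warnings;

# Hypothetical chat lines in chronological order, plus a set of hidden nicks.
my @log = (
    { from => 'alice', text => 'ask [bart] about that' },
    { from => 'bob',   text => '[bart] knows regexes' },
    { from => 'carol', text => 'thanks, [bart]!' },
);
my %hidden = ( carol => 1 );

# last_visible_mention() is an illustrative helper, not the author's SQL.
# It walks the log backwards and skips lines written by hidden users.
sub last_visible_mention {
    my ($nick, $log, $hidden) = @_;
    for my $line (reverse @$log) {
        next if $hidden->{ $line->{from} };
        return $line->{from} if index($line->{text}, "[$nick]") >= 0;
    }
    return;
}

# Aggregate counts still include lines from hidden users.
my $total_mentions = grep { index($_->{text}, '[bart]') >= 0 } @log;

my $shown = last_visible_mention('bart', \@log, \%hidden);
print "mentions of [bart]: $total_mentions, last shown mentioner: $shown\n";
```

Here carol made the most recent mention, but because she is hidden the stats would credit bob, while the total still counts all three lines.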

The backend is DB2. Why? Because a) it's probably faster than DBD::CSV ;-), b) it's what we use at $work, and, most importantly, c) the point of these statistics was NOT the generation of the statistic, but to learn RDBMS tools and techniques, and especially to learn some more complex SQL. I'd say it's been a resounding success on the last point, even if the rest of the system falls over tomorrow.

So, what to do next? I'm hoping for two things: 1) perlmonk.org issues to be resolved so I can move the URL (*DONE*), and 2) a more stable CB feed that doesn't chew up more PM resources at which point I can remove my dependency on X (dependency removed, stability still being worked on). Once the first issue is resolved, I'll send a private message to SiteDocClan to get the site FAQ updated to the new stats page (sorry, mojotoad) (*DONE*).

Update: I should point out that when I query to figure out who is the "top" of each category, if there is actually a tie, I favour the newest user. That means that if two people have 87 messages over the last week, the one who joined more recently (i.e., has a higher node ID on perlmonks) gets the higher rank. OTOH, if two people have tied for attacking others, the one with the higher node ID will be given the benefit of the (relative) inexperience and get the lower rating (which could push him/her off the list of two). This may change to the latest case (i.e., the last smiley, the last post, the last attack, whatever).
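
The tie-breaking described above amounts to a two-key sort. A minimal sketch in Perl, with invented nicks, node IDs, and counts (the actual ranking is done in SQL):

```perl
use strict;
use warnings;

# Hypothetical users; a higher nodeid means the monk joined more recently.
my @monks = (
    { nick => 'oldmonk', nodeid => 1000,   messages => 87, attacks => 4 },
    { nick => 'newmonk', nodeid => 700000, messages => 87, attacks => 4 },
    { nick => 'quiet',   nodeid => 5000,   messages => 12, attacks => 0 },
);

# Most messages first; on a tie, the higher node ID (newer user) ranks higher.
my @by_messages = sort {
    $b->{messages} <=> $a->{messages} or $b->{nodeid} <=> $a->{nodeid}
} @monks;

# Most attacks first; on a tie, the higher node ID gets the LOWER rating.
my @by_attacks = sort {
    $b->{attacks} <=> $a->{attacks} or $a->{nodeid} <=> $b->{nodeid}
} @monks;

print "top talker:   $by_messages[0]{nick}\n";
print "top attacker: $by_attacks[0]{nick}\n";
```

With these numbers the newer user tops the message ranking, while the older user tops the attack ranking.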

Update2: Changed the data gathering, but only slightly. (Thanks, ambrus for the base code.)

Update3: Changed the URL. Future changes will be noted on that site, not in this node.

Re: Chatterbox Addicts not-so-anonymous
by jdporter (Paladin) on Sep 10, 2008 at 18:04 UTC

    I'm not necessarily suggesting you change your architecture, but perhaps more for the benefit of others who might be considering a similar undertaking (a passive CB bot): Rather than rely on IRC (let alone X, which clearly shouldn't be necessary for a task like this), I recommend horking CB content from one of the "cb history" sites, i.e. cb60 (and as backup, cbhistory). Just hork and parse, and you're done. (I can provide code for this, if anybody wants it.) The major downside is that there can be a noticeable delay between when someone talks in the cb and when the stats are updated to reflect it. If lower delay times are needed, one can of course fetch the cb xml feed.

    Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.
Re: Chatterbox Addicts not-so-anonymous
by jvector (Friar) on Sep 10, 2008 at 18:49 UTC
    nice! I will have to remember never to say
    /me throws qr{kisses|lifeline|money|self} at [anybody]
    'cos it would be misinterpreted as being aggressive.. B-/
Re: Chatterbox Addicts not-so-anonymous
by Limbic~Region (Chancellor) on Sep 10, 2008 at 23:44 UTC