PerlMonks  

(contest) Help analyze PM reputation statistics

by demerphq (Chancellor)
on Sep 14, 2004 at 18:08 UTC ( #390930=monkdiscuss )

Hi All.

A discussion in the CB led to me running a query to get counts of nodes by their reputation. The original idea was to see whether Selected Best Nodes was affecting the rep of the nodes involved. Whether it is or not isn't totally clear, but I thought this might be an interesting opportunity for the monks out there to show how clever they are in analyzing this data with Perl and producing output suitable for display in HTML form.

What I was thinking is that people should preferably reply with a little Perl script and its output showing some interesting analysis of the data. If you want to come up with funkier stuff that isn't suitable for HTML representation, then fine: post your code and a link.

Whoever comes up with the coolest analysis (as defined by whichever gods choose to help me judge) will win some sort of prize (as decided by the same group of gods mentioned before :-). The prize is just a little bonus; I figure there will be enough interest without it, but winning prizes always motivates people :-)

Here's the data in %hash (Rep=>Count) form. (If you decide to participate you can leave the hash of data out of your code submission; we will just assume it's the same as this one.)

my %rep_stats=(1=>24694, 2=>23551, 0=>22855, 3=>21340, 4=>19598, 5=>17779, 6=>16575, 7=>15200, 8=>13695, 9=>12824, 10=>11722, 11=>10545, 12=>9829, 13=>8860, 14=>8022, 15=>7363, 16=>6692, 17=>6044, 18=>5465, 19=>5040, -1=>4747, 20=>4575, 21=>4101, 22=>3910, 23=>3494, 24=>3135, 25=>2702, 26=>2631, 27=>2364, 28=>2233, -2=>2174, 29=>2053, 30=>1823, 31=>1775, 32=>1606, 33=>1530, 34=>1397, 35=>1269, -3=>1237, 36=>1151, 37=>1144, 38=>1059, 39=>980, 40=>961, -4=>898, 41=>885, 43=>833, 42=>761, 44=>686, 45=>685, 46=>663, 47=>652, 48=>589, 49=>564, -5=>551, 51=>494, 50=>478, 52=>474, 54=>444, -6=>444, 53=>429, 57=>400, 55=>393, 56=>355, 58=>322, 59=>321, 60=>310, 61=>286, -7=>283, 62=>266, -8=>261, 64=>243, 63=>223, 70=>216, 66=>215, 65=>212, 67=>211, -9=>206, 68=>194, -10=>190, 71=>180, 69=>176, 72=>173, 74=>163, -11=>154, 76=>148, 73=>141, 75=>138, 77=>135, -12=>134, 79=>120, 82=>120, -13=>114, 80=>109, 78=>106, -14=>106, 81=>100, 83=>92, -16=>90, 85=>88, 89=>87, 84=>82, 87=>74, 90=>74, 88=>74, 91=>72, 92=>71, 86=>68, -15=>66, 97=>63, 98=>60, 94=>59, -17=>57, -19=>55, 93=>50, 96=>49, 102=>48, 105=>47, 95=>47, 99=>46, -20=>46, -18=>44, 109=>41, 101=>39, 100=>37, 104=>36, 116=>36, 111=>34, 103=>34, -22=>33, 108=>32, 107=>31, 106=>30, 112=>28, 121=>27, 113=>25, 110=>24, -29=>23, 120=>23, 117=>22, -21=>22, 118=>21, 122=>21, -23=>19, 125=>19, 129=>18, 115=>18, 126=>18, 132=>17, 114=>16, 124=>16, 123=>16, 128=>16, -25=>15, 137=>15, -26=>14, -28=>13, 136=>13, -32=>12, -24=>12, 119=>12, -27=>12, 127=>12, 135=>11, 143=>11, -33=>11, 139=>11, 134=>10, -38=>10, -31=>10, 146=>10, 138=>10, 133=>10, -34=>9, 142=>9, 160=>9, 145=>8, 164=>8, 131=>8, -53=>7, 156=>7, 148=>7, 140=>7, 130=>7, 162=>7, 151=>6, 144=>6, -39=>6, -42=>6, -37=>6, 147=>6, 159=>6, -40=>6, -46=>5, 155=>5, 176=>5, 167=>5, -36=>5, 175=>5, 166=>5, -30=>5, 172=>4, 149=>4, 150=>4, 152=>4, 161=>4, 183=>4, -43=>4, 165=>4, -44=>4, 168=>4, 170=>4, 191=>4, 171=>4, -60=>3, -35=>3, 187=>3,
169=>3, 163=>3, 182=>3, -54=>3, -58=>3, -45=>3, 141=>3, -51=>3, 158=>2, -90=>2, 194=>2, 195=>2, 258=>2, 186=>2, 173=>2, 198=>2, 178=>2, 179=>2, -52=>2, -47=>2, -68=>2, 202=>2, -55=>2, -56=>2, 204=>2, -106=>2, 180=>2, 215=>2, 217=>2, 153=>2, 154=>2, 228=>2, 254=>2, -48=>1, -49=>1, 322=>1, 288=>1, -82=>1, -83=>1, -84=>1, -88=>1, 253=>1, 327=>1, 181=>1, 292=>1, 328=>1, 293=>1, 255=>1, 184=>1, 188=>1, -50=>1, -223=>1, -57=>1, 299=>1, -59=>1, -93=>1, 440=>1, 261=>1, 263=>1, 336=>1, 193=>1, 197=>1, 199=>1, -61=>1, -65=>1, -66=>1, -67=>1, 571=>1, 456=>1, 212=>1, 226=>1, 229=>1, 157=>1, 230=>1, 273=>1, 304=>1, 235=>1, 236=>1, 463=>1, 237=>1, 349=>1, 238=>1, 279=>1, -70=>1, -71=>1, 239=>1, -203=>1, 207=>1, 242=>1, 243=>1, 245=>1, 174=>1, -41=>1, );

Hope you all find this interesting, and remember XP and Node Rep are pretty meaningless things, especially when they don't mean anything. :-)


---
demerphq

    First they ignore you, then they laugh at you, then they fight you, then you win.
    -- Gandhi

    Flux8


Re: (contest) Help analyze PM reputation statistics
by VSarkiss (Monsignor) on Sep 14, 2004 at 18:53 UTC

    Whoever comes up with the coolest analysis ... will win some sort of prize
    Let me guess: the prize is more XP. ;-)

      If it means I get to see, "you have -4 XP to level vroom", I'm all for it.

      "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Re: (contest) Help analyze PM reputation statistics
by idsfa (Vicar) on Sep 14, 2004 at 18:59 UTC
        print <<EOH;
        Content-type: text/html

        <html><body>
        <a href="http://www.quotegarden.com/statistics.html">Results</a>
        </body></html>
        EOH

    If anyone needs me I'll be in the Angry Dome.

      Heh, nice link. :-) It's particularly amusing to me because that second quote is a reduced version of a quote I know very well. A friend of my family introduced me to computers at a very young age (or it was for the time, not nowadays of course) as part of his post-doc work in using computers as an educational medium. He used to have a line printer terminal in his office (where I got my one and only chance to mess with APL); on the wall in front of it he had a printout that said (figlet style)

      If you torture the data long enough it will confess to anything

      He's been a professor of psychology focusing on statistical methods of representing data for a long time, and I suppose that quote was something important to him. All I can say is it's a line I'll never forget.


      ---
      demerphq

        First they ignore you, then they laugh at you, then they fight you, then you win.
        -- Gandhi

        Flux8


        That is a great quote. I took some time to look, and it appears to originate from Ronald Coase, a British economist. So now you have an attribution. :-)

        Makeshifts last the longest.

Re: (contest) Help analyze PM reputation statistics
by TheEnigma (Pilgrim) on Sep 14, 2004 at 19:28 UTC
    and remember XP and Node Rep are pretty meaningless things, especially when they don't mean anything. :-)


    I don't know if it will be considered Bad Monk Behavior to post the following link, but I just posted this node in the discussion section that sheds a somewhat different light on the issue of Node Rep than has perhaps been seen before.

    TheEnigma

Re: (contest) Help analyze PM reputation statistics
by hardburn (Abbot) on Sep 14, 2004 at 19:58 UTC

    Here's one that figures out the total amount of node XP on PM, how many days it would take a Saint to get that many votes, and the position of the total number in pi (thanks to LWP::UserAgent and http://www.angio.net/pi/piquery).

        my $PI_URI = 'http://www.angio.net/pi/bigpi.cgi';

        # Fill this in with the rep stats hash from parent node
        my %REP_STATS = ( . . . );

        # Total XP is the sum of (reputation * number of nodes with it)
        my $total_rep;
        while( my ($key, $value) = each %REP_STATS ) {
            $total_rep += $key * $value;
        }

        my $saint_days = $total_rep / 40;

        my $in_pi = do {
            use LWP::UserAgent;
            my $ua = LWP::UserAgent->new;

            # Couldn't get proper param passing to work, so work-around
            # by making the query string ourselves.  Fix later.
            #
            my $response = $ua->get( $PI_URI . "?UsrQuery=$total_rep" );
            local $_ = $response->content;
            m!The string <B>\Q$total_rep\E</b> was found at position (\d+)!;
            $1;
        };

        print "Total XP: $total_rep\n";
        print "Saint Days: $saint_days\n";
        print "In pi at: $in_pi\n";

    For the record, I got:

        Total XP: 4068710
        Saint Days: 101717.75
        In pi at: 3246447

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Re: (contest) Help analyze PM reputation statistics
by blokhead (Monsignor) on Sep 14, 2004 at 20:01 UTC
    If we consider the node reputation as a random variable, we can perform some interesting analyses. The entropy of the reputation random variable is 5.27, meaning that the theoretical lower bound for storing the information contained in the node reputations is 5.27 bits per post. So when someone says "node reputation isn't worth 2 bits," they're wrong -- it is actually worth at least 5.27 bits. ;)

        use List::Util 'sum';

        my $sum = sum values %rep_stats;
        my $entropy = sum map { -($_/$sum) * log($_/$sum) / log(2) }
                          values %rep_stats;

        printf "Total entropy: %.05f\n", $entropy;

    An interesting statistic would be whether the entropy of the reputation random variable is going up or down over time. Then we could say whether node reputation was becoming more or less meaningful.

    blokhead

      Being that $NORM is also a measurement of the trend of node reputation, how is $entropy useful as an ancillary view of the same trend?

        $NORM measures the average reputation of recent nodes. It answers the question, "Are nodes rated high or low?"

        Entropy measures the information content of node reputation. It answers the questions, "How much does the node's reputation tell us? How meaningful is the assignment of reputation?"

        Say $NORM is 11. Well, this can happen if all recent nodes have reputation 11. If this is the case, the entropy is 0 because knowing that a node has reputation 11 tells us nothing about the node.

        On the other hand, maybe among all recent nodes, an equal number of them have reputation 1, 2, 3, .. up to 22. This situation also gives us $NORM = 11. But here, knowing the reputation of a node gives us much more information. Reputation in itself is more meaningful in this scenario because it can tell us something. The something it is telling us is information in the theoretical sense.

        $NORM tells us whether nodes are given high or low reputations on average (although the variance might be useful to know as well). It is an analysis of the values of a random variable. Entropy is completely orthogonal; independent of how high or low the nodes are ranked, it tells us how informative node reputation really is. It is an analysis of the uncertainty of a random variable. You can have any combination of low or high average with low or high entropy.
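
To make the two $NORM = 11 scenarios above concrete, here is a minimal sketch that computes both the mean and the entropy for each. It uses the same entropy formula as the snippet earlier in this post (rearranged slightly so a zero result prints without a minus sign). The helper names `mean_of` and `entropy_of` and the two toy histograms are made up for illustration, and the spread-out case uses reputations 0..22 rather than 1..22 so that the mean comes out at exactly 11.

```perl
use strict;
use warnings;
use List::Util 'sum';

# Mean reputation of a Rep=>Count histogram.
sub mean_of {
    my ($h) = @_;
    my $n = sum values %$h;
    return sum( map { $_ * $h->{$_} } keys %$h ) / $n;
}

# Shannon entropy (in bits) of a Rep=>Count histogram.
# p * log2(1/p) is the same quantity as -p * log2(p).
sub entropy_of {
    my ($h) = @_;
    my $n = sum values %$h;
    return sum map { ( $_ / $n ) * log( $n / $_ ) / log(2) } values %$h;
}

# Hypothetical data: every node has rep 11, versus reps spread evenly over 0..22.
my %all_eleven = ( 11 => 230 );
my %spread_out = map { $_ => 10 } 0 .. 22;

for my $case ( [ 'all 11' => \%all_eleven ], [ '0..22' => \%spread_out ] ) {
    my ( $name, $h ) = @$case;
    printf "%-8s mean %.2f  entropy %.3f bits\n",
        $name, mean_of($h), entropy_of($h);
}
```

Both histograms have the same mean, but the first carries 0 bits per node while the second carries log2(23) bits (about 4.52), which is the "orthogonal" point being made.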

        blokhead

Re: (contest) Help analyze PM reputation statistics
by tachyon (Chancellor) on Sep 15, 2004 at 06:13 UTC

    Here is a bit of code that does some basic graphing. Here is the graph. The moving averages all show the same thing. There is an unexpected data anomaly around 155-165 XP. Note I have jacked the moving averages up by 0.5 to separate them cleanly from the raw data.
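
The linked graphing code is not reproduced in this thread, so the following is only a rough sketch of what a simple moving average over the counts could look like; it is not tachyon's code. The `moving_average` helper and the window size of 3 are assumptions; the counts are the ones listed in the contest data for reputations 155 through 165.

```perl
use strict;
use warnings;
use List::Util 'sum';

# Simple moving average: one output value per full window of $window inputs.
sub moving_average {
    my ( $window, @values ) = @_;
    return map { sum( @values[ $_ .. $_ + $window - 1 ] ) / $window }
           0 .. @values - $window;
}

# Node counts for reputations 155..165, copied from the contest data.
my @counts   = ( 5, 7, 1, 2, 6, 9, 4, 7, 3, 8, 4 );
my @smoothed = moving_average( 3, @counts );

printf "%.2f\n", $_ for @smoothed;
```

A wider window smooths more but also blurs any local bump, so it is worth trying several sizes before reading much into an anomaly.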

      Nice job!

      But I cannot run it... I am on a winders machine, and ActiveState doesn't have the GD package for some reason.

      I also don't have a C compiler to build it.

      Oh well, I will have to be content to admire from afar...

      For now.
OK, here's your analysis (w/ picture!)
by tmoertel (Chaplain) on Sep 15, 2004 at 06:30 UTC
    "Use the right tool for the job," certainly applies here. Therefore, my Perl program is going to fire up R from The R Project for Statistical Computing, feed R our data, and run a regression analysis on it.

    Here's the code. (Note: I'm truncating the data for conciseness.)

        #!/usr/bin/perl -wl

        use strict;
        use File::Temp qw( tempfile );

        my %rep_stats=(1=>24694, 2=>23551, 0=>22855, ... -41=>1, );

        # only keep 0 < XP < 100 because we want the more mainstream
        # values and not the far-out ones

        my @xp_sorted = grep { $_>0 && $_<100 }
                        sort { $a <=> $b } keys %rep_stats;

        # generate our R commands (leading indentation is stripped;
        # the blank line after each data row ends that scan() input)

        (my $r_commands = <<EOF) =~ s/^[ \t]+//mg;
            xp <- scan()
            @xp_sorted

            count <- scan()
            @{[ @rep_stats{@xp_sorted} ]}

            summary(lm(log10(count) ~ xp + I(xp^2)))
        EOF

        # now, we stuff our R commands into a tempfile,
        # which we'll use as STDIN

        my $tmp = tempfile() or die "can't open tempfile: $!";
        print $tmp $r_commands;
        seek $tmp, 0, 0       or die "can't seek to BOF: $!";
        open STDIN, "<&", $tmp or die "can't dup tmp->STDIN: $!";

        # finally, we exec R, which will read our commands
        # from STDIN (the temp file will be deleted automatically
        # when the program exits)

        my @cmd = qw(R --no-save --no-init-file --no-restore-data --slave);
        exec @cmd;
        die "couldn't exec @cmd : $!";  # should never get here

    Now, let's run the above program and see the output:

        Read 99 items
        Read 99 items

        Call:
        lm(formula = log10(count) ~ xp + I(xp^2))

        Residuals:
              Min        1Q    Median        3Q       Max
        -0.111459 -0.032358  0.004959  0.024387  0.109470

        Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
        (Intercept)  4.442e+00  1.239e-02  358.49   <2e-16 ***
        xp          -4.194e-02  5.719e-04  -73.33   <2e-16 ***
        I(xp^2)      1.467e-04  5.541e-06   26.48   <2e-16 ***
        ---
        Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

        Residual standard error: 0.04027 on 96 degrees of freedom
        Multiple R-Squared: 0.9975,     Adjusted R-squared: 0.9974
        F-statistic: 1.889e+04 on 2 and 96 DF,  p-value: < 2.2e-16

    Woohoo! It looks like we have a good fit. Converting our fitted model into a Perl function that estimates the count of nodes with a given XP, we get the following:

        sub estimate_count_from_xp($) {
            my $xp = shift;
            10 ** ( 4.442 - 4.194e-2 * $xp + 1.467e-4 * $xp**2 );
        }

    (Because I fitted the model against log10(count), we had to exponentiate the resulting formula to get an estimation function for count.)

    Just to see how good our model is, take a look at this plot comparing the actual values (dots) versus the estimated values (line). That's pretty much "on the money."
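
Since the plot itself is not reproduced here, a quick numeric spot check may help. The sketch below compares the fitted function from this post against a handful of actual counts copied from the contest data; the choice of XP values (1, 10, 50, 90) is arbitrary and only for illustration.

```perl
use strict;
use warnings;

# Fitted model from the post above (quadratic in XP on a log10 scale).
sub estimate_count_from_xp {
    my $xp = shift;
    return 10 ** ( 4.442 - 4.194e-2 * $xp + 1.467e-4 * $xp**2 );
}

# A few actual (XP => count) pairs copied from the contest data.
my %actual = ( 1 => 24694, 10 => 11722, 50 => 478, 90 => 74 );

for my $xp ( sort { $a <=> $b } keys %actual ) {
    my $est = estimate_count_from_xp($xp);
    printf "XP %2d: actual %5d, estimated %8.1f, ratio %.2f\n",
        $xp, $actual{$xp}, $est, $est / $actual{$xp};
}
```

A ratio near 1 means the estimate is close; the regression's largest residual of about 0.11 on the log10 scale corresponds to being off by a factor of roughly 1.3 at worst within the fitted range.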

    Cheers,
    Tom

     

    Tom Moertel : Blog / Talks / LectroTest / PXSL / Coffee / Movie Rating Decoder

      ++ for the technique. But the result seems pretty strange - the 1.467e-4 * $xp**2 is the dominant part in the exponent so it seems that the count should grow with the xp from some point.
        You're right that the quadratic term will, eventually, dominate. However, for the range I considered (0 < XP < 100), adding that term results in a slightly better fit.
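
As a rough check on where "eventually" begins, the exponent of the quadratic model is a parabola in XP, so it bottoms out where its derivative is zero, at XP = 4.194e-2 / (2 * 1.467e-4). The sketch below just does that arithmetic with the coefficients from the regression output; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Coefficients of log10(count) = $b0 + $b1 * xp + $b2 * xp**2, from the fit.
my ( $b0, $b1, $b2 ) = ( 4.442, -4.194e-2, 1.467e-4 );

sub estimate_count_from_xp {
    my $xp = shift;
    return 10 ** ( $b0 + $b1 * $xp + $b2 * $xp**2 );
}

# The exponent is smallest where its derivative $b1 + 2 * $b2 * xp is zero.
my $turning_xp = -$b1 / ( 2 * $b2 );

printf "Model estimates start rising again above XP %.1f\n", $turning_xp;
printf "Estimate at XP 143: %.2f, at XP 300: %.2f\n",
    estimate_count_from_xp(143), estimate_count_from_xp(300);
```

The turning point lands at roughly XP 143, well outside the 0 < XP < 100 range the model was fitted on, which is why the upturn does not show in the fit.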

        But, the fit is nearly as good without it, and so for interpretive purposes (instead of get-the-best-fit purposes), dropping the quadratic term makes for a better model:

            Read 99 items
            Read 99 items

            Call:
            lm(formula = log10(count) ~ xp)

            Residuals:
                 Min       1Q   Median       3Q      Max
            -0.16823 -0.10095 -0.01733  0.07757  0.25553

            Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
            (Intercept)  4.194851   0.023379  179.43   <2e-16 ***
            xp          -0.027268   0.000406  -67.17   <2e-16 ***
            ---
            Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

            Residual standard error: 0.1154 on 97 degrees of freedom
            Multiple R-Squared: 0.979,      Adjusted R-squared: 0.9787
            F-statistic:  4512 on 1 and 97 DF,  p-value: < 2.2e-16

        With this model, our estimating function is as follows:

            sub estimate_count_from_xp($) {
                my $xp = shift;
                # slope is the fitted xp coefficient, -0.027268, rounded
                10 ** ( 4.195 - 0.02727 * $xp );
            }

        From this, it's easy to see that we have classic exponential decay w.r.t. XP.

        Does this match your intuition?
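
One way to put a number on that decay: with a slope of -0.027268 on the log10 scale, each extra point of XP multiplies the expected node count by 10 ** -0.027268, and the count halves every log10(2) / 0.027268 XP. The sketch below only does that arithmetic on the fitted slope; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Slope of log10(count) per point of XP, from the linear regression output.
my $slope = -0.027268;

# Multiplicative change in expected count for each additional point of XP.
my $ratio_per_xp = 10 ** $slope;

# XP distance over which the expected count halves: log10(2) / |slope|.
my $halving_xp = ( log(2) / log(10) ) / -$slope;

printf "Each extra XP point keeps %.1f%% of the nodes\n", 100 * $ratio_per_xp;
printf "The node count halves about every %.1f XP\n", $halving_xp;
```

So under the linear model, roughly 6% fewer nodes sit at each successive reputation value, and the count halves about every 11 XP within the fitted range.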

Re: (contest) Help analyze PM reputation statistics
by ambrus (Abbot) on Sep 15, 2004 at 19:09 UTC

    To make some even more interesting analysis, it would be good to have the same data broken down to sections and node types.

Re: (contest) Help analyze PM reputation statistics
by CountZero (Chancellor) on Sep 15, 2004 at 21:00 UTC
    I didn't use much Perl, other than to reformat the data so I could load it into Excel.

    Average node value is 11.87. The median value (as many nodes have lower values as have higher values) is between 7 and 8. 90% of the nodes have a value between 0 and 40 XP and 99% of the nodes have a value between -8 and 90. Both ranges are central ranges within the extremes of -223 and 571 XP, i.e. 5% of the nodes have an XP lower than 0 or higher than 40, and likewise 0.5% are worse than -8 or better than 90.

    The XP distribution is not a standard Bell shaped Gaussian distribution, but something which peaks around the average value and quickly drops down to low values, with long low tails to lower and higher values.
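
The figures above came out of Excel, but the same kind of summary can be computed in Perl straight from a Rep=>Count hash. The following is a minimal sketch under assumptions: the helper names `weighted_mean` and `quantile` are made up, the quantile is the simple "smallest rep whose cumulative count reaches the target" definition (Excel may interpolate differently), and `%sample` is a tiny hypothetical histogram standing in for the real %rep_stats.

```perl
use strict;
use warnings;
use List::Util 'sum';

# Mean reputation, weighting each rep value by how many nodes have it.
sub weighted_mean {
    my ($h) = @_;
    my $n = sum values %$h;
    return sum( map { $_ * $h->{$_} } keys %$h ) / $n;
}

# Smallest reputation at which the cumulative node count reaches $q of the total.
sub quantile {
    my ( $h, $q ) = @_;
    my $n    = sum values %$h;
    my $seen = 0;
    for my $rep ( sort { $a <=> $b } keys %$h ) {
        $seen += $h->{$rep};
        return $rep if $seen >= $q * $n;
    }
    return;
}

# Hypothetical stand-in data; substitute the real %rep_stats here.
my %sample = ( -1 => 1, 0 => 2, 1 => 4, 2 => 2, 5 => 1 );

printf "mean %.2f, median %d, central 90%% range %d .. %d\n",
    weighted_mean( \%sample ),
    quantile( \%sample, 0.50 ),
    quantile( \%sample, 0.05 ),
    quantile( \%sample, 0.95 );
```

Passing 0.005 and 0.995 instead gives the corresponding central 99% range.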

    What does it tell us:

    • average and median are close together and most of the nodes cluster around these values: most of the nodes are thus of average value (for whatever meaning you may give to average).
    • Perlmonks are quicker to give positive XP than negative XP as 95% of the nodes have no negative XP.
    • Relatively speaking few nodes are very bad or very good (only 1% is worse than -8 or better than 90), but even here we are quicker to praise than to chastise (the positive tail goes much higher than the negative tail goes deeper, for an equal number of nodes).
    Does this mean that most of the nodes are well written or that generally the Monastery is easy on the authors of nodes? That alas is something these figures cannot tell us.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      I don't think you need statistics to draw that conclusion for you. Personally I feel safe to assume from my experience with the site generally being pleasant and informative that it is because nodes tend to be well written.

      Or rather, I should say, not bad. As you saw from your analysis, the overwhelming majority of nodes end up with a handful of upvotes. To state it in a cold and detached way, that means that the average node is tendentially better than entirely useless.

      I like Perlmonks. It's a comfortable community of geeks.

      Makeshifts last the longest.

Re: (contest) Help analyze PM reputation statistics
by Limbic~Region (Chancellor) on Sep 28, 2004 at 13:35 UTC
    demerphq,
    After a recent conversation in the CB, I (as do jdporter, blokhead, and likely others) feel it would be much better if the dump included:
    • Node Rep
    • Node Type
    • Create Time
    • Front Paged
    For instance, just today I was wondering if $NORM would change dramatically if things like standard deviation and variance were taken into account.
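
For reference, variance and standard deviation can already be computed from a Rep=>Count histogram without per-node records. This is a minimal sketch of that calculation only; it makes no assumption about how $NORM itself is derived, the `variance_of` helper name is made up, and `%sample` is a small hypothetical histogram standing in for real data.

```perl
use strict;
use warnings;
use List::Util 'sum';

# Population variance of a Rep=>Count histogram: E[x^2] - (E[x])^2.
sub variance_of {
    my ($h) = @_;
    my $n       = sum values %$h;
    my $mean    = sum( map { $_ * $h->{$_} } keys %$h ) / $n;
    my $mean_sq = sum( map { $_**2 * $h->{$_} } keys %$h ) / $n;
    return $mean_sq - $mean**2;
}

# Hypothetical stand-in data; substitute a real Rep=>Count hash here.
my %sample = ( -1 => 1, 0 => 2, 1 => 4, 2 => 2, 5 => 1 );

my $variance = variance_of( \%sample );
printf "variance %.3f, standard deviation %.3f\n", $variance, sqrt $variance;
```

With create times in the dump, the same function could be run per time bucket to see whether the spread changes over time.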

    Cheers - L~R

      I don't get it. You want a list of this information for all ~400k records we have in the DB? How is that supposed to happen?


      ---
      demerphq

        First they ignore you, then they laugh at you, then they fight you, then you win.
        -- Gandhi

        Flux8


        demerphq,
        I can see how making this information available offsite in a database (as simple as SQLite, for instance) could be a policy problem. With a couple of slight modifications, it might be a bit more feasible:
        • Forget the frontpaged flag
        • Break things out in buckets by day, such as 2004-12-25
        • The next level of buckets would be node type
        • The last level of buckets would be reps and corresponding counts
        I was expecting to do the breakout work off-site though.
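
The bucket layout described in the list above maps naturally onto a nested hash: day, then node type, then rep, then count. The sketch below shows one possible shape for it; the records in `@nodes`, the node type names, and the output format are all hypothetical, since no such dump exists in this thread.

```perl
use strict;
use warnings;

# Hypothetical per-node records: [ create day, node type, reputation ].
my @nodes = (
    [ '2004-12-25', 'note',         3  ],
    [ '2004-12-25', 'note',         3  ],
    [ '2004-12-25', 'perlquestion', 12 ],
    [ '2004-12-26', 'note',         -1 ],
);

# $buckets{day}{type}{rep} = number of nodes (autovivified as we count).
my %buckets;
$buckets{ $_->[0] }{ $_->[1] }{ $_->[2] }++ for @nodes;

for my $day ( sort keys %buckets ) {
    for my $type ( sort keys %{ $buckets{$day} } ) {
        my $reps = $buckets{$day}{$type};
        print "$day $type: ",
            join( ', ', map { "$_=>$reps->{$_}" } sort { $a <=> $b } keys %$reps ),
            "\n";
    }
}
```

Each innermost hash has the same Rep=>Count form as the contest data, so analyses written for %rep_stats could be applied per day or per node type without changes.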

        Cheers - L~R

Node Type: monkdiscuss [id://390930]
Approved by atcroft
Front-paged by FoxtrotUniform