PerlMonks
OK, here's your analysis (w/ picture!)

by tmoertel (Chaplain)
on Sep 15, 2004 at 06:30 UTC


in reply to (contest) Help analyze PM reputation statistics

"Use the right tool for the job," certainly applies here. Therefore, my Perl program is going to fire up R from The R Project for Statistical Computing, feed R our data, and run a regression analysis on it.

Here's the code. (Note: I'm truncating the data for conciseness.)

    #!/usr/bin/perl -wl

    use strict;
    use File::Temp qw( tempfile );

    my %rep_stats = (
         1  => 24694,
         2  => 23551,
         0  => 22855,
        # ... (remaining data elided for conciseness)
        -41 => 1,
    );

    # only keep 0 < XP < 100 because we want the more mainstream
    # values and not the far-out ones
    my @xp_sorted = grep { $_ > 0 && $_ < 100 }
                    sort { $a <=> $b } keys %rep_stats;

    # generate our R commands
    (my $r_commands = <<EOF) =~ s/^ //mg;
     xp <- scan()
     @xp_sorted

     count <- scan()
     @{[ @rep_stats{@xp_sorted} ]}

     summary(lm(log10(count) ~ xp + I(xp^2)))
    EOF

    # now, we stuff our R commands into a tempfile,
    # which we'll use as STDIN
    my $tmp = tempfile() or die "can't open tempfile: $!";
    print $tmp $r_commands;
    seek $tmp, 0, 0 or die "can't seek to BOF: $!";
    open STDIN, "<&", $tmp or die "can't dup tmp->STDIN: $!";

    # finally, we exec R, which will read our commands
    # from STDIN (the temp file will be deleted automatically
    # when the program exits)
    my @cmd = qw(R --no-save --no-init-file --no-restore-data --slave);
    exec @cmd;
    die "couldn't exec @cmd : $!";  # should never get here
Now, let's run the above program and see the output:
    Read 99 items
    Read 99 items

    Call:
    lm(formula = log10(count) ~ xp + I(xp^2))

    Residuals:
          Min        1Q    Median        3Q       Max
    -0.111459 -0.032358  0.004959  0.024387  0.109470

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  4.442e+00  1.239e-02  358.49   <2e-16 ***
    xp          -4.194e-02  5.719e-04  -73.33   <2e-16 ***
    I(xp^2)      1.467e-04  5.541e-06   26.48   <2e-16 ***
    ---
    Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

    Residual standard error: 0.04027 on 96 degrees of freedom
    Multiple R-Squared: 0.9975,     Adjusted R-squared: 0.9974
    F-statistic: 1.889e+04 on 2 and 96 DF,  p-value: < 2.2e-16
Woohoo! It looks like we have a good fit. Converting our fitted model into a Perl function that estimates the count of nodes with a given XP, we get the following:
    sub estimate_count_from_xp($) {
        my $xp = shift;
        10 ** ( 4.442 - 4.194e-2 * $xp + 1.467e-4 * $xp**2 );
    }
(Because I fitted the model against log10(count), I had to exponentiate the resulting formula to get an estimation function for count itself.)
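As a quick sanity check, here is a sketch that compares the model's estimates against the two raw counts quoted at the top of the program (XP 1 and 2; the `%actual` hash here is mine, holding just those two values):

```perl
#!/usr/bin/perl -wl
use strict;

# the fitted quadratic model from the regression above
sub estimate_count_from_xp($) {
    my $xp = shift;
    10 ** ( 4.442 - 4.194e-2 * $xp + 1.467e-4 * $xp**2 );
}

# spot-check against the raw counts quoted in %rep_stats
my %actual = ( 1 => 24694, 2 => 23551 );
for my $xp ( sort { $a <=> $b } keys %actual ) {
    my $est = estimate_count_from_xp($xp);
    printf "XP %2d: actual %5d, estimated %5.0f (%+.1f%%)\n",
        $xp, $actual{$xp}, $est,
        100 * ( $est - $actual{$xp} ) / $actual{$xp};
}
```

Both estimates land within a few percent of the observed counts, consistent with the R-squared of 0.9975 reported above.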

Just to see how good our model is, take a look at this plot comparing the actual values (dots) versus the estimated values (line). That's pretty much "on the money."

Cheers,
Tom

 

Tom Moertel : Blog / Talks / LectroTest / PXSL / Coffee / Movie Rating Decoder

Replies are listed 'Best First'.
Re: OK, here's your analysis (w/ picture!)
by zby (Vicar) on Sep 15, 2004 at 07:08 UTC
    ++ for the technique. But the result seems pretty strange: the 1.467e-4 * $xp**2 term eventually dominates the exponent, so it seems the count should start growing with XP from some point on.
      You're right that the quadratic term will, eventually, dominate. However, for the range I considered (0 < XP < 100), adding that term results in a slightly better fit.

      But, the fit is nearly as good without it, and so for interpretive purposes (instead of get-the-best-fit purposes), dropping the quadratic term makes for a better model:

      Read 99 items
      Read 99 items

      Call:
      lm(formula = log10(count) ~ xp)

      Residuals:
           Min       1Q   Median       3Q      Max
      -0.16823 -0.10095 -0.01733  0.07757  0.25553

      Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
      (Intercept)  4.194851   0.023379  179.43   <2e-16 ***
      xp          -0.027268   0.000406  -67.17   <2e-16 ***
      ---
      Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

      Residual standard error: 0.1154 on 97 degrees of freedom
      Multiple R-Squared: 0.979,      Adjusted R-squared: 0.9787
      F-statistic: 4512 on 1 and 97 DF,  p-value: < 2.2e-16
      With this model, our estimating function is as follows:
      sub estimate_count_from_xp($) {
          my $xp = shift;
          10 ** ( 4.195 - 2.727e-2 * $xp );
      }
      From this, it's easy to see that we have classic exponential decay w.r.t. XP.

      Does this match your intuition?
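One way to make the decay rate concrete is to compute the "halving distance" implied by the linear fit's slope (the name and the calculation are mine, derived from the slope of 0.027268 reported above): the count halves whenever log10(count) drops by log10(2).

```perl
#!/usr/bin/perl -wl
use strict;

# slope of the linear fit of log10(count) against xp, from the
# R summary above
my $slope = 0.027268;

# count halves whenever log10(count) drops by log10(2),
# i.e. every log10(2) / $slope units of XP
my $halving_xp = log(2) / log(10) / $slope;
printf "node count halves roughly every %.1f XP\n", $halving_xp;
```

This works out to about 11 XP per halving, which gives a feel for how quickly high-reputation nodes thin out.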

        Thanks. I thought I was missing something.
