PerlMonks  

Using Statistics::Descriptive for percentiles

by Hena (Friar)
on Jun 02, 2010 at 08:26 UTC ( [id://842731]=perlquestion )

Hena has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I need to compute the median as well as the 25th and 75th percentiles (though I can work without the latter two) for a large dataset. For this I decided to use Statistics::Descriptive.

I have a dataset of ~30 million numbers, but decided to use only the first 5m to limit memory use (if someone knows of a module that uses less memory, let me know). I know that the variation between the numbers is not that large, so this should give me a large enough set to work from. However, I have run into a small problem: it doesn't return a value for the 25th percentile or the 1st quartile. Has anyone encountered this before? Should I report this as a bug?

I'm using Statistics::Descriptive 3.0100 (just hot off CPAN) and perl v5.10.0 (kubuntu 9.10).

Replies are listed 'Best First'.
Re: Using Statistics::Descriptive for percentiles
by JavaFan (Canon) on Jun 02, 2010 at 11:01 UTC
    If memory use is the problem, write the numbers to a file, call the program sort to sort the file, then read the median, 25th and 75th percentile from the appropriate line. Sure, this will take more time (sorting is N log N, finding the Nth element can be done in linear time), but sort knows how to deal with low memory (sort was written when 1Mb of memory was an awful lot, and out of reach for most).

    Sorting 30m numbers took a couple of minutes on my aging box, but once sorted, you can quickly answer any percentile query.
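A minimal sketch of the external-sort approach described above, assuming the numbers are already in a text file with one plain decimal number per line and that a POSIX sort(1) is on the PATH. The helper name percentiles_from_file and the ".sorted" scratch file are hypothetical, and the index rule (truncating p * (N-1) / 100) is just one of several common percentile conventions.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash ref mapping each requested percentile
# to the value found at the matching line of the externally sorted file.
sub percentiles_from_file {
    my ($infile, @percents) = @_;
    my $sorted = "$infile.sorted";    # assumed scratch file name

    # -n sorts numerically; sort(1) spills to temporary files on disk
    # when the input does not fit in memory.
    system('sort', '-n', '-o', $sorted, $infile) == 0
        or die "sort failed: $?";

    # First pass: count the lines.
    open my $fh, '<', $sorted or die "Cannot open $sorted: $!";
    my $count = 0;
    while (my $line = <$fh>) { $count++ }
    die "No data in $infile\n" unless $count;

    # Map each 0-based line index to the percentile(s) it answers.
    my %want;
    push @{ $want{ int($_ * ($count - 1) / 100) } }, $_ for @percents;

    # Second pass: pick out only the wanted lines.
    seek $fh, 0, 0 or die "Cannot rewind $sorted: $!";
    my $i = 0;
    my %result;
    while (my $line = <$fh>) {
        if (my $ps = $want{$i}) {
            chomp $line;
            $result{$_} = $line for @$ps;
        }
        $i++;
    }
    close $fh;
    unlink $sorted;
    return \%result;
}

# Usage (file name is an assumption):
#   my $r = percentiles_from_file('numbers.txt', 25, 50, 75);
#   print "$_%: $r->{$_}\n" for 25, 50, 75;
```

Only one line at a time is held in memory on the Perl side, so the memory cost is whatever sort(1) chooses to use.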

Re: Using Statistics::Descriptive for percentiles
by salva (Canon) on Jun 02, 2010 at 13:11 UTC
    use Tie::Array::Packed;

    tie my @n, 'Tie::Array::Packed::Number';
    push @n, rand for 1..30_000_000;
    tied(@n)->sort;
    for (0, 25, 50, 75, 100) {
        printf "%03d%%: %6.4f\n", $_, $n[$#n*$_/100];
    }
    That runs in one minute on my computer.
Re: Using Statistics::Descriptive for percentiles
by Khen1950fx (Canon) on Jun 02, 2010 at 10:05 UTC
    I used this script, an example from the docs, to test memory use. 5m is way too much. Cut it back to 2m, and you're ready to roll:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Statistics::Descriptive::Weighted;

    my @data = (1..2000000);
    my $stat = Statistics::Descriptive::Full->new();
    $stat->add_data(@data);
    print $stat->quantile(1), "\n";
    Update: Maybe this might help:
    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my @numbers = (1..2000000);
    printf "Percentile %d%% at %f\n", $_, percentile($_,\@numbers) for qw/25 75/;

    sub percentile {
        my ($p,$aref) = @_;
        my $percentile = int($p * $#{$aref}/100);
        return (sort @$aref)[$percentile];
    }
      Unfortunately, dropping it down to 2m, or even to 200k, doesn't help me.
        That raises questions: how much RAM do you have, what OS are you running, and are you absolutely, positively sure you copied the code correctly?

        If you're not extraordinarily short of RAM (absolute value will vary by OS) and your code is correct, it would be a kindness to the module's author and others unknown to see if you can identify the cause of your problem... and to file a bug report if relevant.

      This didn't work on my Perl v5.18.4.

      The final line needs to be changed to:

      return (sort {$a <=> $b} @$aref)[$percentile];

      because otherwise it sorts in alphabetical rather than numeric order.
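A small sketch of that numeric-sort fix, reworked so the list is sorted only once however many percentiles are requested. The helper name percentiles and the sample data are hypothetical; the index rule is the same truncating int($p * $#array / 100) used by percentile() earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: sort numerically once, then look up each requested
# percentile by index.
sub percentiles {
    my ($aref, @percents) = @_;
    my @sorted = sort { $a <=> $b } @$aref;    # numeric, not string, order
    return map { $sorted[ int($_ * $#sorted / 100) ] } @percents;
}

# Sample data chosen so that string order and numeric order differ.
my @numbers = (9, 100, 25, 3, 71);
my ($q1, $median, $q3) = percentiles(\@numbers, 25, 50, 75);
printf "25%%: %d, 50%%: %d, 75%%: %d\n", $q1, $median, $q3;
```

With a plain sort, the sample data would come out in the order 100, 25, 3, 71, 9, which is why the comparison block matters.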
