Just one way I see that you can get a speedup is to not use Statistics::Descriptive (for the geometric mean). You can roll your own.
Here is a benchmark - it compares your method with one where I didn't use Statistics::Descriptive;
Depending on the number of scores in the array, my method shows a speedup of 12 times, for 1000 scores to 20 times, for 100 scores in the array.
#!/usr/bin/perl
use strict;
use warnings;
use Statistics::Descriptive;
use Benchmark qw/ cmpthese /;
use List::Util qw/ sum min max /;
for my $len (100, 500, 1000) {
my @scores = map {sprintf "%.3f", rand 29} 1 .. $len;
print "\n\nNumber of scores: $len\n";
my $results = cmpthese (-1,
{
stat => sub {
my $stat = Statistics::Descriptive::Full->new();
$stat->add_data(@scores);
@scores = sort {$a <=> $b} @scores;
my ($min, $max) = @scores[0, -1];
my $len = @scores; # number of scores
my $mean = sum(@scores) / @scores;
my $geo_mean = $stat->geometric_mean();
},
util => sub {
my $len = @scores; # number of scores
my $min = min @scores;
my $max = max @scores;
my $mean = sum(@scores) / @scores;
my $geo_mean = geo_mean(@scores);
},
},
);
}
sub geo_mean {
my $prod = 1;
$prod *= $_ for @_;
return $prod ** (1/ @_);
}
This benchmark produced the following output.
Number of scores: 100
Rate stat util
stat 5265/s -- -95%
util 103936/s 1874% --
Number of scores: 500
Rate stat util
stat 1652/s -- -93%
util 24549/s 1386% --
Number of scores: 1000
Rate stat util
stat 890/s -- -93%
util 12274/s 1278% --
I believe further speedups could be achieved by putting your large 3GB score files into a database. The program might look like this. It assumes you have created a separate table for each of the score files.
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/ sum min max /;
use DBI;
my $dbh = DBI->connect("dbi:SQLite:dbname=junk.lite","","",
{PrintError => 1}) or die "Can't connect";
my $data = "data.txt";
open F, '<', $data or die "Unable to open '$data' for reading: $!";
open OUT, '>', 'outfile.csv'
or die "Unable to open 'outfile.csv' for writing: $!";
while(my $table = <F>) {
next if $table =~ /Header/;
chomp $table;
my ($chr, undef, $start, $end) = split /\t/, $table;
my $sql = <<SQL;
SELECT score
FROM $chr
WHERE pos >= ? AND pos <= ?
SQL
my $sth = $dbh->prepare($sql);
$sth->execute( $start, $end ) or die $DBI::errstr; ;
my $set = $sth->fetchall_arrayref;
my @scores = map $_->[0], @$set;
next unless @scores;
my $len = @scores; # number of scores
my $min = min @scores;
my $max = max @scores;
my $mean = sum(@scores) / @scores;
my $geo_mean = geo_mean(@scores);
print OUT join("\t",
$table, $mean, $geo_mean, $min, $max, $len), "\n";
}
close(F) or die "Unable to close '$data': $!";
close(OUT) or die "Unable to close 'outfile.csv' $!";
sub geo_mean {
my $prod = 1;
$prod *= $_ for @_;
return $prod ** (1/ @_);
}
|