speeding up parsing, jump to line

by cburger (Acolyte)
on Aug 15, 2014 at 02:30 UTC

cburger has asked for the wisdom of the Perl Monks concerning the following question:

I'm parsing an input file of human chromosome locations, retrieving the start and end position of each region, and then reading the corresponding chromosome score file to pick out the scores at those positions so I can calculate a mean score for the whole region. The process is terribly inefficient, even if I split the files, and each chromosome file is already around 3 GB. Is there a better approach to this problem? Perhaps I can jump to the right locations in a speedier way? Help appreciated.

use Statistics::Descriptive;   # needed for the geometric mean below

# OUT is assumed to be opened for writing before this loop.
open (F, $data) || die;
while (<F>) {
    chomp;
    if ($_ !~ /Header/) {
        @line  = split("\t", $_);
        $table = $_;
        $chr   = $line[0];
        $start = $line[2];
        $end   = $line[3];

        # per-chromosome score file name, e.g. chr1.score.txt
        $name   = "score.txt";
        @array  = ();
        @sorted = ();
        $sum    = 0;    # running sum for this region

        open (R, "$chr.$name") || die ("cannot open chrom file\n");
        while (<R>) {
            chomp;
            # format: chrom ID, position, score
            # chr1    10925   0.239
            @q     = split("\t", $_);
            $pos   = $q[1];
            $score = $q[2];
            if ($pos >= $start) {
                if ($pos <= $end) {
                    push (@array, $score);
                    $sum = $sum + $score;
                }
            }
        }
        close(R);

        # mean, min, max, geometric mean and count for the region
        $len    = $#array + 1;
        @sorted = sort {$a <=> $b} @array;
        $min    = $sorted[0];
        $max    = $sorted[$#sorted];
        $mean   = $sum / $len;

        $stat = Statistics::Descriptive::Full->new();
        $stat->add_data(@sorted);
        $geo_mean = $stat->geometric_mean();

        print OUT "$table\t$mean\t$geo_mean\t$min\t$max\t$len\n";
    }
}
close(F);
close(OUT);

Replies are listed 'Best First'.
Re: speeding up parsing, jump to line
by dave_the_m (Monsignor) on Aug 15, 2014 at 09:28 UTC
    How many lines (approx) are there in the $data file? Are the chromosome files already sorted by pos or not? I presume that there are approx 23 chromosome files?

    Dave.

      There are 40,000 lines in the data file, and each chromosome file has about 50-200 million lines. At the moment the script takes ~20 minutes to print out one line.
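      If the chromosome files do turn out to be sorted by position, the "jump to the right locations" the question asks for can be approximated with a binary search over byte offsets: each region then costs a handful of seeks plus a short scan instead of a full pass over a 3 GB file. This is only a sketch under that sortedness assumption; seek_to_pos and scores_in_range are illustrative names, and the chrN.score.txt layout is taken from the question.

      use strict;
      use warnings;

      # Move the read position of $fh to (at or just before) the first line
      # whose position field is >= $target. Assumes the file is sorted by
      # position and tab-delimited as "chrom\tpos\tscore".
      sub seek_to_pos {
          my ($fh, $target) = @_;
          my ($lo, $hi) = (0, -s $fh);
          while ($hi - $lo > 4096) {          # stop once the window is small
              my $mid = int(($lo + $hi) / 2);
              seek $fh, $mid, 0;
              <$fh>;                          # discard a (possibly partial) line
              my $line = <$fh>;
              last unless defined $line;
              my (undef, $pos) = split /\t/, $line;
              if ($pos < $target) { $lo = $mid } else { $hi = $mid }
          }
          seek $fh, $lo, 0;
          <$fh> if $lo;                       # re-align to a line boundary
          return;
      }

      # Jump close to $start, then scan only until $end.
      sub scores_in_range {
          my ($file, $start, $end) = @_;
          open my $fh, '<', $file or die "$file: $!";
          seek_to_pos($fh, $start);
          my @scores;
          while (<$fh>) {
              chomp;
              my (undef, $pos, $score) = split /\t/;
              next if $pos < $start;
              last if $pos > $end;
              push @scores, $score;
          }
          close $fh;
          return @scores;
      }

      The per-region cost then drops from a full pass over the 3 GB file to a few seeks plus a scan of just the start..end window.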

Re: speeding up parsing, jump to line
by graff (Chancellor) on Aug 15, 2014 at 13:19 UTC
    The answers to dave_the_m's questions will determine what sort of solution to look for. E.g. if the $data file is relatively small, you can load it into memory, then read each *.score.txt file exactly once to load up relevant info for each item in $data, then do stats on the info - something like this:
    use strict;
    use warnings;

    my $data = "some/file.name";
    open( F, $data ) or die "$data: $!\n";

    my %targets;
    while (<F>) {
        next if ( /Header/ );
        chomp;
        my ( $chr, $start, $end ) = ( split( /\t/ ) )[ 0, 2, 3 ];
        push @{ $targets{$chr} }, { table => $_, start => $start, end => $end };
    }

    for my $chr ( keys %targets ) {
        open( R, "$chr.score.txt" ) or die "$chr.score.txt: $!\n";
        while (<R>) {
            chomp;
            my ( $pos, $score ) = ( split( /\t/ ) )[ 1, 2 ];
            for my $range ( @{ $targets{$chr} } ) {
                if ( $pos >= $$range{start} and $pos <= $$range{end} ) {
                    push @{ $$range{scores} }, $score;
                }
            }
        }
    }
    # do statistics on contents of %targets…
    If $data contains too much stuff, and/or requires too much stuff from the *.score.txt files to be held in memory for each $chr, then maybe you have to create output files for each $chr (or each start-end range in $data), so that it'll be quick/easy to do stats on those files.
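    For the statistics step left as a trailing comment in the code above, here is a minimal sketch of how that script might continue: it walks %targets, computes the same figures as the original program, and prints to STDOUT (the original wrote to an OUT handle instead); geo_mean is the same roll-your-own helper shown in Cristoforo's reply below.

    use List::Util qw/ sum min max /;

    for my $chr ( keys %targets ) {
        for my $range ( @{ $targets{$chr} } ) {
            my $scores = $$range{scores} or next;   # skip ranges with no hits
            my $len  = @$scores;
            my $mean = sum(@$scores) / $len;
            my $geo  = geo_mean(@$scores);
            print join( "\t", $$range{table}, $mean, $geo,
                        min(@$scores), max(@$scores), $len ), "\n";
        }
    }

    sub geo_mean {
        my $prod = 1;
        $prod *= $_ for @_;
        return $prod ** ( 1 / @_ );
    }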
Re: speeding up parsing, jump to line
by Cristoforo (Curate) on Aug 18, 2014 at 19:45 UTC
    Just one way I see that you can get a speedup is to not use Statistics::Descriptive (for the geometric mean). You can roll your own.

    Here is a benchmark: it compares your method with one where I didn't use Statistics::Descriptive.

    Depending on the number of scores in the array, my method shows a speedup of about 12 times for 1000 scores up to about 20 times for 100 scores.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Statistics::Descriptive;
    use Benchmark qw/ cmpthese /;
    use List::Util qw/ sum min max /;

    for my $len ( 100, 500, 1000 ) {
        my @scores = map { sprintf "%.3f", rand 29 } 1 .. $len;
        print "\n\nNumber of scores: $len\n";
        my $results = cmpthese( -1, {
            stat => sub {
                my $stat = Statistics::Descriptive::Full->new();
                $stat->add_data(@scores);
                @scores = sort { $a <=> $b } @scores;
                my ( $min, $max ) = @scores[ 0, -1 ];
                my $len      = @scores;                  # number of scores
                my $mean     = sum(@scores) / @scores;
                my $geo_mean = $stat->geometric_mean();
            },
            util => sub {
                my $len      = @scores;                  # number of scores
                my $min      = min @scores;
                my $max      = max @scores;
                my $mean     = sum(@scores) / @scores;
                my $geo_mean = geo_mean(@scores);
            },
        } );
    }

    sub geo_mean {
        my $prod = 1;
        $prod *= $_ for @_;
        return $prod ** ( 1 / @_ );
    }
    This benchmark produced the following output.
    Number of scores: 100
              Rate  stat  util
    stat    5265/s    --  -95%
    util  103936/s 1874%    --

    Number of scores: 500
             Rate  stat  util
    stat   1652/s    --  -93%
    util  24549/s 1386%    --

    Number of scores: 1000
             Rate  stat  util
    stat    890/s    --  -93%
    util  12274/s 1278%    --
    I believe further speedups could be achieved by putting your large 3GB score files into a database. The program might look like this. It assumes you have created a separate table for each of the score files.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::Util qw/ sum min max /;
    use DBI;

    my $dbh = DBI->connect( "dbi:SQLite:dbname=junk.lite", "", "",
                            { PrintError => 1 } ) or die "Can't connect";

    my $data = "data.txt";
    open F,   '<', $data         or die "Unable to open '$data' for reading: $!";
    open OUT, '>', 'outfile.csv' or die "Unable to open 'outfile.csv' for writing: $!";

    while ( my $table = <F> ) {
        next if $table =~ /Header/;
        chomp $table;
        my ( $chr, undef, $start, $end ) = split /\t/, $table;

        my $sql = <<SQL;
    SELECT score FROM $chr WHERE pos >= ? AND pos <= ?
    SQL

        my $sth = $dbh->prepare($sql);
        $sth->execute( $start, $end ) or die $DBI::errstr;

        my $set    = $sth->fetchall_arrayref;
        my @scores = map $_->[0], @$set;
        next unless @scores;

        my $len      = @scores;              # number of scores
        my $min      = min @scores;
        my $max      = max @scores;
        my $mean     = sum(@scores) / @scores;
        my $geo_mean = geo_mean(@scores);

        print OUT join( "\t", $table, $mean, $geo_mean, $min, $max, $len ), "\n";
    }

    close(F)   or die "Unable to close '$data': $!";
    close(OUT) or die "Unable to close 'outfile.csv': $!";

    sub geo_mean {
        my $prod = 1;
        $prod *= $_ for @_;
        return $prod ** ( 1 / @_ );
    }
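    The query script above assumes the per-chromosome tables already exist. A one-time loader for building them from the chrN.score.txt files might look like the sketch below; the table layout, file list and index names are assumptions, and the index on pos is what makes the range query above fast.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( "dbi:SQLite:dbname=junk.lite", "", "",
                            { RaiseError => 1, AutoCommit => 0 } );

    for my $chr ( map { "chr$_" } 1 .. 22, 'X', 'Y' ) {
        my $file = "$chr.score.txt";
        next unless -e $file;

        $dbh->do("DROP TABLE IF EXISTS $chr");
        $dbh->do("CREATE TABLE $chr (pos INTEGER, score REAL)");

        my $ins = $dbh->prepare("INSERT INTO $chr (pos, score) VALUES (?, ?)");
        open my $fh, '<', $file or die "$file: $!";
        while (<$fh>) {
            chomp;
            my ( undef, $pos, $score ) = split /\t/;
            $ins->execute( $pos, $score );
        }
        close $fh;

        # the index turns "WHERE pos >= ? AND pos <= ?" into a range lookup
        # instead of a full table scan
        $dbh->do("CREATE INDEX idx_${chr}_pos ON $chr (pos)");
        $dbh->commit;
    }
    $dbh->disconnect;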
Re: speeding up parsing, jump to line
by Anonymous Monk on Aug 15, 2014 at 17:20 UTC

    If the score-file entries are distinguished by a unique position, you essentially have a (sparse) array $score[$pos]. You'll want to build a temporary database (mapping) for each of the score files.
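    As a concrete sketch of that sparse-array idea (all names, the pack format, and the use of 0 to mean "no score at this position" are assumptions for illustration): write each chromosome's scores once into a fixed-width binary file whose byte offset is 4 * position, after which any start..end window can be pulled back with a single seek and read.

    use strict;
    use warnings;

    # One-time build step per chromosome: pack each score as a 4-byte float
    # at offset 4 * pos; positions that never get a score read back as 0.
    sub build_packed_scores {
        my ($chr) = @_;
        open my $in,  '<', "$chr.score.txt" or die $!;
        open my $out, '>', "$chr.score.bin" or die $!;
        binmode $out;
        while (<$in>) {
            chomp;
            my ( undef, $pos, $score ) = split /\t/;
            seek $out, 4 * $pos, 0;
            print {$out} pack( 'f', $score );
        }
        close $in;
        close $out;
    }

    # Later: fetch every score in start..end with one seek and one read.
    sub packed_scores_in_range {
        my ( $chr, $start, $end ) = @_;
        open my $fh, '<', "$chr.score.bin" or die $!;
        binmode $fh;
        seek $fh, 4 * $start, 0;
        read $fh, my $buf, 4 * ( $end - $start + 1 );
        close $fh;
        # drop the zero slots, i.e. positions that had no score line
        return grep { $_ != 0 } unpack 'f*', $buf;
    }

    Note that pack 'f' stores single-precision floats, so scores come back with roughly seven significant digits, which is plenty for three-decimal scores like the 0.239 in the example.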
