Re^3: Counting no of '>' symbol in a file

I was curious, so I ran some benchmarks. The substitution approach holds its own against the matching approach, and I find it easier to write from memory. The grep() and tr/// options are very good too; tr/// on the whole string was the fastest by a wide margin.

This wasn't the most scientific comparison; I threw in a bunch of things I was interested in. Some approaches are faster if you slurp the file, some if you go line by line, and others if you build an array first. Compared to the cost of reading the file off the disk, any of these should be fast enough for an average application.

use warnings;
use strict;
use Benchmark qw( timethese cmpthese );

my $sample = ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
";

foreach(1..10){
    $sample .= $sample;
}
my @sample_array = split "\n", $sample;

sub tr1_count_ar {
    return $sample =~ tr/>//;
}
sub tr2_count_ar {
    my $c;
    foreach(@sample_array) {
        $c += $_ =~ tr/>//;
    }
    return $c;
}
sub s1_count_ar {
    my $c;
    foreach(@sample_array) {
        $c += $_ =~ s/^>/>/g;
    }
    return $c;
}
sub s2_count_ar {
    return $sample =~ s/>/>/g;
}
sub m1_count_ar {
    my $c;
    $c += () = $sample =~ m/>/g;
    return $c;
}
sub m2_count_ar {
    my $c;
    foreach(@sample_array) {
        $c += () = $_ =~ m/^>/;
    }
    return $c;
}
sub m3_count_ar {
    my $c;
    foreach(@sample_array) {
        $c++ if $_ =~ m/^>/;
    }
    return $c;
}
sub g_count_ar {
    my $c;
    $c = scalar grep /^>/, @sample_array;
    return $c;
}

print "tr1: ", tr1_count_ar( ) ,"\n";
print "tr2: ", tr2_count_ar( ) ,"\n";
print "s1: ",  s1_count_ar( ) ,"\n";
print "s2: ",  s2_count_ar( ) ,"\n";
print "m1: ",  m1_count_ar( ) ,"\n";
print "m2: ",  m2_count_ar( ) ,"\n";
print "m3: ",  m3_count_ar( ) ,"\n";
print "g:  ",  g_count_ar( )  ,"\n";

print '-'x80,"\n";
print join "\n", @sample_array[0..6];
print "\n",'-'x80,"\n";

my $results = timethese (
    -10,
    {
        'tr1'  => 'tr1_count_ar',
        'tr2'  => 'tr2_count_ar',
        's1'   => 's1_count_ar',
        's2'   => 's2_count_ar',
        'm1'   => 'm1_count_ar',
        'm2'   => 'm2_count_ar',
        'm3'   => 'm3_count_ar',
        'grep' => 'g_count_ar'
    }
);
cmpthese($results);
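
A note on what is being counted: tr1, tr2, s2 and m1 count every '>' character, while s1, m2, m3 and grep only count lines that start with '>'. On this sample the two agree, but they would not if a header line contained a second '>'. A minimal sketch of the difference, using a made-up two-record string ($text is hypothetical and not part of the benchmark):

```perl
use strict;
use warnings;

# Hypothetical sample: the first header has an extra '>' in its description.
my $text = ">seq1 converts A->B\nACGT\n>seq2\nTTGA\n";

# Counts every '>' character anywhere in the string.
my $chars = ( $text =~ tr/>// );

# Counts only lines that begin with '>' (the header lines).
my $headers = grep { /^>/ } split /\n/, $text;

print "characters: $chars\n";
print "headers:    $headers\n";
```

If the goal is the number of sequences, the anchored versions are probably the safer choice; if the goal really is the number of '>' symbols, use the unanchored ones.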

The results:

tr1: 2048
tr2: 2048
s1: 2048
s2: 2048
m1: 2048
m2: 2048
m3: 2048
g:  2048
--------------------------------------------------------------------------------
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
>SEQUENCE_1
--------------------------------------------------------------------------------
Benchmark: running grep, m1, m2, m3, s1, s2, tr1, tr2 for at least 10 CPU seconds...
      grep: 11 wallclock secs (10.25 usr +  0.00 sys = 10.25 CPU) @ 894.04/s (n=9163)
        m1: 11 wallclock secs (10.81 usr +  0.00 sys = 10.81 CPU) @ 728.89/s (n=7880)
        m2: 10 wallclock secs (10.58 usr +  0.00 sys = 10.58 CPU) @ 385.18/s (n=4074)
        m3: 11 wallclock secs (10.47 usr +  0.00 sys = 10.47 CPU) @ 599.25/s (n=6273)
        s1: 11 wallclock secs (10.44 usr +  0.00 sys = 10.44 CPU) @ 432.16/s (n=4510)
        s2: 11 wallclock secs (10.39 usr +  0.00 sys = 10.39 CPU) @ 926.56/s (n=9626)
       tr1: 10 wallclock secs (10.52 usr +  0.00 sys = 10.52 CPU) @ 2473.89/s (n=26013)
       tr2: 11 wallclock secs (10.64 usr +  0.00 sys = 10.64 CPU) @ 581.26/s (n=6184)
       Rate   m2   s1  tr2   m3   m1 grep   s2  tr1
m2    385/s   -- -11% -34% -36% -47% -57% -58% -84%
s1    432/s  12%   -- -26% -28% -41% -52% -53% -83%
tr2   581/s  51%  35%   --  -3% -20% -35% -37% -77%
m3    599/s  56%  39%   3%   -- -18% -33% -35% -76%
m1    729/s  89%  69%  25%  22%   -- -18% -21% -71%
grep  894/s 132% 107%  54%  49%  23%   --  -4% -64%
s2    927/s 141% 114%  59%  55%  27%   4%   -- -63%
tr1  2474/s 542% 472% 326% 313% 239% 177% 167%   --
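
The benchmark above only works on a string already in memory. For an actual file, the slurp and line-by-line styles might look roughly like this. This is a sketch under assumptions: count_slurped and count_by_line are names I made up for illustration, and the temp file just stands in for a real FASTA file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Slurp the whole file and count every '>' character with tr///.
sub count_slurped {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                       # undef record separator: read it all at once
    my $content = <$fh>;
    close $fh;
    return 0 unless defined $content;
    return ( $content =~ tr/>// );
}

# Read line by line and count lines that begin with '>'.
sub count_by_line {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $count = 0;
    while ( my $line = <$fh> ) {
        $count++ if $line =~ /^>/;
    }
    close $fh;
    return $count;
}

# Stand-in input: a small temporary file with two records.
my ( $tmp_fh, $tmp_path ) = tempfile( UNLINK => 1 );
print {$tmp_fh} ">seq1\nACGT\n>seq2\nTTGA\n";
close $tmp_fh;

print "slurped: ", count_slurped($tmp_path), "\n";
print "by line: ", count_by_line($tmp_path), "\n";
```

Slurping is simplest when the file fits comfortably in memory; the line-by-line version keeps memory use flat for very large files.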