
Re^3: Counting no of '>' symbol in a file

by Lotus1 (Vicar)
on Sep 01, 2015 at 18:44 UTC ( [id://1140708]=note )


in reply to Re^2: Counting no of '>' symbol in a file
in thread Counting no of '>' symbol in a file

I was curious, so I ran some benchmarks. The substitution approach holds its own against the matching approach, and I find it easier to code from memory. The grep() and tr/// options are very good too; tr/// on the whole string was the fastest by a wide margin.

This wasn't the most scientific comparison; I threw in a bunch of things I was interested in. Some approaches are faster if you slurp the file, some if you go line by line, and another if you build an array. Compared to the cost of reading the file off the disk, any of these should be fast enough for an average application.
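For reference, here is a minimal sketch of the slurp and line-by-line styles applied to a filehandle instead of a pre-loaded string. It is not part of the benchmark: the sub names count_slurped and count_by_line and the tiny two-record sample are made up for illustration, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical two-record sample; the in-memory filehandles below stand in
# for a real FASTA file on disk.
my $fasta = ">seq1 demo\nACGT\nACGT\n>seq2 demo\nTTGA\n";

# Slurp the whole input, then count every '>' with tr///.
# Note: this counts '>' anywhere, not only at the start of a line.
sub count_slurped {
    my ($fh) = @_;
    local $/;                      # undef record separator: read it all at once
    my $data = <$fh>;
    return 0 unless defined $data;
    return ( $data =~ tr/>// );
}

# Read line by line and count only the lines that begin with '>'.
sub count_by_line {
    my ($fh) = @_;
    my $count = 0;
    while ( my $line = <$fh> ) {
        $count++ if $line =~ /^>/;
    }
    return $count;
}

open my $fh1, '<', \$fasta or die "Cannot open in-memory file: $!";
open my $fh2, '<', \$fasta or die "Cannot open in-memory file: $!";

my $slurped = count_slurped($fh1);
my $by_line = count_by_line($fh2);
print "slurped: $slurped\n";
print "by line: $by_line\n";
```

The two counts agree only when '>' never appears inside a header or sequence line, which holds for the sample data in this node but is worth checking for your own files.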

use warnings;
use strict;
use Benchmark qw( timethese cmpthese );

my $sample = ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
";

# Double the sample 10 times: 2 headers * 2**10 = 2048 headers.
foreach (1..10) {
    $sample .= $sample;
}

my @sample_array = split "\n", $sample;

# tr1, s2 and m1 work on the whole string and count every '>'.
# tr2 counts every '>' too, but line by line over the array.
# s1, m2, m3 and grep count only lines that start with '>'.

sub tr1_count_ar {
    return $sample =~ tr/>//;
}

sub tr2_count_ar {
    my $c;
    foreach (@sample_array) {
        $c += $_ =~ tr/>//;
    }
    return $c;
}

sub s1_count_ar {
    my $c;
    foreach (@sample_array) {
        $c += $_ =~ s/^>/>/g;
    }
    return $c;
}

sub s2_count_ar {
    return $sample =~ s/>/>/g;
}

sub m1_count_ar {
    my $c;
    $c += () = $sample =~ m/>/g;
    return $c;
}

sub m2_count_ar {
    my $c;
    foreach (@sample_array) {
        $c += () = $_ =~ m/^>/;
    }
    return $c;
}

sub m3_count_ar {
    my $c;
    foreach (@sample_array) {
        $c++ if $_ =~ m/^>/;
    }
    return $c;
}

sub g_count_ar {
    my $c;
    $c = scalar grep /^>/, @sample_array;
    return $c;
}

# Sanity check: every approach should report the same count.
print "tr1: ", tr1_count_ar( ) ,"\n";
print "tr2: ", tr2_count_ar( ) ,"\n";
print "s1: ", s1_count_ar( ) ,"\n";
print "s2: ", s2_count_ar( ) ,"\n";
print "m1: ", m1_count_ar( ) ,"\n";
print "m2: ", m2_count_ar( ) ,"\n";
print "m3: ", m3_count_ar( ) ,"\n";
print "g: ", g_count_ar( ) ,"\n";

print '-'x80,"\n";
print join "\n", @sample_array[0..6];
print "\n",'-'x80,"\n";

# Run each sub for at least 10 CPU seconds, then print the comparison chart.
my $results = timethese ( -10, {
    'tr1'  => 'tr1_count_ar',
    'tr2'  => 'tr2_count_ar',
    's1'   => 's1_count_ar',
    's2'   => 's2_count_ar',
    'm1'   => 'm1_count_ar',
    'm2'   => 'm2_count_ar',
    'm3'   => 'm3_count_ar',
    'grep' => 'g_count_ar'
} );

cmpthese($results);

The results:

tr1: 2048
tr2: 2048
s1: 2048
s2: 2048
m1: 2048
m2: 2048
m3: 2048
g: 2048
--------------------------------------------------------------------------------
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
>SEQUENCE_1
--------------------------------------------------------------------------------
Benchmark: running grep, m1, m2, m3, s1, s2, tr1, tr2 for at least 10 CPU seconds...
      grep: 11 wallclock secs (10.25 usr +  0.00 sys = 10.25 CPU) @ 894.04/s (n=9163)
        m1: 11 wallclock secs (10.81 usr +  0.00 sys = 10.81 CPU) @ 728.89/s (n=7880)
        m2: 10 wallclock secs (10.58 usr +  0.00 sys = 10.58 CPU) @ 385.18/s (n=4074)
        m3: 11 wallclock secs (10.47 usr +  0.00 sys = 10.47 CPU) @ 599.25/s (n=6273)
        s1: 11 wallclock secs (10.44 usr +  0.00 sys = 10.44 CPU) @ 432.16/s (n=4510)
        s2: 11 wallclock secs (10.39 usr +  0.00 sys = 10.39 CPU) @ 926.56/s (n=9626)
       tr1: 10 wallclock secs (10.52 usr +  0.00 sys = 10.52 CPU) @ 2473.89/s (n=26013)
       tr2: 11 wallclock secs (10.64 usr +  0.00 sys = 10.64 CPU) @ 581.26/s (n=6184)
       Rate   m2   s1  tr2   m3   m1 grep   s2  tr1
m2    385/s   -- -11% -34% -36% -47% -57% -58% -84%
s1    432/s  12%   -- -26% -28% -41% -52% -53% -83%
tr2   581/s  51%  35%   --  -3% -20% -35% -37% -77%
m3    599/s  56%  39%   3%   -- -18% -33% -35% -76%
m1    729/s  89%  69%  25%  22%   -- -18% -21% -71%
grep  894/s 132% 107%  54%  49%  23%   --  -4% -64%
s2    927/s 141% 114%  59%  55%  27%   4%   -- -63%
tr1  2474/s 542% 472% 326% 313% 239% 177% 167%   --
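In case the percentages in the chart aren't obvious: as far as I can tell, each cell is the row's rate relative to the column's rate, minus 100%. Here is a small sketch that reproduces two of the cells from the per-second rates reported by timethese above. The helper name pct_diff is mine, not something Benchmark exports.

```perl
use strict;
use warnings;

# Rates (iterations per second) copied from the timethese output above.
my %rate = ( tr1 => 2473.89, s2 => 926.56, m2 => 385.18 );

# Hypothetical helper: how much faster (+) or slower (-) the row entry is
# than the column entry, as a rounded percentage.
sub pct_diff {
    my ( $row, $col ) = @_;
    return sprintf "%.0f%%", 100 * $rate{$row} / $rate{$col} - 100;
}

my $tr1_vs_s2 = pct_diff( 'tr1', 's2' );
my $m2_vs_tr1 = pct_diff( 'm2',  'tr1' );
print "tr1 vs s2:  $tr1_vs_s2\n";
print "m2  vs tr1: $m2_vs_tr1\n";
```

So the 167% in the tr1 row means tr/// on the slurped string ran roughly 2.7 times as many iterations per second as the whole-string substitution.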
