Faster and more efficient way to read a file vertically

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!
I have a file with millions of lines that look like this (DNA sequences):

ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA

My question is: how can I read it vertically, i.e. extract, say, the 10th column? All lines are the same length, but they are not tab- or space-separated, so I cannot use the cut command with a delimiter. My approach would be to split each line and then keep only the 10th letter everywhere, but this takes an enormous amount of time, and I was hoping there might be an easier/faster way to do it.
Any ideas?

Re: Faster and more efficient way to read a file vertically by BrowserUk (Patriarch) on Nov 03, 2017 at 20:12 UTC
If you make an array of substr references to the characters in a buffer, and then overlay each line into that buffer, the cost of splitting/indexing the strings is paid only once:

    #! perl -slw
    use strict;

    my $c = $ARGV[ 0 ] // 25;

    my $buf = chr(0) x 62;
    my @cRefs = map \substr( $buf, $_, 1 ), 0 .. length( $buf )-1;

    until( eof( DATA ) ) {
        substr( $buf, 0 ) = <DATA>;
        print ${ $cRefs[ $c ] };
    }

    __DATA__
    ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz
    ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz
    ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz
    ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz

A few runs:

    C:\test>1202693 0
    A
    A
    A
    A

    C:\test>1202693 25
    Z
    Z
    Z
    Z

    C:\test>1202693 32
    6
    6
    6
    6

    C:\test>1202693 61
    z
    z
    z
    z

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
Re: Faster and more efficient way to read a file vertically by choroba (Cardinal) on Nov 03, 2017 at 15:23 UTC
My `cut` (GNU 8.25) also supports the `-c` and `-b` options to only print the given character or byte range, respectively.

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re^3: Faster and more efficient way to read a file vertically by Anonymous Monk on Nov 03, 2017 at 15:27 UTC
Great, I also saw it now! So basically I can say `cut -c 10` and get the 10th character. Thank you very much!
Re: Faster and more efficient way to read a file vertically by Laurent_R (Canon) on Nov 03, 2017 at 18:27 UTC
This is a perl one-liner doing just what you want:

    $ echo 'ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    > ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    > ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    > ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    > ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    > ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    > ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    > ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    > ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    > ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    > ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    > ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    > ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA' | perl -nE 'say substr($_, 10, 1);'
    C
    C
    C
    C
    C
    C
    C
    C
    C
    C
    C
    C
    C

Check, though, that 10 is the right second parameter for substr; you may have to change it depending on which character you want exactly. Offsets are zero-based, so the 10th character is at offset 9.
Re^2: Faster and more efficient way to read a file vertically by dbuckhal (Chaplain) on Nov 03, 2017 at 22:50 UTC
Another one-liner: `$ perl -F'' -anE 'say $F[9]'`
Re: Faster and more efficient way to read a file vertically -- updated by Discipulus (Canon) on Nov 03, 2017 at 16:09 UTC
Hello, millions of lines probably still fit in memory. Note that `$#{$aoa[0]}` assumes all lines are of the same length, as you said.

    use strict;
    use warnings;

    my @aoa;
    while (<DATA>) {
        chomp;
        push @aoa, [ split '', $_ ];
    }

    foreach my $col ( 0 .. $#{ $aoa[0] } ) {
        print "Column $col: ",
            ( join ' ', map { $aoa[$_][$col] } 0 .. $#aoa ), "\n";
    }

    __DATA__
    ACATCACCTC
    ACATCACCTC
    ACATCACCTC
    ACATCACCTC

    # out
    Column 0: A A A A
    Column 1: C C C C
    Column 2: A A A A
    Column 3: T T T T
    Column 4: C C C C
    Column 5: A A A A
    Column 6: C C C C
    Column 7: C C C C
    Column 8: T T T T
    Column 9: C C C C

L*

UPDATE: if you really care about memory you can try the following (untested) approach:

    # sketch only: assumes $fh is an already opened read handle

    # analyze the first line
    my $line = <$fh>;
    chomp $line;

    # last valid index (beware of possible off-by-one errors!)
    my $last = length($line) - 1;

    # rewind the filehandle
    seek $fh, 0, 0;

    sub get_column {
        my ( $col, $line ) = @_;
        return undef if $col < 0 or $col > $last;
        # skip $col characters, then capture the next one
        return $line =~ /^.{$col}(.)/ ? $1 : undef;
    }

    while (<$fh>) {
        chomp;
        print get_column( 3, $_ ), "\n";
    }

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re^2: Faster and more efficient way to read a file vertically -- updated by Laurent_R (Canon) on Nov 03, 2017 at 17:59 UTC
> million of lines still probably fit in memory.

Maybe. Or maybe not. But why take the chance? Especially with an AoA, which has some extra cost. It is so easy to do everything in the first loop, when reading each line. And BTW, it is also probably faster, because using an array of arrays implies copying the data once more.
Re^3: Faster and more efficient way to read a file vertically -- updated by Discipulus (Canon) on Nov 03, 2017 at 18:05 UTC
Yes Laurent_R, you are absolutely right and I probably gave a dumb answer. I did not even look at the other replies carefully before posting: as my only excuse I can say I was filling the bathtub.. ;=)

If the data must be accessed many times, it is probably worth putting it into an SQLite db, one char per column, and accessing it via SQL queries. No big memory overhead and super speed.

L*
Re: Faster and more efficient way to read a file vertically by karlgoethebier (Abbot) on Nov 04, 2017 at 11:14 UTC
Stolen, cannibalized and slightly adapted from this older thread: Threads From Hell #2: How To Search A Very Huge File [SOLVED]:

    #!/usr/bin/env perl

    # http://www.perlmonks.org/?node_id=1202693
    # $Id: loop.pl,v 1.2 2017/11/04 11:02:41 karl Exp karl $

    use strict;
    use warnings;
    use MCE::Loop;
    use Time::HiRes qw( time );
    use feature qw(say);

    my $file = q(data.txt);

    MCE::Loop::init( { max_workers => 4, use_slurpio => 1 } );

    my $start = time;

    my @result = mce_loop_f {
        my $slurp_ref = $_[1];
        my @column;
        open my $fh, '<', $slurp_ref;
        binmode $fh, ':raw';
        while (<$fh>) {
            push @column, substr( $_, 10, 1 );
        }
        close $fh;
        MCE->gather(@column);

        # sleep 2;
    } $file;

    say join( '', @result );

    printf "Took %.3f seconds\n", time - $start;

    __END__

Thanks to marioroy. See also MCE.

Update: To avoid the call to `binmode` please see Encoding horridness revisited: What's going on here? [SOLVED].

Regards, Karl

«The Crux of the Biscuit is the Apostrophe»
`perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'` Help
Re: Faster and more efficient way to read a file vertically by thanos1983 (Parson) on Nov 03, 2017 at 19:03 UTC
Hello Anonymous Monk,

A similar question was asked at the Monastery before: How do I get the Nth Character of a String?. Here is some sample code from that thread:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;
    use feature 'say';
    # use Benchmark qw(:all) ; # WindowsOS
    use Benchmark::Forking qw( timethese cmpthese ); # UnixOS

    sub getn_unpack {
        return unpack "x" . ($_[1]-1) . "a", $_[0];
    }

    sub getn_substr {
        return substr $_[0], $_[1]-1, 1;
    }

    sub getn_split {
        return +(split //, $_[0])[$_[1]-1];
    }

    my $strNum = "12345678910";
    my $string = "ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA";

    # say getn_unpack($string, 10);
    # say getn_substr($string, 10);
    # say getn_split($string, 10);

    my $results = timethese(1000000000, {
        'unpack' => getn_unpack($string, 10),
        'substr' => getn_substr($string, 10),
        'split'  => getn_split($string, 10),
    }, 'none');

    cmpthese( $results );

    __END__

    $ perl test.pl
                  Rate unpack substr  split
    unpack 171232877/s     --   -23%   -31%
    substr 223713647/s    31%     --   -10%
    split  248138958/s    45%    11%     --

It looks like the more efficient choice would be to use unpack. Something like that could do what you need: reading one line at a time, extract the data that you want (one character) and finally push it into an array. Sample code below:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    sub getn_unpack {
        return unpack "x" . ($_[1]-1) . "a", $_[0];
    }

    my $file = 'data.txt';
    my @array;

    if (open(my $fh, '<', $file)) {
        while (<$fh>) {
            chomp;
            push @array, getn_unpack($_, 10);
        }
    } else {
        warn "Could not open file '$file' $!\n";
    }

    print Dumper \@array;

    __END__

    $ cat data.txt
    ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    ACATCACCTACCACAACGAGGACTACACCATCGTGGAACA

    $ perl test.pl
    $VAR1 = [
              'C',
              'A'
            ];

Update: Thanks to fellow Monk karlgoethebier for observing my mistake. I would suggest an alternative solution to your problem: use split instead of unpack. See the sample code below:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    sub getn_split {
        return +(split //, $_[0])[$_[1]-1];
    }

    my $file = 'data.txt';
    my @array;

    if (open(my $fh, '<', $file)) {
        while (<$fh>) {
            chomp;
            push @array, getn_split($_, 10);
        }
    } else {
        warn "Could not open file '$file' $!\n";
    }

    print Dumper \@array;

    __END__

    $ cat data.txt
    ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    ACATCACCTACCACAACGAGGACTACACCATCGTGGAACA

    $ perl test.pl
    $VAR1 = [
              'C',
              'A'
            ];

Hope this helps, BR

Seeking for Perl wisdom...on the process of learning...not there...yet!
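One thing worth checking in the benchmark above: `timethese` and `cmpthese` expect each hash value to be a code reference (or a string of code), whereas `getn_unpack($string, 10)` is evaluated once, before any timing starts. The following is only a sketch of the same comparison with each call wrapped in an anonymous sub. The count of `-1` (run each candidate for at least one CPU second) and the use of plain `Benchmark` instead of `Benchmark::Forking` are my assumptions, not part of the original post, and the rates will of course differ per machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# The same three extractors as in the post; $_[1] is a 1-based position.
sub getn_unpack { return unpack "x" . ( $_[1] - 1 ) . "a", $_[0]; }
sub getn_substr { return substr $_[0], $_[1] - 1, 1; }
sub getn_split  { return ( split //, $_[0] )[ $_[1] - 1 ]; }

my $string = "ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA";

# Each value is a code ref, so the call itself is what gets timed.
# A negative count means "run for at least that many CPU seconds".
cmpthese( -1, {
    unpack => sub { getn_unpack( $string, 10 ) },
    substr => sub { getn_substr( $string, 10 ) },
    split  => sub { getn_split( $string, 10 ) },
} );
```

As discussed further down the thread, the chart printed by `cmpthese` is sorted from slowest to fastest.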
Re^2: Faster and more efficient way to read a file vertically by karlgoethebier (Abbot) on Nov 05, 2017 at 14:15 UTC
"...It looks like the more efficient choice would be to use unpack..."

I'm not so sure. As you wrote:

    $ perl test.pl
                  Rate unpack substr  split
    unpack 171232877/s     --   -23%   -31%
    substr 223713647/s    31%     --   -10%
    split  248138958/s    45%    11%     --

Ergo:

    karls-mac-mini:monks karl$ perl -e 'printf ("%.1f\n", 248138958/171232877);'
    1.4

As I wrote at Re^6: Question on Regex: "...use cmpthese, the results are sorted from slow to fast..."

Sorry in advance if I missed something.

Best regards, Karl
Re^3: Faster and more efficient way to read a file vertically by thanos1983 (Parson) on Nov 06, 2017 at 09:43 UTC
Hello karlgoethebier,

You are absolutely right. I also read Benchmark/Optional-Exports, where it is clearly stated:

    cmpthese ( COUNT, CODEHASHREF, [ STYLE ] )

    Optionally calls timethese(), then outputs comparison chart. This:

        cmpthese( -1, { a => "++\$i", b => "\$i = 2" } ) ;

    outputs a chart like:

               Rate    b    a
        b 2831802/s   -- -61%
        a 7208959/s 155%   --

This chart is sorted from slowest to fastest, and shows the percent speed difference between each pair of tests. cmpthese can also be passed the data structure that timethese() returns:

Thanks for correcting me, I will also update my answer. Although, to be honest, I am kind of impressed by how much slower unpack is in comparison to substr and split.

Thanks again for your time and effort, BR.
Re: Faster and more efficient way to read a file vertically by johngg (Canon) on Nov 05, 2017 at 15:37 UTC
I put together a benchmark for most of the suggested solutions (or adaptations of them to get consistent results) and ran tests against an inline dataset of 50 lines with Test::More, then with a 50,000 line file produced by this one-liner.

    perl -E '
        my @alpha = ( qw{ A C G T } ) x 5;
        push @alpha, qw{ . . };
        say join q{}, map { $alpha[ rand @alpha ] } 1 .. 50
            for 1 .. 50000;' > spw1202693.txt

Here's the script (collapsed in the original post, 7 kB).

And the results.

    ok 1 - ANDmask
    ok 2 - brutish
    ok 3 - pushAoA
    ok 4 - regex
    ok 5 - rsubstr
    ok 6 - seek
    ok 7 - split
    ok 8 - substr
    ok 9 - unpack
    ok 10 - unpackM
              Rate pushAoA brutish split  seek regex unpack substr rsubstr unpackM ANDmask
    pushAoA 1.11/s      --    -35%  -61%  -62%  -91%   -97%   -98%    -98%    -98%    -99%
    brutish 1.71/s     55%      --  -39%  -41%  -86%   -95%   -96%    -96%    -97%    -98%
    split   2.82/s    155%     65%    --   -3%  -77%   -92%   -94%    -94%    -95%    -97%
    seek    2.91/s    163%     70%    3%    --  -76%   -92%   -94%    -94%    -95%    -97%
    regex   12.3/s   1010%    617%  336%  322%    --   -65%   -74%    -75%    -79%    -88%
    unpack  35.0/s   3060%   1943% 1141% 1102%  185%     --   -25%    -27%    -40%    -67%
    substr  46.9/s   4137%   2638% 1564% 1512%  282%    34%     --     -3%    -20%    -55%
    rsubstr 48.2/s   4254%   2714% 1610% 1556%  292%    38%     3%      --    -18%    -54%
    unpackM 58.7/s   5194%   3321% 1979% 1914%  377%    68%    25%     22%      --    -44%
    ANDmask  105/s   9407%   6045% 3634% 3517%  757%   201%   124%    118%     80%      --
    1..10

The two substr solutions are neck and neck in the lead, unpack a distant third and everything else well behind. However, I have cocked up benchmarks before, so take this with a pinch of salt!

Update: Corrected attribution of the "unpack" method and incorporated the two new methods and benchmark results from this post. Working with multi-line buffers using unpack or a mask to AND with the buffer seems to be the fastest approach.

Cheers, JohnGG
Re^2: Faster and more efficient way to read a file vertically by vr (Curate) on Nov 05, 2017 at 17:22 UTC
Interesting. I had a similar partial synthetic benchmark yesterday and thought to publish it, mainly to advise against my "seek" solution as too slow, then decided not to :), because maybe it's not worth readers' effort. Nevertheless, here are somewhat different results for a 1 million line file and fast NVMe SSD storage. Below is the case for returning a hash with char counts, but it's similar for returning a string.

    $ perl vert2.pl
    ok 1 - same results
    ok 2 - same results
    ok 3 - same results
                (warning: too few iterations for a reliable count)
                (warning: too few iterations for a reliable count)
                (warning: too few iterations for a reliable count)
                (warning: too few iterations for a reliable count)
              Rate   seek    buk substr  slurp
    seek   0.920/s     --   -61%   -84%   -88%
    buk     2.36/s   157%     --   -58%   -69%
    substr  5.66/s   515%   140%     --   -26%
    slurp   7.69/s   736%   226%    36%     --
    1..3

The script itself was collapsed in the original post (2 kB).
Re^3: Faster and more efficient way to read a file vertically by marioroy (Prior) on Nov 05, 2017 at 21:35 UTC
The following provides a parallel version of the slurp routine. Running MCE via cmpthese reports inaccurately, with MCE appearing 300x faster, which is wrong; I'm not sure why or where to look. So I needed to benchmark another way.

Regarding MCE, workers receive the next chunk and tally using a local hash, then update the shared hash.

    use strict;
    use warnings;

    use MCE;
    use MCE::Shared;
    use String::Random 'random_regex';
    use Time::HiRes 'time';

    my $fn  = 'dna.txt';
    my $POS = 10;
    my $shrcount = MCE::Shared->hash();
    my $mce;

    unless ( -e $fn ) {
        open my $fh, '>', $fn;
        print $fh random_regex( '[ACTG]{42}' ), "\n" for 1 .. 1e6;
    }

    sub slurp {
        open my $fh, '<', $fn;
        my $s = do { local $/ = undef; <$fh> };
        my $count;
        $count-> { substr $s, $POS - 1 + 43 * $_, 1 }++
            for 0 .. length( $s ) / 43 - 1;
        return $count
    }

    sub mce {
        unless ( defined $mce ) {
            $mce = MCE->new(
                max_workers => 4,
                chunk_size  => '300k',
                use_slurpio => 1,
                user_func   => sub {
                    my ( $mce, $slurp_ref, $chunk_id ) = @_;
                    my ( $count, @todo );
                    $count-> { substr ${ $slurp_ref }, $POS - 1 + 43 * $_, 1 }++
                        for 0 .. length( ${ $slurp_ref } ) / 43 - 1;

                    # Each key involves one IPC trip to the shared-manager.
                    #
                    # $shrcount->incrby( $_, $count->{$_} )
                    #     for ( keys %{ $count } );

                    # The following is faster for smaller chunk size.
                    # Basically, send multiple commands at once.
                    #
                    push @todo, [ "incrby", $_, $count->{$_} ]
                        for ( keys %{ $count } );

                    $shrcount->pipeline( @todo );
                }
            )->spawn();
        }

        $shrcount->clear();
        $mce->process($fn);

        return $shrcount->export();
    }

    for (qw/ slurp mce /) {
        no strict 'refs';
        my $start = time();
        my $func = "main::$_";
        $func->() for 1 .. 3;
        printf "%5s: %0.03f secs.\n", $_, time() - $start;
    }

    __END__

    slurp: 0.487 secs.
      mce: 0.149 secs.
Re^2: Faster and more efficient way to read a file vertically by LanX (Saint) on Nov 05, 2017 at 21:48 UTC
> `unpack => sub { # Suggested but not implemented by pryrt`

Actually `unpack` was suggested (and not implemented) by me first. ;)

FWIW: My idea was to unpack multiple lines simultaneously instead of going line by line. If you are interested and all lines really have the same length (the OP never clarified):

- read a chunk of complete lines bigger than 4 or 8 kB (depending on the block size of the OS, to optimize read operations)
- run a repeated unpack pattern
- get a list of one result for each line of the chunk

Please see if substr on single lines is still faster then.

    $line_length += $newline_length;    # OS dependent
    $line_count   = int(8 * 1024 / $line_length) + 1;
    $chunk_size   = $line_count * $line_length;

And yes, I'm still reluctant to implement it, smells too much like an XY Problem :)

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!)

Update: In hindsight, having a slightly smaller chunk is probably more efficient: `$line_count = int(8 * 1024 / $line_length)`
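The idea was left unimplemented in the post, so the following is only one possible sketch of it, not the author's code. It sizes the chunk as in the update (`int(8 * 1024 / $line_length)` whole lines) and applies a repeated `unpack` group to each chunk. The sub name `column_by_chunks`, its argument order, and the decision to die on a trailing partial line are all my assumptions; it also assumes single-byte characters and a constant newline length.

```perl
use strict;
use warnings;

# Return the character at 0-based offset $pos of every fixed-length line.
# $line_len is the line length WITHOUT the newline, $nl_len the newline
# length (1 for "\n", 2 for "\r\n"). Both are assumed constant.
sub column_by_chunks {
    my ( $fh, $pos, $line_len, $nl_len ) = @_;

    my $rec_len    = $line_len + $nl_len;
    my $line_count = int( 8 * 1024 / $rec_len ) || 1;
    my $chunk_size = $line_count * $rec_len;    # whole lines only

    # Per record: skip $pos bytes, take one, skip the rest; repeat to the end.
    my $rest     = $rec_len - $pos - 1;
    my $template = "(x$pos a x$rest)*";

    my ( $buf, @column );
    while ( read $fh, $buf, $chunk_size ) {
        die "partial record at end of input\n" if length($buf) % $rec_len;
        push @column, unpack $template, $buf;
    }
    return @column;
}

# Small demo on an in-memory file: four 18-character lines that differ
# only at offset 10.
my $data = join '', map { "ACATCACCTC${_}CACAACG\n" } qw( C x s j );
open my $fh, '<', \$data or die $!;
print join( '', column_by_chunks( $fh, 10, 18, 1 ) ), "\n";    # prints "Cxsj"
```

For a real file, open it with `binmode` (or the `:raw` layer) so that `read` counts bytes and the record arithmetic stays valid.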
Re^3: Faster and more efficient way to read a file vertically by johngg (Canon) on Nov 06, 2017 at 00:36 UTC
> Actually unpack was suggested (and not implemented) by me first. ;)

Ah! Sorry, I missed that :-/

Cheers, JohnGG
Re^4: Faster and more efficient way to read a file vertically by LanX (Saint) on Nov 06, 2017 at 01:13 UTC
Re^5: Faster and more efficient way to read a file vertically by johngg (Canon) on Nov 06, 2017 at 17:32 UTC
Some notes below your chosen depth have not been shown here
Re: Faster and more efficient way to read a file vertically by vr (Curate) on Nov 03, 2017 at 17:48 UTC
If "same length", then here is something straightforward and perhaps not perlish; the idea originated before Discipulus's answer :). I wonder how inefficient this is compared to slurping/reading in large blocks, i.e. whether `read` and `seek` 'cooperate' on the input buffer (I don't know enough about the underlying C calls).

    use strict;
    use warnings;
    use autodie;

    my $POS = 10;

    open my $fh, '<', 'dna.txt';
    my $L = length( <$fh> ) - 1;
    seek $fh, $POS - 1, 0;

    my ( $s, $i ) = ( '', 0 );

    seek $fh, $L, 1 while read $fh, $s, 1, $i++;

    print "$s\n";
Re^2: Faster and more efficient way to read a file vertically by ForgotPasswordAgain (Priest) on Nov 03, 2017 at 20:58 UTC
FWIW, seek was my first thought, too. (Also that I'd prototype in Perl, then write the same thing in C. I might've found my weekend project... :) I can't imagine that allocating memory is going to help (I like when my imagination is challenged, though). I think at least if we can assume the file is in the filesystem cache, the read will be coming from RAM already anyway.
Re^3: Faster and more efficient way to read a file vertically by ForgotPasswordAgain (Priest) on Nov 03, 2017 at 21:12 UTC
I think this is parallelizable, too. If you have 24 cores, you can seek to $L/24, do your thing, combine results.
Re: Faster and more efficient way to read a file vertically by LanX (Saint) on Nov 03, 2017 at 15:15 UTC
> but this takes enormous amount of time

What does this mean? Maybe it's just file access on the HD? Please show some reference code.

> Any ideas?

You can slurp the whole file and run a regex on it, something like `my @col = ( $file =~ /^.{9}(.)/mg );` (corrected from an earlier version). Using `unpack` might be even faster, but I'm no expert here.

Cheers Rolf
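A minimal sketch of the slurp-and-regex suggestion, assuming the whole file fits in memory. The sub name `column_by_regex` and the 1-based column argument are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Slurp everything from $fh and return the character at 1-based column $n
# of every line. Lines shorter than $n simply do not match.
sub column_by_regex {
    my ( $fh, $n ) = @_;
    my $skip = $n - 1;
    my $file = do { local $/; <$fh> };    # undef $/ reads to end of file
    # /m lets ^ match at the start of every line; /g collects all captures.
    return $file =~ /^.{$skip}(.)/mg;
}

# Demo on an in-memory file holding the first 12 characters of two reads.
my $data = "ACATCACCTCCC\nACATCACCTACC\n";
open my $fh, '<', \$data or die $!;
print join( '', column_by_regex( $fh, 10 ) ), "\n";    # prints "CA"
```

Called in list context, the match returns one captured character per line, so the result can be joined into a string or tallied in a hash.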
Re^2: Faster and more efficient way to read a file vertically by Anonymous Monk on Nov 03, 2017 at 15:19 UTC
So basically I have this (brute-force attack):

    while (<>) {
        if ( $_ =~ /^(.*?)\t(.*)/ ) {
            $read_seq = $1;
            $read_id  = $2;
            @split_read = split( //, $read_seq );
            $respective_read_letter = $split_read[$i];
            if    ( $respective_read_letter eq 'A' ) { $count_A++; }
            elsif ( $respective_read_letter eq 'T' ) { $count_T++; }
            elsif ( $respective_read_letter eq 'C' ) { $count_C++; }
            elsif ( $respective_read_letter eq 'G' ) { $count_G++; }
            elsif ( $respective_read_letter eq '.' ) { $count_dot++; }
            else { print "ERROR in read: $read\t$respective_read_letter\n"; }
        }
    }

    $total = $count_A + $count_T + $count_C + $count_G + $count_dot;
    $fraction_A   = sprintf( "%.2f", 100 * ( $count_A / $total ) );
    $fraction_T   = sprintf( "%.2f", 100 * ( $count_T / $total ) );
    $fraction_C   = sprintf( "%.2f", 100 * ( $count_C / $total ) );
    $fraction_G   = sprintf( "%.2f", 100 * ( $count_G / $total ) );
    $fraction_dot = sprintf( "%.2f", 100 * ( $count_dot / $total ) );
    print $actual_pos, "\t", $expected_letter, "\t", $fraction_A, "\t", $fraction_T, "\t", $fraction_G, "\t", $fraction_C, "\t", $fraction_dot, "\n";
Re^3: Faster and more efficient way to read a file vertically by pryrt (Abbot) on Nov 03, 2017 at 16:15 UTC
If you're really only going to be doing one column, but want it to be chosen by the variable `$i`, I'd suggest substr: `$respective_read_letter = substr $read_seq, $i, 1;`. If finding an optimum solution is important to you (i.e., if you'll use this script many times for the foreseeable future, rather than just once or twice where "fast enough" is fast enough), then I'd recommend Benchmarking the substr vs unpack vs LanX's regex (and any others that are suggested). But whatever you do, make sure to use ++LanX's hash `%count`.

    use warnings;
    use strict;
    use Benchmark qw/cmpthese/;
    use Test::More tests => 1;

    my @dataset = ();
    push @dataset, join('', map { (qw/A C G T/)[rand 4] } 1 .. 30 ) for 1 .. 1000;

    my $i = $ARGV[0] // 10;

    sub test {
        my $fnref = shift;
        my $count;
        for my $read_seq ( @dataset ) {
            my $letter = $fnref->($read_seq, $i);
            $count->{$letter}++;
        }
        return $count;
    }

    sub rfn {
        test( sub {
            my $skip = $_[1];
            $_[0] =~ /.{$skip}(.)/;
            return $1;
        });
    };

    sub sfn {
        test( sub {
            substr $_[0], $_[1], 1;
        });
    };

    sub ufn {
        test( sub {
            ...    # I'm no unpack expert
        });
    };

    cmpthese(0, {
        regex  => \&rfn,
        substr => \&sfn,
        #unpack => \&ufn,
    });

    is_deeply rfn(), sfn(), 'same results';
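The `ufn` stub above is left open. One way to fill it in, offered as a sketch rather than as pryrt's intended code, is the `x`/`a` template that thanos1983's `getn_unpack` uses elsewhere in this thread, adjusted to take a 0-based offset like the substr version. The names `$by_unpack` and `$by_substr` are hypothetical.

```perl
use strict;
use warnings;

# Candidate body for ufn(): same arguments as the substr version,
# i.e. ( $string, $zero_based_offset ).
my $by_unpack = sub {
    my ( $str, $skip ) = @_;
    # "x$skip" skips $skip bytes, "a" then takes exactly one byte.
    return unpack "x$skip a", $str;
};

my $by_substr = sub { substr $_[0], $_[1], 1 };

# Compare both extractors on the OP's sample sequence.
my $seq = 'ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA';
for my $i ( 0, 9, 10, 39 ) {
    printf "%2d: unpack=%s substr=%s\n",
        $i, $by_unpack->( $seq, $i ), $by_substr->( $seq, $i );
}
```

Note that `x` dies if asked to skip past the end of the string, while `substr` merely warns and returns undef, so the two only agree for offsets inside the line.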
Re^3: Faster and more efficient way to read a file vertically by LanX (Saint) on Nov 03, 2017 at 15:26 UTC
`$i` is variable in your example. Reading vertically doesn't make sense then. I'd suggest `$count{$letter}++` with a hash `%count` to speed things up.

Cheers Rolf
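As a sketch of the `%count` suggestion combined with `substr`, the loop below tallies whatever letter appears at a given offset and then prints percentages, replacing the chain of `if`/`elsif` counters. The sub name `count_column` and the choice to skip lines that are too short are my assumptions.

```perl
use strict;
use warnings;

# Tally the letters found at 0-based offset $i, one line at a time,
# so memory use does not grow with the size of the file.
sub count_column {
    my ( $fh, $i ) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if length($line) <= $i;    # skip lines that are too short
        $count{ substr $line, $i, 1 }++;
    }
    return \%count;
}

# Demo on a tiny in-memory file.
my $data = "ACATCACCTC\nACATCACCTA\nACATCACCT.\nACATCACCTC\n";
open my $fh, '<', \$data or die $!;
my $count = count_column( $fh, 9 );

my $total = 0;
$total += $_ for values %$count;
printf "%s\t%.2f\n", $_, 100 * $count->{$_} / $total for sort keys %$count;
```

Unexpected letters show up as extra keys in the hash, so the separate error branch is no longer needed to notice them.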
Re: Faster and more efficient way to read a file vertically by Anonymous Monk on Nov 04, 2017 at 10:03 UTC
If speed is of high priority, one shouldn't overlook the mmap() approach using File::Map. It has its limitations (no piped data) but it allows regular files to be efficiently handled as one big string.
Re: Faster and more efficient way to read a file vertically by wazat (Monk) on Nov 04, 2017 at 18:56 UTC
OOPS, I see that vr already identified this approach.

If your lines are really all the same length, you could do the job via a seek() / read() loop. The example below needs error checking. I haven't done any speed tests.

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $linesep_len = length($/);
    my $rec_len = length('ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA') + $linesep_len;
    my $read_len = 1;
    my $skip_len = $rec_len - $read_len;

    binmode(DATA);

    seek(DATA, 10, 1) or die "seek error";
    my $buf = ' ' x $read_len;
    while (read(DATA, $buf, $read_len) > 0) {
        print $buf, "\n";
        seek(DATA, $skip_len, 1) or last;
    }

    __DATA__
    ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    ACATCACCTCxCACAACGAGGACTACACCATCGTGGAACA
    ACATCACCTCsCACAACGAGGACTACACCATCGTGGAACA
    ACATCACCTCjCACAACGAGGACTACACCATCGTGGAACA
    ACATCACCTCcCACAACGAGGACTACACCATCGTGGAACA
    ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
    ACATCACCTC-CACAACGAGGACTACACCATCGTGGAACA

Output:

    C
    x
    s
    j
    c
    C
    C
    C
    C
    -
Re: Faster and more efficient way to read a file vertically by Anonymous Monk on Nov 04, 2017 at 18:43 UTC
Congrats on the new job!
Re^2: Faster and more efficient way to read a file vertically by karlgoethebier (Abbot) on Nov 06, 2017 at 09:36 UTC
"...new job" Fake News.
