Re: Faster and more efficient way to read a file vertically
by BrowserUk (Patriarch) on Nov 03, 2017 at 20:12 UTC
|
If you make an array of substr references to the characters in a buffer, and then overlay each line onto that buffer, the cost of splitting/indexing the string is paid only once:
#! perl -slw
use strict;
my $c = $ARGV[ 0 ] // 25;
my $buf = chr(0) x 62;
my @cRefs = map \substr( $buf, $_, 1 ), 0 .. length( $buf )-1;
until( eof( DATA ) ) {
substr( $buf, 0 ) = <DATA>;
print ${ $cRefs[ $c ] };
}
__DATA__
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz
A few runs:
C:\test>1202693 0
A
A
A
A
C:\test>1202693 25
Z
Z
Z
Z
C:\test>1202693 32
6
6
6
6
C:\test>1202693 61
z
z
z
z
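The mechanism can be seen in isolation in a minimal sketch. The 5-character buffer, the sample lines, and the names $ref and @seen are made up for illustration; the point is only that a reference to a substr lvalue is evaluated against the buffer's current contents each time it is dereferenced.

```perl
use strict;
use warnings;

# Illustrative only: a 5-character buffer and three made-up lines.
my $buf = 'x' x 5;

# Reference to the lvalue substr covering the character at offset 2.
my $ref = \substr( $buf, 2, 1 );

my @seen;
for my $line (qw( ABCDE FGHIJ KLMNO )) {
    substr( $buf, 0 ) = $line;    # overlay the new line onto the buffer
    push @seen, ${$ref};          # dereferencing reads the current character
}
print "@seen\n";
```

The reference is created once, outside the loop; inside the loop only the overlay and the dereference remain.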
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the absence of evidence, opinion is indistinguishable from prejudice.
Suck that fhit
| [reply] [d/l] [select] |
Re: Faster and more efficient way to read a file vertically
by choroba (Cardinal) on Nov 03, 2017 at 15:23 UTC
|
My cut (GNU 8.25) also supports the -c and -b options to only print the given character or byte range, respectively.
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
| [reply] [d/l] [select] |
|
Great, I also saw it now!
So basically I can say cut -c 10 and get the 10th character. Thank you very much!
| [reply] [d/l] |
Re: Faster and more efficient way to read a file vertically
by Laurent_R (Canon) on Nov 03, 2017 at 18:27 UTC
|
This is a perl one-liner doing just what you want:
$ echo 'ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
> ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
> ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
> ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
> ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
> ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
> ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
> ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
> ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
> ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
> ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
> ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
> ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA' | perl -nE 'say substr($_, 10, 1);'
C
C
C
C
C
C
C
C
C
C
C
C
C
Check, though, that 10 is the right second parameter for substr; you may have to change it depending on exactly which character you want.
| [reply] [d/l] |
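To make the off-by-one concrete, here is a small sketch with a made-up line (the alphabet string and the variable names are assumptions for illustration). substr offsets are 0-based, while cut -c positions are 1-based, so the 10th character is at offset 9.

```perl
use strict;
use warnings;

# Made-up line in which every position holds a distinct character.
my $line = 'ABCDEFGHIJKLMNOP';

# substr offsets are 0-based; cut -c positions are 1-based.
my $tenth    = substr $line, 9,  1;    # 10th character, like cut -c 10
my $eleventh = substr $line, 10, 1;    # 11th character, like cut -c 11

print "10th: $tenth, 11th: $eleventh\n";
```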
Re: Faster and more efficient way to read a file vertically -- updated
by Discipulus (Canon) on Nov 03, 2017 at 16:09 UTC
|
Hello,
A million lines will probably still fit in memory. Note that $#{$aoa[0]} assumes all lines are of the same length, as you said.
use strict;
use warnings;
my @aoa;
while (<DATA>) {
chomp;
push @aoa,[split '',$_];
}
foreach my $col(0..$#{$aoa[0]}){
print "Column $col: ",
(join ' ',map { $aoa[$_][$col] } 0..$#aoa),
"\n";
}
__DATA__
ACATCACCTC
ACATCACCTC
ACATCACCTC
ACATCACCTC
# out
Column 0: A A A A
Column 1: C C C C
Column 2: A A A A
Column 3: T T T T
Column 4: C C C C
Column 5: A A A A
Column 6: C C C C
Column 7: C C C C
Column 8: T T T T
Column 9: C C C C
L*
UPDATE: if you really care about memory, you can try the following (*untested*) approach:
# untested sketch; assumes $fh is an already opened read filehandle
# analyze the first line
my $line = <$fh>;
chomp $line;
# last valid index in a line (beware of possible off-by-one errors!)
my $last = length($line) - 1;
# rewind the filehandle
seek $fh, 0, 0;
sub get_column {
    my ( $col, $line ) = @_;
    # columns are 0-based; anything outside 0..$last has no character
    return undef if $col < 0 or $col > $last;
    # skip $col characters from the start, then capture the next one
    return $line =~ /^.{$col}(.)/ ? $1 : undef;
}
while (<$fh>) {
    chomp;
    print get_column( 3, $_ ), "\n";
}
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
| [reply] [d/l] [select] |
|
> million of lines still probably fit in memory.
Maybe. Or maybe not. But why take the chance? Especially with an AoA which has some extra cost. It is so easy to do everything in the first loop, when reading each line. And BTW, it is also probably faster, because using an array of arrays implies copying the data once more.
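A minimal sketch of doing the work in the reading loop, assuming what is ultimately wanted is per-column character counts. The in-memory sample data and the name @count are made up for illustration; memory use then depends on the line length, not on the number of lines.

```perl
use strict;
use warnings;

# Stand-in for the real file: three made-up lines read through an
# in-memory filehandle.
my $data = "ACGT\nACGA\nTCGA\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

# $count[$col]{$char} = number of lines with $char in column $col
my @count;
while ( my $line = <$fh> ) {
    chomp $line;
    $count[$_]{ substr $line, $_, 1 }++ for 0 .. length($line) - 1;
}
close $fh;

for my $col ( 0 .. $#count ) {
    my $tally = join ' ',
        map { "$_=$count[$col]{$_}" } sort keys %{ $count[$col] };
    print "Column $col: $tally\n";
}
```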
| [reply] |
Re: Faster and more efficient way to read a file vertically
by karlgoethebier (Abbot) on Nov 04, 2017 at 11:14 UTC
|
#!/usr/bin/env perl
# http://www.perlmonks.org/?node_id=1202693
# $Id: loop.pl,v 1.2 2017/11/04 11:02:41 karl Exp karl $
use strict;
use warnings;
use MCE::Loop;
use Time::HiRes qw( time );
use feature qw(say);
my $file = q(data.txt);
MCE::Loop::init( { max_workers => 4, use_slurpio => 1 } );
my $start = time;
my @result = mce_loop_f {
my $slurp_ref = $_[1];
my @column;
open my $fh, '<', $slurp_ref;
binmode $fh, ':raw';
while (<$fh>) { push @column, substr( $_, 10, 1 ) }
close $fh;
MCE->gather(@column);
# sleep 2;
}
$file;
say join( '', @result );
printf "Took %.3f seconds\n", time - $start;
__END__
Thanks to marioroy.
See also MCE.
Update: To avoid the call to binmode please see Encoding horridness revisited: What's going on here? [SOLVED].
Regards, Karl
«The Crux of the Biscuit is the Apostrophe»
perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help
| [reply] [d/l] [select] |
Re: Faster and more efficient way to read a file vertically
by thanos1983 (Parson) on Nov 03, 2017 at 19:03 UTC
|
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use feature 'say';
# use Benchmark qw(:all) ; # WindowsOS
use Benchmark::Forking qw( timethese cmpthese ); # UnixOS
sub getn_unpack {
return unpack "x" . ($_[1]-1) . "a", $_[0];
}
sub getn_substr {
return substr $_[0], $_[1]-1, 1;
}
sub getn_split {
return +(split //, $_[0])[$_[1]-1];
}
my $strNum = "12345678910";
my $string = "ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA";
# say getn_unpack($string, 10);
# say getn_substr($string, 10);
# say getn_split($string, 10);
my $results = timethese(1000000000, { 'unpack' => getn_unpack($string, 10),
'substr' => getn_substr($string, 10),
'split' => getn_split($string, 10),
}, 'none');
cmpthese( $results );
__END__
$ perl test.pl
              Rate unpack substr  split
unpack 171232877/s     --   -23%   -31%
substr 223713647/s    31%     --   -10%
split  248138958/s    45%    11%     --
It looks like the more efficient choice would be to use unpack. Something like that could do what you need. Reading one line at a time, extract the data that you want (one character) and finally push it into an array. Sample of code below:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
sub getn_unpack {
return unpack "x" . ($_[1]-1) . "a", $_[0];
}
my $file = 'data.txt';
my @array;
if (open(my $fh, '<', $file)) {
while (<$fh>) {
chomp;
push @array, getn_unpack($_, 10);
}
} else {
warn "Could not open file '$file' $!\n";
}
print Dumper \@array;
__END__
$ cat data.txt
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTACCACAACGAGGACTACACCATCGTGGAACA
$ perl test.pl
$VAR1 = [
'C',
'A'
];
Update: Thanks to fellow Monk karlgoethebier for spotting my mistake. I would suggest an alternative solution to your problem: use split instead of unpack. See the sample code below:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
sub getn_split {
return +(split //, $_[0])[$_[1]-1];
}
my $file = 'data.txt';
my @array;
if (open(my $fh, '<', $file)) {
while (<$fh>) {
chomp;
push @array, getn_split($_, 10);
}
} else {
warn "Could not open file '$file' $!\n";
}
print Dumper \@array;
__END__
$ cat data.txt
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTACCACAACGAGGACTACACCATCGTGGAACA
$ perl test.pl
$VAR1 = [
'C',
'A'
];
Hope this helps, BR
Seeking for Perl wisdom...on the process of learning...not there...yet!
| [reply] [d/l] [select] |
|
$ perl test.pl
              Rate unpack substr  split
unpack 171232877/s     --   -23%   -31%
substr 223713647/s    31%     --   -10%
split  248138958/s    45%    11%     --
Ergo:
karls-mac-mini:monks karl$ perl -e 'printf ("%.1f\n", 248138958/171232877);'
1.4
As I wrote at Re^6: Question on Regex:
"...use cmpthese, the results are sorted from slow to fast..."
Sorry in advance if I did something wrong or missed something.
Best regards, Karl
| [reply] [d/l] [select] |
|
cmpthese ( COUNT, CODEHASHREF, [ STYLE ] )
Optionally calls timethese(), then outputs comparison chart. This:
cmpthese( -1, { a => "++\$i", b => "\$i *= 2" } ) ;
outputs a chart like:
       Rate    b    a
b 2831802/s   -- -61%
a 7208959/s 155%   --
This chart is sorted from slowest to fastest, and shows the percent speed difference between each pair of tests.
cmpthese can also be passed the data structure that timethese() returns:
Thanks for correcting me; I will also update my answer. Although, to be honest, I am kind of impressed by how much slower unpack is compared to substr and split.
Thanks again for your time and effort, BR.
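For reference, Benchmark's timethese and cmpthese expect each hash value to be a code reference (or a string of code to eval). A minimal sketch that wraps each call in an anonymous sub is shown here; the iteration count of 200_000 is an arbitrary choice, and the rates will differ from machine to machine.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

my $string = 'ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA';
my $n      = 10;    # 1-based position, as in the functions above

sub getn_unpack { return unpack 'x' . ( $_[1] - 1 ) . 'a', $_[0] }
sub getn_substr { return substr $_[0], $_[1] - 1, 1 }
sub getn_split  { return +( split //, $_[0] )[ $_[1] - 1 ] }

# Each value is a code reference, so the call itself is what gets timed.
cmpthese( 200_000, {
    unpack => sub { my $c = getn_unpack( $string, $n ) },
    substr => sub { my $c = getn_substr( $string, $n ) },
    split  => sub { my $c = getn_split( $string, $n ) },
} );
```

The chart it prints is sorted from slowest to fastest, as the documentation quoted above says.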
| [reply] [d/l] [select] |
Re: Faster and more efficient way to read a file vertically
by johngg (Canon) on Nov 05, 2017 at 15:37 UTC
|
I put together a benchmark for most of the suggested solutions (or adaptations of them to get consistent results) and ran tests against an inline dataset of 50 lines with Test::More then with a 50,000 line file produced by this one-liner.
perl -E '
my @alpha = ( qw{ A C G T } ) x 5;
push @alpha, qw{ . . };
say join q{}, map { $alpha[ rand @alpha ] } 1 .. 50
for 1 .. 50000;' > spw1202693.txt
Here's the script.
And the results.
ok 1 - ANDmask
ok 2 - brutish
ok 3 - pushAoA
ok 4 - regex
ok 5 - rsubstr
ok 6 - seek
ok 7 - split
ok 8 - substr
ok 9 - unpack
ok 10 - unpackM
           Rate pushAoA brutish   split    seek   regex  unpack  substr rsubstr unpackM ANDmask
pushAoA  1.11/s      --    -35%    -61%    -62%    -91%    -97%    -98%    -98%    -98%    -99%
brutish  1.71/s     55%      --    -39%    -41%    -86%    -95%    -96%    -96%    -97%    -98%
split    2.82/s    155%     65%      --     -3%    -77%    -92%    -94%    -94%    -95%    -97%
seek     2.91/s    163%     70%      3%      --    -76%    -92%    -94%    -94%    -95%    -97%
regex    12.3/s   1010%    617%    336%    322%      --    -65%    -74%    -75%    -79%    -88%
unpack   35.0/s   3060%   1943%   1141%   1102%    185%      --    -25%    -27%    -40%    -67%
substr   46.9/s   4137%   2638%   1564%   1512%    282%     34%      --     -3%    -20%    -55%
rsubstr  48.2/s   4254%   2714%   1610%   1556%    292%     38%      3%      --    -18%    -54%
unpackM  58.7/s   5194%   3321%   1979%   1914%    377%     68%     25%     22%      --    -44%
ANDmask   105/s   9407%   6045%   3634%   3517%    757%    201%    124%    118%     80%      --
1..10
The two substr solutions are neck and neck in the lead, unpack a distant third and everything else well behind. However, I have cocked up benchmarks before so take this with a pinch of salt!
Update: Corrected attribution of the "unpack" method and incorporated the two new methods and benchmark results from this post. Working with multi-line buffers using unpack or a mask to AND with the buffer seems to be the fastest approach.
| [reply] [d/l] [select] |
|
Interesting. I ran a similar, partly synthetic benchmark yesterday and thought of publishing it, mainly to advise against my "seek" solution as too slow, but then decided not to :), because it may not be worth readers' effort.
Nevertheless, I get somewhat different results for a 1-million-line file on fast NVMe SSD storage. Below is the case for returning a hash of character counts, but it's similar for returning a string.
$ perl vert2.pl
ok 1 - same results
ok 2 - same results
ok 3 - same results
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
          Rate   seek    buk substr  slurp
seek   0.920/s     --   -61%   -84%   -88%
buk     2.36/s   157%     --   -58%   -69%
substr  5.66/s   515%   140%     --   -26%
slurp   7.69/s   736%   226%    36%     --
1..3
| [reply] [d/l] [select] |
|
The following provides a parallel version of the slurp routine. I'm not sure why, or where to look, but running MCE via cmpthese reports inaccurately, showing MCE as 300x faster, which is wrong. So I needed to benchmark another way.
Regarding MCE, workers receive the next chunk and tally using a local hash. Then, update the shared hash.
use strict;
use warnings;
use MCE;
use MCE::Shared;
use String::Random 'random_regex';
use Time::HiRes 'time';
my $fn = 'dna.txt';
my $POS = 10;
my $shrcount = MCE::Shared->hash();
my $mce;
unless ( -e $fn ) {
open my $fh, '>', $fn;
print $fh random_regex( '[ACTG]{42}' ), "\n"
for 1 .. 1e6;
}
sub slurp {
open my $fh, '<', $fn;
my $s = do { local $/ = undef; <$fh> };
my $count;
$count-> { substr $s, $POS - 1 + 43 * $_, 1 }++
for 0 .. length( $s ) / 43 - 1;
return $count
}
sub mce {
unless ( defined $mce ) {
$mce = MCE->new(
max_workers => 4,
chunk_size => '300k',
use_slurpio => 1,
user_func => sub {
my ( $mce, $slurp_ref, $chunk_id ) = @_;
my ( $count, @todo );
$count-> { substr ${ $slurp_ref }, $POS - 1 + 43 * $_, 1 }++
for 0 .. length( ${ $slurp_ref } ) / 43 - 1;
# Each key involves one IPC trip to the shared-manager.
#
# $shrcount->incrby( $_, $count->{$_} )
# for ( keys %{ $count } );
# The following is faster for smaller chunk size.
# Basically, send multiple commands at once.
#
push @todo, [ "incrby", $_, $count->{$_} ]
for ( keys %{ $count } );
$shrcount->pipeline( @todo );
}
)->spawn();
}
$shrcount->clear();
$mce->process($fn);
return $shrcount->export();
}
for (qw/ slurp mce /) {
no strict 'refs';
my $start = time();
my $func = "main::$_";
$func->() for 1 .. 3;
printf "%5s: %0.03f secs.\n", $_, time() - $start;
}
__END__
slurp: 0.487 secs.
mce: 0.149 secs.
| [reply] [d/l] |
|
> unpack => sub { # Suggested but not implemented by pryrt
Actually unpack was suggested (and not implemented) by me first. ;)
FWIW: My idea was to unpack multiple lines simultaneously instead of going line by line.
If you are interested and all lines really have the same length (the OP never clarified)
- read a chunk of complete lines bigger than 4 or 8 KB (depending on the block size of the OS, to optimize read operations)
- run a repeated unpack pattern
- get a list of 1 result for each chunk line
Please see if substr on single lines is still faster then.
$line_length += $newline_length; # OS dependent
$line_count = int(8 * 1024 / $line_length) + 1;
$chunk_size = $line_count * $line_length;
And yes I'm still reluctant to implement it, smells too much like an XY Problem :)
update
In hindsight... probably having a slightly smaller chunk is more efficient :
$line_count = int(8 * 1024 / $line_length);
| [reply] [d/l] [select] |
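A minimal sketch of the repeated unpack pattern, assuming every line in the chunk has the same length and ends in a 1-byte newline. The sample chunk and the names $col, $rest and $template are made up for illustration; a real chunk would come from a read of $chunk_size bytes.

```perl
use strict;
use warnings;

# Made-up sample: three 10-character lines, each ending in "\n".
my $chunk       = "ACATCACCTC\nACGTCACCTC\nACTTCACCTC\n";
my $line_length = 11;    # 10 characters plus a 1-byte newline (assumed)
my $col         = 2;     # 0-based column to extract

# Per line: skip $col bytes, take one, skip the rest including the newline.
my $rest     = $line_length - $col - 1;
my $template = "(x$col a x$rest)*";

# The group repeats until the chunk is exhausted: one result per line.
my @column = unpack $template, $chunk;
print "@column\n";
```

Note that the chunk must consist of complete lines: an x that runs past the end of the data makes unpack die.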
Re: Faster and more efficient way to read a file vertically
by vr (Curate) on Nov 03, 2017 at 17:48 UTC
|
If the lines really are the "same length", here is a straightforward, perhaps not very perlish approach (the idea predates Discipulus's answer :). I wonder how inefficient this is compared to slurping or reading in large blocks, i.e. whether read and seek 'cooperate' on the input buffer (I don't know enough about the underlying C calls).
use strict;
use warnings;
use autodie;
my $POS = 10;
open my $fh, '<', 'dna.txt';
my $L = length( <$fh> ) - 1;
seek $fh, $POS - 1, 0;
my ( $s, $i ) = ( '', 0 );
seek $fh, $L, 1
while read $fh, $s, 1, $i++;
print "$s\n";
| [reply] [d/l] [select] |
|
FWIW, seek was my first thought, too. (Also that I'd prototype in Perl, then write the same thing in C. I might've found my weekend project... :) I can't imagine that allocating memory is going to help (I like when my imagination is challenged, though). I think at least if we can assume the file is in filesystem cache the read will be coming from RAM already anyway.
| [reply] |
|
I think this is parallelizable, too. If you have 24 cores, you can seek to $L/24, do your thing, combine results.
| [reply] |
Re: Faster and more efficient way to read a file vertically
by LanX (Saint) on Nov 03, 2017 at 15:15 UTC
|
> but this takes enormous amount of time
what does this mean?
Maybe it's just file access on the HD?
Please show some reference code.
> Any ideas?
You can slurp the whole file and run a regex ... something like @col10 = /^.{9}(.)/g on it (with the appropriate /s or /m modifier of course)
corrected my @col = ( $file =~ /^.{9}(.)/mg );
Using unpack might be even faster, but I'm no expert here.
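A minimal sketch of the slurp-and-regex idea, with a made-up string standing in for the slurped file. With /m, ^ matches at the start of every line, and /g in list context returns every capture; without /s, the dot does not match a newline, so lines shorter than 10 characters are simply skipped.

```perl
use strict;
use warnings;

# Stand-in for the slurped file contents (made-up data).
my $file = "ACATCACCTCCC\nACATCACCTGCC\nACATCACCTTCC\n";

# Skip 9 characters from the start of each line, capture the 10th.
my @col = ( $file =~ /^.{9}(.)/mg );
print "@col\n";
```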
| [reply] [d/l] [select] |
|
So basically I have this (brute-force attack):
while(<>)
{
if($_=~/^(.*?)\t(.*)/)
{
$read_seq=$1;
$read_id=$2;
@split_read=split(//, $read_seq);
$respective_read_letter=$split_read[$i];
if($respective_read_letter eq 'A')
{$count_A++;}
elsif($respective_read_letter eq 'T')
{$count_T++;}
elsif($respective_read_letter eq 'C')
{$count_C++;}
elsif($respective_read_letter eq 'G')
{$count_G++;}
elsif($respective_read_letter eq '.')
{$count_dot++;}
else
{print "ERROR in read: $read\t$respective_read_letter\n";}
}
}
$total=$count_A+$count_T+$count_C+$count_G+$count_dot;
$fraction_A = sprintf("%.2f", 100*($count_A/$total));
$fraction_T = sprintf("%.2f", 100*($count_T/$total));
$fraction_C = sprintf("%.2f", 100*($count_C/$total));
$fraction_G = sprintf("%.2f", 100*($count_G/$total));
$fraction_dot = sprintf("%.2f", 100*($count_dot/$total));
print $actual_pos,"\t",$expected_letter,"\t",$fraction_A,"\t",$fraction_T,"\t",$fraction_G,"\t",$fraction_C,"\t",$fraction_dot,"\n";
| [reply] [d/l] |
|
If you're really only going to be doing one column, but want it to be chosen by the variable $i,
I'd suggest substr: $respective_read_letter = substr $read_seq, $i, 1;. If finding an optimum solution
is important to you (i.e., if you'll use this script many times for the foreseeable future, rather than just once or twice
where "fast enough" is fast enough), then I'd recommend benchmarking substr vs. unpack vs.
LanX's regex (and any others that are suggested). But whatever you do, make sure to use ++LanX's hash %count.
use warnings;
use strict;
use Benchmark qw/cmpthese/;
use Test::More tests => 1;
my @dataset = ();
push @dataset, join('', map { (qw/A C G T/)[rand 4] } 1 .. 30 ) for 1 .. 1000;
my $i = $ARGV[0] // 10;
sub test {
my $fnref = shift;
my $count;
for my $read_seq( @dataset ) {
my $letter = $fnref->($read_seq, $i);
$count->{$letter}++;
}
return $count;
}
sub rfn {
test( sub {
my $skip = $_[1];
$_[0] =~ /.{$skip}(.)/;
return $1;
});
};
sub sfn {
test( sub {
substr $_[0], $_[1], 1;
});
};
sub ufn {
test( sub {
... # I'm no unpack expert
});
};
cmpthese(0, {
regex => \&rfn,
substr => \&sfn,
#unpack => \&ufn,
});
is_deeply rfn(), sfn(), 'same results';
| [reply] [d/l] [select] |
Re: Faster and more efficient way to read a file vertically
by Anonymous Monk on Nov 04, 2017 at 10:03 UTC
|
If speed is of high priority, one shouldn't overlook the mmap() approach using File::Map. It has its limitations (no piped data) but it allows regular files to be efficiently handled as one big string.
| [reply] |
Re: Faster and more efficient way to read a file vertically
by wazat (Monk) on Nov 04, 2017 at 18:56 UTC
|
OOPS, I see that vr already identified this approach.
If your lines are really all the same length, you could do the job via a seek() / read() loop. The example below needs error checking. I haven't done any speed tests.
#!/usr/bin/perl
use strict;
use warnings;
my $linesep_len = length($/);
my $rec_len = length('ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA') + $linesep_len;
my $read_len = 1;
my $skip_len = $rec_len - $read_len;
binmode(DATA);
seek(DATA, 10, 1) or die "seek error";
my $buf = ' ' x $read_len;
while (read(DATA, $buf, $read_len) > 0) {
print $buf, "\n";
seek(DATA, $skip_len, 1) or last;
}
__DATA__
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCxCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCsCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCjCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCcCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTCCCACAACGAGGACTACACCATCGTGGAACA
ACATCACCTC-CACAACGAGGACTACACCATCGTGGAACA
Output:
C
x
s
j
c
C
C
C
C
-
| [reply] [d/l] [select] |
Re: Faster and more efficient way to read a file vertically
by Anonymous Monk on Nov 04, 2017 at 18:43 UTC
|
| [reply] |
|
"...new job"
Fake News.
| [reply] [d/l] |