PerlMonks

character-by-character in a huge file

by mushnik (Acolyte)
on Apr 08, 2004 at 17:21 UTC ( #343672=perlquestion )
mushnik has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks -

I'm writing a script that makes one pass through a very large (up to 3 GB) file, and needs to keep track of the first and last characters in a sliding window of constant small size (e.g. 500).

Walking the filehandle with getc(FH) (actually, two handles - one trailing the other by 500 bytes) is pretty slow. Line by line is a lot faster if I don't try to look at individual characters...but then I've got to index into the line with substr, which slows things down even more. Reading each line into a character array with split is dog slow.
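For concreteness, the two-handle getc version looks roughly like this (a sketch on a tiny throwaway file, with a window of 3 instead of 500; the filename is made up):

```perl
use strict;
use warnings;

# Tiny stand-in for the real 3 GB file:
my $file = 'window_demo.txt';
open(my $out, '>', $file) or die $!;
print $out 'abcdefghij';
close $out;

my $window = 3;    # 500 in the real script
open(my $lead,  '<', $file) or die $!;
open(my $trail, '<', $file) or die $!;

# Prime: advance LEAD so it runs $window chars ahead of TRAIL.
getc($lead) for 1 .. $window;

my @pairs;
until (eof($lead)) {
    my $enter = getc($lead);     # char entering the window
    my $leave = getc($trail);    # char leaving the window
    push @pairs, "$leave...$enter";
}
close $lead;
close $trail;
unlink $file;

print "@pairs\n";    # a...d b...e c...f d...g e...h f...i g...j
```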

The format of the file I'm hitting allows me to grab very large chunks (10-50 thousand characters each) by setting $/=">", because ">" appears at regular, distant intervals. But indexing into the string captured from the file that way with substr is amazingly slow. Again, pushing the string into an array (where indexing is fast) with split is even worse.

Any suggestions about how to take advantage of the fact that Perl is perfectly fast at reading in large hunks of the file into a string, without having to index into that string with substr?

In short, can you beat the time I get reading a huge file by using:

  open (FH, "<ciona_fasta_two");
  until (eof(FH)) {
    $ch = getc(FH);
    if ($ch eq 'x'){
      #do something;
    }
  }
  close FH;
(It should be possible to beat it by *a lot*)

Thanks, Travis

Re: character-by-character in a huge file
by eric256 (Parson) on Apr 08, 2004 at 17:35 UTC

    I'm a bit confused by your definition there. Do you need to remember the whole window of 500 characters, or just the first and last? Have you tried keeping an array and pushing single characters onto one end and then off the other? That seems to me to be the easiest way to keep track of a window of characters. I don't think that character-by-character comparisons on a huge file will ever be very fast. If you use an array, though, you can read by line and then split it. That way you can take advantage of Perl's file IO buffering.
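    Roughly what I have in mind (an untested sketch, with a window of 5 and an inline string standing in for a line read from the file):

```perl
use strict;
use warnings;

my $size = 5;                    # 500 in the real program
my $line = 'GATTACAGATTACA';     # stand-in for a line read from the file
my @window;

for my $ch (split //, $line) {
    push @window, $ch;                    # new char enters at one end
    shift @window if @window > $size;     # old char drops off the other
    # Once full, $window[0] and $window[-1] are the first and
    # last characters of the current 5-char window.
}
```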


    ___________
    Eric Hodges
      Have you tried keeping an array and pushing single characters onto one end and then off the other? That seems to me to be the easiest way to keep track of a window of characters.

      No.

Re: character-by-character in a huge file
by hardburn (Abbot) on Apr 08, 2004 at 17:43 UTC

    Try writing a quick program that reads the file in whatever way your main program does. Just read it in and nothing else. Keep track of the run time. Then compare it to the runtime of the real program. I bet there isn't a great deal of difference. The reason is that simply reading 3GB of data using any method is going to be quite slow.
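    Something like this for the baseline (sketched here against a small scratch file with a made-up name; point $file at the real 3 GB one):

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# How fast can we just read the file, doing nothing else?
sub drain {
    my ($file) = @_;
    my $bytes = 0;
    open(my $fh, '<', $file) or die "$file: $!";
    local $/ = \65536;               # fixed 64 KB records
    while (my $chunk = <$fh>) {
        $bytes += length $chunk;     # touch the data, nothing more
    }
    close $fh;
    return $bytes;
}

# Demo on a small scratch file; use the real file for the real baseline.
my $file = 'drain_demo.txt';
open(my $out, '>', $file) or die $!;
print $out 'x' x 100_000;
close $out;

my $t0    = time;
my $bytes = drain($file);
printf "read %d bytes in %.4f s\n", $bytes, time - $t0;
unlink $file;
```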

    ----
    : () { :|:& };:

    Note: All code is untested, unless otherwise stated

      I have to disagree; that code will execute about 14 ops per character (or about 42000000000 ops total); even the overhead of perl's op dispatch loop will be huge, much less the actual code of the ops themselves. Simulating just the basic overhead with perl -we'1 for 1..14000000000' (3 ops per loop iteration) runs about 36 minutes on my system; this is substantially longer than it would take to read through 3Gb.
Re: character-by-character in a huge file
by ccn (Vicar) on Apr 08, 2004 at 17:49 UTC
    $/ = \10000000;
    open (FH, "<ciona_fasta_two");
    while (<FH>) {
        my $i = -1;
        while ( -1 < ($i = index($_, 'x', $i+1)) ) {
            # do something;
        }
    }
    close FH;
Re: character-by-character in a huge file
by Zaxo (Archbishop) on Apr 08, 2004 at 17:51 UTC

    You can read your whole window at once by setting $/ to a reference to its literal constant width.

    {
        open (FH, "<ciona_fasta_two") or die $!;
        local ($/, $_) = (\500);
        while (<FH>) {
            my ($first, $last) = (substr($_,0,1), substr($_,-1,1));
            # do other stuff
        }
        close FH or die $!;
    }

    After Compline,
    Zaxo

      That's not a sliding window.

        It is with seek.
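        That is, read fixed-width records and seek back all but one character after each read. A sketch (width 4 on a throwaway file, rather than 500):

```perl
use strict;
use warnings;

my $width = 4;                       # 500 in the original
my $file  = 'seek_demo.txt';
open(my $out, '>', $file) or die $!;
print $out 'abcdefg';
close $out;

open(my $fh, '<', $file) or die $!;
local $/ = \$width;                  # fixed-width records
my @pairs;
while (my $rec = <$fh>) {
    last if length($rec) < $width;   # partial record at EOF
    push @pairs, substr($rec, 0, 1) . substr($rec, -1, 1);
    seek($fh, 1 - $width, 1);        # slide the window forward by one char
}
close $fh;
unlink $file;

print "@pairs\n";    # ad be cf dg
```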

        After Compline,
        Zaxo

Re: character-by-character in a huge file
by Anomynous Monk (Scribe) on Apr 08, 2004 at 18:17 UTC
    I think you need to say more about what you need to do with the first and last characters. Can you ignore 500-char ranges spanning the > marks? Are you looking for some literal first character? If the answers are yes, something like this should go pretty quickly, using a regex to go through the buffer one character at a time:
    $/ = ">";
    while (my $line = <>) {
        while ($line =~ m/x(?=.{498}(.))/gs) {
            my $from_x_500_chars = $1;
        }
    }
    If you do try different methods suggested, it would be nice if you posted a summary of results.
Re: character-by-character in a huge file
by kvale (Monsignor) on Apr 08, 2004 at 18:20 UTC
    Consider the informal benchmark
    use Benchmark qw|:all|;

    my @a = (1..1_000_000);
    my $b;
    my $c = 'x' x 1_000_000;
    my $d;

    cmpthese( 5, {
        ary => sub {
            for my $i (0..1_000_000-501) {
                $b = $a[$i] - $a[$i+500];
            }
        },
        substr => sub {
            for my $i (0..1_000_000-501) {
                $d = substr($c,$i,1) . substr($c,$i+500,1);
            }
        },
        unpack => sub { my @b = unpack 'c*', $c },
        split  => sub { my @b = split //, $c },
    } );
    On my arthritic PIII machine, I get
    Benchmark: timing 5 iterations of ary, split, substr, unpack...
           ary:  8 wallclock secs ( 8.62 usr +  0.00 sys =  8.62 CPU) @ 0.58/s (n=5)
         split: 26 wallclock secs (24.93 usr +  0.33 sys = 25.26 CPU) @ 0.20/s (n=5)
        substr: 11 wallclock secs (11.64 usr +  0.00 sys = 11.64 CPU) @ 0.43/s (n=5)
        unpack:  7 wallclock secs ( 6.14 usr +  0.03 sys =  6.17 CPU) @ 0.81/s (n=5)

           s/iter  split substr    ary unpack
    split    5.05     --   -54%   -66%   -76%
    substr   2.33   117%     --   -26%   -47%
    ary      1.72   193%    35%     --   -28%
    unpack   1.23   309%    89%    40%     --
    This says that, for the ops between elements $i and $i+500 I picked, an array rep is faster than a string rep, but the overhead of breaking a string into an array will negate that advantage. The best you could do on my machine with these algorithms and ops is to move along the string at about 500_000 chars/sec.

    Another solution that has not been mentioned yet is using a regexp. From perlretut:

    while ($dna =~ /(?=A.{500}T)./g) { print "Got an AT match at position ", pos $dna, "\n"; }
    Update Fixed regexp.

    -Mark

Re: character-by-character in a huge file
by ambrus (Abbot) on Apr 08, 2004 at 18:51 UTC

    Just an idea, I'm not sure whether you can use it.

    If you make a copy of a large string you've read, the two copies will have separate pos-itions, thus, you can traverse both with regexps without the need for substr'ing it.

    For example:

    $string1= "hello, world\n";
    $string2= $string1;
    pos($string1)= 4;
    while ($string1=~/(.)/gs) {
        $win_right= $1;
        $string2=~/(.)/gs;
        $win_left= $1;
        print " "x$n++, "$win_left...$win_right\n";
    }

    prints:

    h...o
     e...,
      l...
       l...w
        o...o
         ,...r
           ...l
           w...d
            o...

    This will of course have difficulties on what to do at the boundary of two blocks you've read.

    Update: spelling. Whether, whether, whether, whether, I'll have to learn this eventually.

Re: character-by-character in a huge file
by thor (Priest) on Apr 08, 2004 at 20:46 UTC
    Perhaps I'm seeing things that aren't there, but isn't fasta some sort of gene sequencing format? If so, why re-invent the wheel?

    thor

Re: character-by-character in a huge file
by Itatsumaki (Friar) on Apr 08, 2004 at 20:48 UTC

    This is a fasta-formatted sequence file, right? Your best bet for doing this might be the BioPerl library. They have both a regular fasta-parser as well as a special indexed parser for genome-sized files. The latter works really well if you need to get subsequences from a file. For instance, I use the indexed version when I want to extract all mRNA sequences from a single chromosome.

    Here is the code for using the indexed-version. It processes every mRNA on every chromosome in a typical eukaryotic genome in about 30 minutes on a P4.

    # set up chromosome sequence object
    my $file  = $chromosome.'.fa';
    my $seqio = Bio::SeqIO->new(
        -format => 'largefasta',
        -file   => $file
    );

    # set up range parameters
    my $strand     = '+';
    my $seq_start  = 1_000_000;
    my $seq_end    = 2_000_000;
    my $downstream = 10_000;

    # handle sequences on both strands
    if ($strand eq '+') {
        $seq_start = $tx_start - $upstream;
        $seq_end   = $tx_start + $downstream;
        if ($seq_start < 0)       { next(); }
        if ($seq_end > $length)   { next(); }
        $seq_mRNA = $seq_chr->trunc($seq_start, $seq_end);
    }
    elsif ($strand eq '-') {
        $seq_start = $tx_end + $upstream;
        $seq_end   = $tx_end - $downstream;
        if ($seq_end < 0)         { next(); }
        if ($seq_start > $length) { next(); }
        $seq_mRNA = $seq_chr->trunc($seq_end, $seq_start)->revcom();
    }
    else {
        warn "No strand: $strand!\n";
        next();
    }

    # write seq to a file
    my $outfile = $mRNA.'.fasta';
    open(OUT, '>', $outfile);
    print OUT ">$mRNA\n", $seq_mRNA->seq();
    close(OUT);

    And there are always the usual caveats with BioPerl -- great user support group, fairly big initial learning curve, highly object-oriented library....

    -Tats
Re: character-by-character in a huge file
by bart (Canon) on Apr 09, 2004 at 12:26 UTC
    It makes no sense to read a large file a byte at a time, on a system where you use perl. 3GB is a bit much to hold in memory at once, at least it is for me, but a buffer of a few k to a few tens of k should work really well. Also, there's no need for the two handles... use the same buffer for both.

    My idea is to fill the buffer, look for the start character, and if you find it too close to the end of the block, read some more and append it to the buffer. read() even supports this operation out of the box.

    Sample code (not tested well):

    my $windowsize = 500;
    use constant BLOCKLENGTH => 4096;
    my $buffer = "";
    my $offset = 0;
    while (my $r = read FH, $buffer, BLOCKLENGTH) {
        my $i = -1;
        until (($i = index($buffer, "x", ++$i)) < 0) {
            printf "found 'x' at %d+%d\n", $offset, $i;
            if ($i + $windowsize > length $buffer) {
                # get rid of what we no longer need, or we might end up
                # with a buffer holding the whole huge file:
                $offset += $i;
                $buffer = substr $buffer, $i;
                $i = 0;
                # append a new buffer (assuming BLOCKLENGTH >= $windowsize):
                $r = read FH, $buffer, BLOCKLENGTH, length $buffer;
                last if $windowsize > length $buffer;   # not long enough
            }
            # do something here...
            printf "offset for 'x' is %d, found '%s' at %d\n",
                $offset + $i,
                substr($buffer, $i + $windowsize - 1, 1),
                $offset + $i + $windowsize - 1;
        }
    }
    continue {
        $offset += length $buffer;
    }

      One character each time, why not? Perl already buffers the file so getc should be faster than a lot of magic with substr (or unpack).

        It depends on the frequency of the matches. I'm convinced that one call to index() on a string of 10k, with a negative result, or just a few matches, is a lot faster than 10000 calls to getc() and the same number of eq tests. It's a matter of doing the same task in C, or in Perl. C usually wins hands down.
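        A quick way to put numbers on that claim (with substr standing in for getc, since getc needs a filehandle; the string contents are made up):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# 10k characters with a single 'x' at the very end:
my $data = ('.' x 9_999) . 'x';

cmpthese(-1, {
    # one C-level scan of the whole string
    index_once => sub {
        my $i = index($data, 'x');
    },
    # a Perl-level loop over every character
    char_loop => sub {
        for my $i (0 .. length($data) - 1) {
            last if substr($data, $i, 1) eq 'x';
        }
    },
});
```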
Re: character-by-character in a huge file
by BrowserUk (Pope) on Apr 10, 2004 at 06:38 UTC

    Please see Optimising processing for large data files. for some techniques for reducing your runtimes by 96%, if your application doesn't lend itself to using Bioperl libraries.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      Thanks for your thorough consideration of my problem. More detail than I expected to receive...which is a good thing.

      A quick aside: as you noted, it's been suggested that I use bioperl to tackle my problem. I use bioperl nearly every day in various ways...when its tools fit my need. In this case, bioperl offers nothing relevant (fasta format is trivial, and there's no facility in bioperl to "read character by character, really quickly").

      Sadly, the optimizations you've suggested don't work nearly as well for me as they do for you. The first part of the benchmark code (below) deals with this problem. For simplicity, I've removed the bit about sliding window. It's something I need to deal with, but it's not really at the heart of my problem. In this test, I just check to see how long it takes to get access to every character.

      The heart of the problem is that I need to read the value of each character in the file one-at-a-time (the "why" is directly related to the sliding window: I can either recalculate the character counts of the window each time I slide over one space, or I can just consider the character being left behind by the slide, and the new character being added). But accessing each individual character from a file is dreadfully slower in perl than just slurping the file off disk. This is shown with the second part of the benchmark:

      ... (running Redhat Linux 9.0. perl 5.8.0, testing with a 2MB file)

      #!/usr/bin/perl -sl
      use strict;
      use Benchmark qw|:all|;

      my $ch;
      my $filename = "ciona_fasta_two";
      $/ = ">";

      cmpthese( 10, {
          getc => sub {
              open (FH, "<$filename");
              until (eof(FH)) {
                  $ch = getc(FH);
              }
              close FH;
          },
          slurp_substr => sub {
              open (FH, "<$filename");
              my $i = 0;
              while (<FH>) {
                  while ($ch = substr($_,$i++,1)) { }
              }
              close FH;
          },
          raw_slurp_substr => sub {
              open (FH, '<:raw', "$filename");
              my $i = 0;
              while (<FH>) {
                  while ($ch = substr($_,$i++,1)) { }
              }
              close FH;
          },
          slurp_regex => sub {
              open (FH, "<$filename");
              while (<FH>) {
                  while ( /(.)/g ) { }    # the char is in $1
              }
              close FH;
          },
          raw_sysread_onechar => sub {
              open (FH, '<:raw', "$filename");
              while ( sysread( FH, $ch, 1 ) ) { }
              close FH;
          },
          nonraw_sysread_onechar => sub {
              open (FH, '<:raw', "$filename");
              while ( sysread( FH, $ch, 1 ) ) { }
              close FH;
          },
          raw_sysread_buffer => sub {
              my ($read, $buf);
              open (FH, '< :raw', "$filename");
              while ( $read = sysread( FH, $buf, 100*4096) ) {  # faster than 1*4096
                  for ( 1 .. $read ) {
                      $ch = substr( $buf, $_, 1 );
                  }
              }
              close FH;
          },
          nonraw_sysread_buffer => sub {
              my ($read, $buf);
              open (FH, "<$filename");
              while ( $read = sysread( FH, $buf, 100*4096) ) {
                  for ( 1 .. $read ) {
                      $ch = substr( $buf, $_, 1 );
                  }
              }
              close FH;
          },
      } );

      cmpthese( 10, {
          slurp_substr => sub {
              open (FH, "<$filename");
              my $i = 0;
              while (<FH>) {
                  while ($ch = substr($_,$i++,1)) { }
              }
              close FH;
          },
          slurp_simpleregex => sub {
              my $len=0;
              open (FH, "<$filename");
              while (<FH>) {
                  $_ =~ /(.)$/;
              }
              close FH;
          },
          slurp_length => sub {
              my $len=0;
              open (FH, "<$filename");
              while (<FH>) {
                  $len += length($_);
              }
              close FH;
          },
      } );

      -----> RESULTS ----->

                             s/iter raw_sysread_onechar nonraw_sysread_onechar raw_slurp_substr getc nonraw_sysread_buffer slurp_regex raw_sysread_buffer slurp_substr
      raw_sysread_onechar      2.97       --      -0%     -4%   -19%    -46%    -51%    -53%   -70%
      nonraw_sysread_onechar   2.95       0%       --     -3%   -19%    -46%    -51%    -52%   -70%
      raw_slurp_substr         2.85       4%       4%      --   -16%    -44%    -49%    -51%   -69%
      getc                     2.40      24%      23%     19%     --    -33%    -40%    -41%   -63%
      nonraw_sysread_buffer    1.60      86%      85%     79%    50%      --    -10%    -12%   -45%
      slurp_regex              1.45     105%     104%     97%    66%     11%      --     -3%   -39%
      raw_sysread_buffer       1.41     111%     110%    103%    70%     13%      3%      --   -37%
      slurp_substr            0.886     235%     234%    222%   171%     80%     63%     59%     --

                          Rate slurp_substr slurp_length slurp_simpleregex
      slurp_substr      1.14/s           --         -96%              -97%
      slurp_length      31.2/s        2634%           --              -12%
      slurp_simpleregex 35.7/s        3025%          14%                --

      As you can see from the first test, adding the "raw" option doesn't help, and sysread generally slows things down rather than helping. In general, the best performance I can get is by slurping a bunch of content with the standard $line=<FH> syntax, then indexing into the string using substr.

      But as you can see from the 2nd benchmark, the performance of this option is actually pretty terrible. I can read in the entire contents of the file (or even test with a regex, without assigning individual characters to an accessible variable) in 1/30th the time it takes to look at each character in the file. That's much worse than what I've seen (but haven't tested here) in C.

      I still hold out hope that there's a faster option - and based on your complete treatment of the topic in your original post, hope you'll either a) see what I'm doing wrong, or b) come up with another speedup idea.

      Thanks again, Travis
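      P.S. For clarity, here's the incremental window-count update I described above, where only two hash entries change per slide (sketched with a made-up sequence and a 4-char window instead of 500):

```perl
use strict;
use warnings;

my $seq  = 'AACGTTGCAA';   # stand-in for the real sequence data
my $size = 4;              # 500 in the real script
my %count;

# Initialise counts for the first window.
$count{$_}++ for split //, substr($seq, 0, $size);

for my $i ($size .. length($seq) - 1) {
    my $leave = substr($seq, $i - $size, 1);
    my $enter = substr($seq, $i, 1);
    $count{$leave}--;      # char dropping off the back of the window
    $count{$enter}++;      # char entering at the front
    # %count now describes the window ending at position $i,
    # without rescanning the other $size - 2 characters.
}
```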

        a) see what I'm doing wrong,

        The first thing you are doing wrong is that you are comparing apples and oranges. Take your 2nd benchmark.

        cmpthese( 10, {
            slurp_substr => sub {
                open (FH, "<$filename");
                my $i = 0;
                while (<FH>) {
                    while ($ch = substr($_,$i++,1)) { }
                }
                close FH;
            },
            slurp_simpleregex => sub {
                my $len=0;
                open (FH, "<$filename");
                while (<FH>) {
                    $_ =~ /(.)$/;
                }
                close FH;
            },
            slurp_length => sub {
                my $len=0;
                open (FH, "<$filename");
                while (<FH>) {
                    $len += length($_);
                }
                close FH;
            },
        });
        1. Slurp_substr()

          This reads the whole file, record-by-record, and then appears to set the (global) variable $ch to each character in each record.

          But you're setting the variable $i outside the main loop; incrementing it for each char in the record; and never resetting it.

          Hence, for the second and subsequent records, $i will have whatever value it had at the end of the previous record. If the first record is longer than the rest, it will do nothing for any record other than the first.

          Both slurp_substr() and raw_slurp_substr() routines in the 1st benchmark are similarly afflicted.

        2. slurp_simpleregex()

          Your regex says put the last character of each record into $1. You're simply ignoring every character except the last in each record.

        3. slurp_length()

          This is the most mysterious of all. You read each record in the file and accumulate the lengths of those records into the variable $len.

          You never access any of the characters?

        The first rule of benchmarking is that you need to ensure that you are comparing like with like.

        The second rule is that you need to make sure that what you are benchmarking is useful and relevant to your final program.

        In these tests, you are doing neither.


        That's much worse than what I've seen (but haven't tested here) in C.

        If your point is that Perl is slower than C, you're right.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail

Node Type: perlquestion [id://343672]
Approved by Paladin
Front-paged by kvale