http://www.perlmonks.org?node_id=1031069


in reply to Re^4: Reading HUGE file multiple times
in thread Reading HUGE file multiple times

I think the reason is it's writing data line as a hash name and the data line can have 300.000 characters.

No, it's not.

At least, if your description of the file is accurate it isn't.

This bit of the code: $Library_Index{<$Library>} = tell(ARGV), reads the IDs and constructs the hash.

And this bit: scalar <$Library> reads and discards the long data lines.

However, now I think I see the problem with your version of the code.

This bit of the line: until eof(); iterates until the file has been read, except that you forgot to put the filehandle $Library in the parens, so the program never ends because it is testing the end-of-file condition of a different file, which will never become true.
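(An aside, not in the original reply: with empty parentheses, eof() consults the <> / @ARGV pseudo-file rather than a handle you opened yourself, so it says nothing about $Library. A tiny illustration, using a hypothetical file name:)

# Illustration only: 'library.fasta' is a made-up file name.
open my $fh, '<', 'library.fasta' or die $!;

1 while <$fh>;                 # exhaust the handle
print 'at eof' if eof( $fh );  # true: $fh really is at end of file
# "until eof()" in the loop being discussed never looks at $fh at all;
# it checks the <> pseudo-file (@ARGV / STDIN) instead.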

Change the line to:

$Library_Index{<$Library>} = tell(ARGV), scalar <$Library> until eof($Library);

And see how long it takes.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^6: Reading HUGE file multiple times
by Anonymous Monk on Apr 28, 2013 at 14:16 UTC

    Oops, replied in the wrong place.

    I tried the new code and it works really fast. Problem is there is an error with tell and it's all -1. It would be nice if I could just have ENST04000413399 as an ID, but that does not matter that much. From Dumper:

    $VAR32564 = -1;
    $VAR32565 = '>ENST04000413399 ';
      Problem is there is an error with tell and it's all -1.

      D'oh! I made the same mistake you did; I forgot to change ARGV for $Library. The line should read:

      $Library_Index{<$Library>} = tell($Library), scalar <$Library> until eof($Library);
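      Written out longhand (an editorial sketch of what that one-liner amounts to, not part of the original reply), the loop is:

      # Sketch: the file alternates one ID line with one very long data
      # line, so each pass records where the ID line starts and then
      # skips the data line that follows it.
      until( eof( $Library ) ) {
          my $offset = tell( $Library );   # position of the ID line about to be read
          my $id     = <$Library>;         # the ID line, kept verbatim (newline included)
          $Library_Index{ $id } = $offset;
          scalar <$Library>;               # read and discard the long data line
      }

      The stored offset points at the ID line itself, which is why the retrieval code further down can seek to it, re-read the ID and compare it against the key.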

      I tested the write-the-index-to-disc code with a file containing 17,000 id/record pairs with 300,000-character data records (5.2GB).

      This creates the index and writes it to disc:

      #! perl -slw
      use strict;
      use Storable qw[ store ];

      print time;

      my %idx;
      $idx{ <> } = tell( STDIN ), scalar <> until eof STDIN;

      store \%idx, '1031021.idx' or die $!;

      print time;

      The whole process takes a little over 3 minutes:

      C:\test>1031021-i.pl <1031021.dat
      1367160156
      1367160362

      C:\test>dir 1031021*
      28/04/2013  15:30               193 1031021-i.pl
      28/04/2013  15:04     5,272,940,608 1031021.dat
      28/04/2013  15:46           316,385 1031021.idx
      28/04/2013  15:29               374 1031021.pl

      And this code loads that index from disk (<1 second) and then reads 1000 random records (26 seconds) using it:

      #! perl -slw
      use strict;
      use Storable qw[ retrieve ];

      print time;

      my $idx = retrieve '1031021.idx' or die $!;

      print time;

      open DAT, '+<', '1031021.dat' or die $!;

      for( 1 .. 1000 ) {
          my( $id, $offset ) = each %$idx;
          seek DAT, $offset, 0;
          my $vid = <DAT>;
          die 'mismatch' unless $id eq $vid;
          my $data = <DAT>;
      }

      close DAT;

      print time;

      Run:

      C:\test>1031021.pl
      1367160624
      1367160624
      1367160651
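      Not part of the reply above, but for completeness: the same index also serves for fetching one particular record rather than random ones. A minimal sketch, assuming the keys are stored exactly as the index above kept them (leading '>' and trailing newline included; strip those at index-build time if you want bare IDs such as ENST04000413399), with a hypothetical $wanted ID:

      #! perl -slw
      use strict;
      use Storable qw[ retrieve ];

      # Hypothetical ID to look up; it must match the stored key exactly.
      my $wanted = ">ENST04000413399\n";

      my $idx = retrieve '1031021.idx' or die $!;

      open my $dat, '<', '1031021.dat' or die $!;

      if( defined( my $offset = $idx->{ $wanted } ) ) {
          seek $dat, $offset, 0;   # jump straight to the ID line
          my $id   = <$dat>;       # the ID line
          my $data = <$dat>;       # the long data line that follows it
          print 'Record length: ', length $data;
      }
      else {
          warn "ID not found in index\n";
      }

      close $dat;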

        Perfect! Works like a charm and is blazing fast compared to the initial read method. Thanks so much for your help.