PerlMonks  

Re^5: Reading HUGE file multiple times

by BrowserUk (Pope)
on Apr 28, 2013 at 13:45 UTC


in reply to Re^4: Reading HUGE file multiple times
in thread Reading HUGE file multiple times

I think the reason is it's writing data line as a hash name and the data line can have 300.000 characters.

No, it's not.

At least, if your description of the file is accurate it isn't.

This bit of the code: $Library_Index{<$Library>} = tell(ARGV), reads the IDs and constructs the hash.

And this bit: scalar <$Library> reads and discards the long data lines.

However, now I think I see the problem with your version of the code.

This bit: until eof(); makes the line iterate until the whole file has been read, except that you forgot to put the filehandle $Library in the parens. The program will never end, because it is testing the end-of-file condition of a different file, which will never be true.

Change the line to:

$Library_Index{<$Library>} = tell(ARGV), scalar <$Library> until eof($Library);

And see how long it takes.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^6: Reading HUGE file multiple times
by Anonymous Monk on Apr 28, 2013 at 14:16 UTC

    Oops, replied in the wrong place.

    I tried the new code and it works really fast. The problem is that there is an error with tell and the offsets are all -1. It would be nice if I could just have ENST04000413399 as an ID, but it does not matter that much. From Dumper:

    $VAR32564 = -1; $VAR32565 = '>ENST04000413399 ';
      Problem is there is an error with tell and it's all -1.

      D'oh! I made the same mistake you did; I forgot to change ARGV for $Library. The line should read:

      $Library_Index{<$Library>} = tell($Library), scalar <$Library> until eof($Library);
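
For anyone who finds the one-liner hard to read, here is a minimal sketch of the same loop written out longhand. The temp file and its two records are made up purely for illustration, and stripping the leading '>' and trailing whitespace from the key is my assumption about what a "clean" ID such as ENST04000413399 should look like; it is not part of the code posted in this thread.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a tiny stand-in for the real file: an ID line followed by one data line.
# (Hypothetical sample data, not taken from the thread.)
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} ">ID_A\nAAAA\n>ID_B\nCCCCCC\n";
close $tmp or die "close: $!";

open my $Library, '<', $path or die "open $path: $!";
binmode $Library;    # keep tell() offsets in plain bytes on every platform

my %Library_Index;
until (eof($Library)) {              # test the handle we are actually reading
    my $offset = tell($Library);     # position of the ID line, taken before reading it
    my $id     = <$Library>;         # the ID line
    scalar <$Library>;               # read and discard the long data line
    $id =~ s/^>//;                   # assumption: IDs start with '>'
    $id =~ s/\s+\z//;                # drop the newline and any trailing whitespace
    $Library_Index{$id} = $offset;
}
close $Library;

printf "%s => %d\n", $_, $Library_Index{$_} for sort keys %Library_Index;
```

Note that once the keys are cleaned like this, any later check that compares the key against the raw line read back from the file has to clean that line in the same way first.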

      I tested the write-the-index-to-disc code with a file containing 17,000 id/record pairs with 300,000-character data records (5.2GB).

      This creates the index and writes it to disc:

      #! perl -slw
      use strict;
      use Storable qw[ store ];

      print time;
      my %idx;
      $idx{ <> } = tell( STDIN ), scalar <> until eof STDIN;
      store \%idx, '1031021.idx' or die $!;
      print time;

      The whole process takes a little over 3 minutes:

      C:\test>1031021-i.pl <1031021.dat
      1367160156
      1367160362

      C:\test>dir 1031021*
      28/04/2013  15:30               193 1031021-i.pl
      28/04/2013  15:04     5,272,940,608 1031021.dat
      28/04/2013  15:46           316,385 1031021.idx
      28/04/2013  15:29               374 1031021.pl

      And this code loads that index from disk (<1 second) and then reads 1000 random records (26 seconds) using it:

      #! perl -slw
      use strict;
      use Storable qw[ retrieve ];

      print time;
      my $idx = retrieve '1031021.idx' or die $!;
      print time;

      open DAT, '+<', '1031021.dat' or die $!;
      for( 1 .. 1000 ) {
          my( $id, $offset ) = each %$idx;
          seek DAT, $offset, 0;
          my $vid = <DAT>;
          die 'mismatch' unless $id eq $vid;
          my $data = <DAT>;
      }
      close DAT;
      print time;

      Run:

      C:\test>1031021.pl
      1367160624
      1367160624
      1367160651
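
The test loop above walks the index with each; in real use you would more likely want one particular record by its ID. A rough sketch of that, assuming the index is keyed by the raw ID line (newline included) as in the code above. fetch_record is a hypothetical helper name, and the two-record temp file is made-up sample data standing in for the 5GB file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the data line that follows $id_line,
# or nothing if the ID is not in the index.
sub fetch_record {
    my ($fh, $idx, $id_line) = @_;
    my $offset = $idx->{$id_line};
    return unless defined $offset;
    seek $fh, $offset, 0 or die "seek: $!";
    my $seen = <$fh>;                       # should be the ID line itself
    die "index mismatch at offset $offset"
        unless defined $seen && $seen eq $id_line;
    my $data = <$fh>;                       # the record that follows the ID
    chomp $data if defined $data;
    return $data;
}

# Made-up sample file: ID line, then one data line, repeated.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} ">ID_A\nAAAA\n>ID_B\nCCCCCC\n";
close $tmp or die "close: $!";

open my $dat, '<', $path or die "open $path: $!";
binmode $dat;

# Build the index in memory (the thread loads it with Storable instead).
my %idx;
until (eof($dat)) {
    my $offset  = tell($dat);
    my $id_line = <$dat>;
    scalar <$dat>;                          # skip the data line
    $idx{$id_line} = $offset;
}

# Records can now be fetched in any order, as often as needed.
my $rec_b = fetch_record($dat, \%idx, ">ID_B\n");
my $rec_a = fetch_record($dat, \%idx, ">ID_A\n");
print "$rec_b\n$rec_a\n";
```

The point of the design is that each lookup costs one seek plus two line reads, no matter how many times the file is consulted or how large it is.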

        Perfect! Works like a charm and is blazing fast compared to the initial read method. Thanks so much for your help.
