PerlMonks  

Re^2: Reading HUGE file multiple times

by Anonymous Monk
on Apr 28, 2013 at 12:42 UTC


in reply to Re: Reading HUGE file multiple times
in thread Reading HUGE file multiple times

Hi there,

Thanks for the tips. My data looks something like

>ID

Data (a verrry long string of varying length in a single line)

>ID again

Data again

Indexing might be a good idea. Maybe I could read only the IDs (skipping the next line) and then, when accessing the data, just add +1 to the index? I need to extract the entries twice in my code, in different subroutines, and each time the subroutine specifies what to do with them. I don't know if it is a good idea to store it all in a hash. I only need to extract a fragment of the data on the first read and the whole data entry on the second. I don't have the IDs in advance; the subroutine specifies which one I need and what to do with it. I've tried

$Library_Index{<$Library>} = tell(ARGV), scalar <$Library> until eof();

but it takes a very long time to do. I wonder if there is a better way to do it since this would be a bottleneck.


Re^3: Reading HUGE file multiple times
by BrowserUk (Pope) on Apr 28, 2013 at 13:30 UTC
    I've tried $Library_Index{<$Library>} = tell(ARGV), scalar <$Library> until eof(); but it takes a very long time to do.

    How long?

    I wonder if there is a better way to do it since this would be a bottleneck.

    If you are re-using the file many times you could construct the index and write it to a separate file. It should be substantially quicker to read the index from a file than to construct it from scratch each time.
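A minimal sketch of caching the index in a separate file, using the core Storable module. The helper name `cached_index` and the idea of passing the index builder as a code ref are assumptions made for illustration, not something from the thread.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

# Hypothetical helper: return the cached index if the index file exists,
# otherwise build it with the supplied code ref and save it for later runs.
sub cached_index {
    my ($index_file, $builder) = @_;

    # Reading a stored hash back is usually much cheaper than rescanning
    # the whole data file.
    return retrieve($index_file) if -e $index_file;

    my $index = $builder->();       # expected to return a hash ref
    nstore($index, $index_file);    # portable (network byte order) format
    return $index;
}
```

Note that this sketch never notices when the data file changes; delete the index file (or compare modification times) whenever the library is regenerated.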


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      It ran for over 10 minutes, so I killed the process. I think the reason is it's writing data line as a hash name and the data line can have 300.000 characters. I changed it so it only reads the index of the ID and then will just add 1 to it when I need the data. Then it's done in a couple of seconds. Thanks for the tips on the index.
        I think the reason is it's writing data line as a hash name and the data line can have 300.000 characters.

        No, it's not.

        At least, if your description of the file is accurate it isn't.

        This bit of the code: $Library_Index{<$Library>} = tell(ARGV), reads the IDs and constructs the hash.

        And this bit: scalar <$Library> reads and discards the long data lines.

        However, now I think I see the problem with your version of the code.

        This bit: until eof(); iterates until the file has been read, except that you forgot to put the filehandle $Library in the parens. With empty parens, eof() tests the end-of-file condition of a different file (the ARGV pseudo-file read by <>), so the program will never end, because that condition will never become true.

        Change the line to:

        $Library_Index{<$Library>} = tell(ARGV), scalar <$Library> until eof($Library);

        And see how long it takes.
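One way to sidestep the eof() pitfall entirely is to loop on the read itself, since reading a line returns undef at end of file. The sketch below assumes the file strictly alternates one ID line and one data line; the helper name `index_ids` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: map each ID to the byte offset of its data line.
sub index_ids {
    my ($fh) = @_;
    my %offset_of;

    # <$fh> returns undef once the file is exhausted, so no eof() is needed.
    while (defined(my $id = <$fh>)) {
        chomp $id;
        $offset_of{$id} = tell($fh);         # position just after the ID line
        defined(my $data = <$fh>) or last;   # read and discard the data line
    }
    return \%offset_of;
}
```

Because the handle is passed in explicitly, every call in the loop refers to the same file, and there is no way to test the wrong handle by accident.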



      I tried the new code and it works really fast. The problem is that there is an error with tell: every value is -1. From Dumper:

      $VAR32564 = -1; $VAR32565 = '>ENST00400413799 ';
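The -1 values are consistent with tell being called on ARGV, which is not the handle being read; tell returns -1 when it cannot report a position, so the offsets presumably need to come from tell($Library) instead. Once valid offsets are stored, both uses described earlier in the thread (a fragment on one pass, the whole entry on another) can go through seek. A sketch under those assumptions, with the hypothetical helper `fetch_data` and an index that holds the byte offset of each data line:

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: fetch the data for one ID through a prebuilt index
# whose values were recorded with tell() on this same file.
# With $length given, only that many bytes from the start of the data are
# read; otherwise the whole data line is returned.
sub fetch_data {
    my ($fh, $index, $id, $length) = @_;

    my $pos = $index->{$id};
    return unless defined $pos;              # unknown ID

    seek($fh, $pos, SEEK_SET) or die "seek failed: $!";

    if (defined $length) {
        defined(read($fh, my $fragment, $length)) or die "read failed: $!";
        return $fragment;                    # may run past the line end if
    }                                        # $length exceeds the data length

    my $line = <$fh>;
    chomp $line;
    return $line;
}
```

Reading only a fragment avoids pulling a 300,000-character line into memory when a short prefix is all that is needed.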
