Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation

Re^2: huge file->hash

by ISAI student (Scribe)
on Apr 21, 2005 at 07:43 UTC ( #449876=note: print w/replies, xml ) Need Help??

in reply to Re: huge file->hash
in thread huge file->hash

Sorry about that. The answer is this:
The dif_file contains two lists.
1. paths to be looked in file path_n
2. paths to be looed in file path_p
I know how to map from the list to the file and vice-veras
By reading the path data, from one huge file at a time (a few lines for each path) I can go and look it's respective list. if it is there, I should do X, continue
My idea was to convert these two lists into hashes, and to keep deleting from the hashes paths found, as It is relatively easy to genrate the key, so using exists($hash{$key}) shouldn't be that much of a problem.
That way, when the two hashes are empty, I can simply write my final output and exit, and not read the two huge files for paths that aren't needed.
Is it more easily understood now?

Replies are listed 'Best First'.
Re^3: huge file->hash
by BrowserUk (Pope) on Apr 21, 2005 at 11:45 UTC

    The size of your hashes in memory is going to be your limiting factor.

    You say each of your files is approx. 2GB in size, and each element is "several lines". You certainly won't be able to store a hash containing all the data in both files in memory on most normal PCs that typically have a 2GB/process ram limit.

    If your paths are an average of say 5 lines of 80 chars in length, then that gives approximately 5 million paths in each file. And if the keys to your paths were say 10 chars in length, then you could represent each file in memory with a hash that has 5 million 10-char keys and a single integer value--the byte offset into the file. That approximates to 250MB per file, which would theoretically allow you to index both your huge files in memory simultaneously, if the machine this is going to run on has a full complement of ram.

    However, if your paths are substantially less in average length (giving more paths) or your keys are substantially longer, then memory requirement for each hash grows and you start to move to the point where you are pushing the limits of what is possible.

    Your alternative in that case is to look at further reducing the memory requirements of storing the index, which basically means getting into using (or writing) some kind of less memory hungry alternative to hashes--Tie::SubstrHash comes with (most) distributions of perl, and whilst it's a bit awkward to use, and substantially slower than a real hash, it does reduce the memory requirements markedly.

    Or, you could load your data into some flavour of DB.

    A Berkely DB might fit the bill, but it requires moving your datafiles into the Berkeley format. This is a fairly slow process, and the resultant files will be much larger than the equivalent flats files. Once transfered, and with the optimal configuration, the actual lookups during processing will be quite fast say 4x to 10x slower than a normal hash.

    Or you could use a general purpose RDBMS. Again the initial loading of the flatfiles into the DB will require a fair amount of time, and the diskspace requirement climbs substantially, and as each lookup will require a separate SQL statement with it's communications overhead, so the processing stage will be substantially slower than the Berkeley option.

    Which approach is best for your situation really depends on how often you would need to do the processing, how often you would need to reload the DB, and whether you could change the process that produces the two files to write the data directly to one form of DB or the other.

    You haven't provided any clues to the numbers of records (paths), or their sizes, or the key lengths, or the frequencies of the tasks, so the best that anyone can do is give you a very general reply.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?
Re^3: huge file->hash
by graff (Chancellor) on Apr 21, 2005 at 14:33 UTC
    Is it more easily understood now? Um, sorry, but no I'm afraid it is not.

    Go ahead and try something out. If what you try works to your satisfaction, great. If not, and if you would like us to help, show us the code you tried, along with a minimal sample of input data from each file.

    And if you can, try to state clearly what your "final output" is supposed to be.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://449876]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (4)
As of 2017-07-27 05:21 GMT
Find Nodes?
    Voting Booth?
    I came, I saw, I ...

    Results (403 votes). Check out past polls.