PerlMonks
huge file->hash

by ISAI student (Scribe)
on Apr 20, 2005 at 12:21 UTC ( #449565=perlquestion )
ISAI student has asked for the wisdom of the Perl Monks concerning the following question:

Hello all. I am an EE, so while my question may seem basic to some of you, for me it is not. How do I efficiently search a 2 GB file, without risking running out of memory or too great a runtime?

I have the following dilemma:
I have two text files depicting electrical path data; each can amount to 2 GB. Let them be path_p and path_n. I have another file, dif_file, containing the paths that appear in only one of these files. Each file may contain two different types, MIN & MAX; I know how to discern one from the other. The paths are separated by ^Path (path_number), i.e. each such line means a new path. The file has a header, but no footer. Based on the information in dif_file, I need to find the corresponding path (using the data within it) in the 2 GB file. This had better be done with as little runtime as possible from an algorithmic point of view, as I don't plan on writing the code in C ;-), and without running out of memory. My initial concept was this:

1. Get a list for path_n from dif_file
2. Get a list for path_p from dif_file
3. Open path_n and read it record by record (from one ^Path to the next ^Path), generate the key, and if it fits, keep the record, else discard it; then generate the list.
Repeat for path_p
Does anyone have a better concept?
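A minimal sketch of step 3 above, assuming each record starts at a "Path (number)" line and that the path number itself serves as the key (both are assumptions, since the real file format isn't shown):

```perl
use strict;
use warnings;

# Scan a path file one "Path (n)" record at a time, keeping only the
# records whose number appears in %$wanted -- so at most one record
# is held in memory at any moment.
sub extract_wanted {
    my ($fh, $wanted) = @_;
    my ($key, $record, @found);
    while (my $line = <$fh>) {
        if ($line =~ /^Path\s*\((\d+)\)/) {    # a new record starts
            push @found, $record if defined $key && $wanted->{$key};
            ($key, $record) = ($1, $line);
        }
        elsif (defined $key) {                 # body of current record
            $record .= $line;
        }                                      # else: file header, skip
    }
    push @found, $record if defined $key && $wanted->{$key};   # last one
    return @found;
}
```

For a 2 GB input this streams the file once; memory use is bounded by the size of a single record plus the matches kept.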

Edit by tye: Preserve formatting

Re: huge file->hash
by graff (Chancellor) on Apr 20, 2005 at 13:06 UTC
    I'm not sure I fully understand your goal, but I think you do not want to try loading the path_n or path_p file into a hash.

    If I read your post correctly, your script will have three inputs:

    • path_n, which contains many path strings, plus data about each path (total size about 2GB), with one "path+data" record per line
    • path_p, which is like path_n in size and structure, but contains some path strings not in common with path_n
    • dif_file, which contains a list of the path strings that are unique in the two path files

    You didn't say exactly, but I assume that your goal is to find the records in each path file that match one of the unique path strings, and do something with the full content of those records -- maybe just output them.

    If I got that right, then the method you want is something like this:

    1. read dif_file into a hash, using each path name as a hash key; you don't need to worry about what the hash value is -- you could do  $diff_hash{$path} = undef;
    2. open path_n, read it one line at a time, assign the path string to a variable, and see if %diff_hash contains an element with that path as the hash key; if so, print out the full record, otherwise go on to the next record
    3. open path_p, and do the same thing you did for path_n
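    Step 1 could look like this (assuming dif_file holds one path name per line; a small sub keeps it easy to test):

```perl
use strict;
use warnings;

# Build the lookup hash from dif_file: the keys are what matter,
# the values can stay undef.
sub load_dif {
    my ($fh) = @_;
    my %diff_hash;
    while (my $path = <$fh>) {
        chomp $path;
        $diff_hash{$path} = undef;   # existence is all we check later
    }
    return %diff_hash;
}
```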

    So the loop over the path_* files might look like this:

    for my $file ( qw/path_n path_p/ ) {
        open PATH, $file or die "$file: $!";
        while (<PATH>) {
            # use a regex match with parens to capture path string in $1
            # and test to see if the path string was in dif_file:
            if ( /^(\S+)/ and exists( $diff_hash{$1} )) {
                print;   # or whatever else you need to do with this record
            }
        }
        close PATH;
    }
      Sorry about that. The answer is this:
      The dif_file contains two lists.
      1. paths to be looked up in file path_n
      2. paths to be looked up in file path_p
      I know how to map from the list to the file and vice versa.
      By reading the path data from one huge file at a time (a few lines for each path), I can go and look in its respective list; if it is there, I should do X, then continue.
      My idea was to convert these two lists into hashes, and to keep deleting found paths from the hashes; as it is relatively easy to generate the key, using exists($hash{$key}) shouldn't be much of a problem.
      That way, when the two hashes are empty, I can simply write my final output and exit, and not read the two huge files for paths that aren't needed.
      Is it more easily understood now?
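      The delete-as-you-go idea might be sketched like this (the "Path (n)" key format is an assumption, and "do X" stays a comment since the thread doesn't say what X is):

```perl
use strict;
use warnings;

# Scan a path file, delete each wanted path from the hash as it is
# found, and stop reading as soon as the hash is empty.
sub scan_until_done {
    my ($fh, $wanted) = @_;          # $wanted: hashref, path number => 1
    while (my $line = <$fh>) {
        next unless $line =~ /^Path\s*\((\d+)\)/;
        if (exists $wanted->{$1}) {
            # ... do X with this path here ...
            delete $wanted->{$1};
            last unless %$wanted;    # nothing left to find: bail out early
        }
    }
    return scalar keys %$wanted;     # paths still missing at EOF, if any
}
```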

        The size of your hashes in memory is going to be your limiting factor.

        You say each of your files is approx. 2GB in size, and each element is "several lines". You certainly won't be able to store a hash containing all the data in both files in memory on most normal PCs, which typically have a 2GB-per-process RAM limit.

        If your paths are an average of say 5 lines of 80 chars in length, then that gives approximately 5 million paths in each file. And if the keys to your paths were say 10 chars in length, then you could represent each file in memory with a hash that has 5 million 10-char keys and a single integer value--the byte offset into the file. That approximates to 250MB per file, which would theoretically allow you to index both your huge files in memory simultaneously, if the machine this is going to run on has a full complement of ram.

        However, if your paths are substantially less in average length (giving more paths) or your keys are substantially longer, then memory requirement for each hash grows and you start to move to the point where you are pushing the limits of what is possible.
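        The byte-offset index described above might be built with tell and used with seek, along these lines (the key regex is an assumption about the file format):

```perl
use strict;
use warnings;

# One pass over the file records where each "Path (n)" record starts;
# the hash then holds a short key and an integer offset, not the data.
sub build_index {
    my ($fh) = @_;
    my %offset;
    my $pos = tell $fh;
    while (my $line = <$fh>) {
        $offset{$1} = $pos if $line =~ /^Path\s*\((\d+)\)/;
        $pos = tell $fh;             # start position of the next line
    }
    return %offset;
}

# Later, jump straight to any record instead of rescanning 2GB:
#   seek $fh, $offset{$key}, 0;     # 0 == SEEK_SET
```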

        Your alternative in that case is to look at further reducing the memory requirements of storing the index, which basically means getting into using (or writing) some kind of less memory hungry alternative to hashes--Tie::SubstrHash comes with (most) distributions of perl, and whilst it's a bit awkward to use, and substantially slower than a real hash, it does reduce the memory requirements markedly.
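        For the curious, Tie::SubstrHash usage looks roughly like this -- it needs fixed key and value lengths and a preset capacity up front (the 10/12/100_000 sizes below are placeholders for illustration, not recommendations):

```perl
use strict;
use warnings;
use Tie::SubstrHash;                 # ships with perl

# Keys must be exactly 10 bytes, values exactly 12, at most 100_000
# entries -- everything is packed into one flat string internally.
tie my %index, 'Tie::SubstrHash', 10, 12, 100_000;

my $key = sprintf '%-10.10s', '42';            # pad key to 10 bytes
$index{$key} = sprintf '%012d', 123_456_789;   # offset as 12-digit text
my $offset = 0 + $index{$key};
```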

        Or, you could load your data into some flavour of DB.

        A Berkeley DB might fit the bill, but it requires moving your datafiles into the Berkeley format. This is a fairly slow process, and the resultant files will be much larger than the equivalent flat files. Once transferred, and with the optimal configuration, the actual lookups during processing will be quite fast: say, 4x to 10x slower than a normal hash.
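        With the DB_File module (which wraps Berkeley DB, if that library is installed), the tied hash might look like this; the file name and key are made up for illustration:

```perl
use strict;
use warnings;
use DB_File;                         # requires the Berkeley DB library
use Fcntl qw(O_CREAT O_RDWR);

# Tie a hash to an on-disk BTree: lookups go through the OS disk
# cache instead of perl's in-process memory.
tie my %index, 'DB_File', 'paths.db', O_CREAT | O_RDWR, 0644, $DB_BTREE
    or die "paths.db: $!";

$index{'path_00042'} = 123_456;      # key => byte offset, persisted
my $offset = $index{'path_00042'};

untie %index;
```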

        Or you could use a general purpose RDBMS. Again, the initial loading of the flatfiles into the DB will require a fair amount of time, and the diskspace requirement climbs substantially; and as each lookup will require a separate SQL statement with its communication overhead, the processing stage will be substantially slower than the Berkeley option.
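        As a sketch of the RDBMS route with DBI (DBD::SQLite here purely for illustration; neither the table name nor the schema comes from the thread), a prepared statement amortises the per-lookup SQL overhead:

```perl
use strict;
use warnings;
use DBI;                             # plus a driver such as DBD::SQLite

my $dbh = DBI->connect('dbi:SQLite:dbname=paths.sqlite', '', '',
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS paths (name TEXT PRIMARY KEY, offset INTEGER)');

# Load phase: one row per path (shown with a single example row).
my $ins = $dbh->prepare('INSERT OR REPLACE INTO paths VALUES (?, ?)');
$ins->execute('path_00042', 123_456);

# Lookup phase: prepare once, execute once per path.
my $sel = $dbh->prepare('SELECT offset FROM paths WHERE name = ?');
$sel->execute('path_00042');
my ($offset) = $sel->fetchrow_array;
$sel->finish;
$dbh->disconnect;
```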

        Which approach is best for your situation really depends on how often you would need to do the processing, how often you would need to reload the DB, and whether you could change the process that produces the two files to write the data directly to one form of DB or the other.

        You haven't provided any clues to the numbers of records (paths), or their sizes, or the key lengths, or the frequencies of the tasks, so the best that anyone can do is give you a very general reply.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco.
        Rule 1 has a caveat! -- Who broke the cabal?
        Is it more easily understood now? Um, sorry, but no I'm afraid it is not.

        Go ahead and try something out. If what you try works to your satisfaction, great. If not, and if you would like us to help, show us the code you tried, along with a minimal sample of input data from each file.

        And if you can, try to state clearly what your "final output" is supposed to be.

Re: huge file->hash
by polettix (Vicar) on Apr 20, 2005 at 13:07 UTC
    Hi, I have to admit that I really struggled to get to the very end of the post, maybe due to poor formatting - it's only one big chunk of text. Can you yourself read and understand it at a glance? You'd better read Writeup Formatting Tips about this.

    I'd suggest that you post a basic implementation of your algorithm instead of simply describing it. Please also post a minimal set of data to better explain your needs and ideas; otherwise, you can easily understand that what's pretty clear to you remains quite foggy for us (for me, at least). Please ensure that the data you post is the bare minimum required for us to fully understand your problem without drowning in a sea of data.

    Each monk, I think, values their mental bandwidth very much, and asking them to waste it could result in your questions not being answered. That would be a lose-lose situation: you wouldn't get your answer, and we couldn't help you. This is why How do I post a question effectively? exists.

    Flavio (perl -e "print(scalar(reverse('ti.xittelop@oivalf')))")

    Don't fool yourself.
