Large file, multi dimensional hash - out of memory

by Anonymous Monk
on May 15, 2013 at 14:17 UTC ( #1033692=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a file which is about 50 million lines long, with two fields per line. The first is a name field, at most 35 characters long; the second is a file extension. I'm trying to figure out a way to produce a report of the distinct names and a breakdown of the numbers of file extensions. My attempt can't even build the hash without dying from lack of memory. Is there any way I can make this work?
use strict;
use warnings;

open(my $fh, "<", "input.txt") or die "cannot open < input.txt: $!";

my %duplicates;
while (my $line = <$fh>) {
    chomp $line;
    my ($fname, $fext) = split(' ', $line);
    $duplicates{$fname}{$fext}++;
}
data:
/foo/bar/baz/123 aaa
/foo/bar/baz/123 aab
/foo/bar/baz/123 aac
/foo/bar/baz/124 aaa
/foo/bar/baz/124 aab

Re: Large file, multi dimensional hash - out of memory
by kennethk (Monsignor) on May 15, 2013 at 14:37 UTC
    In my experience, the next step after a massive hash fails is to go to a database. DBD::SQLite is fast and easy; see Databases made easy for an intro, and http://www.w3schools.com/sql/ for quick reference.

    Update: Demo code:

use strict;
use warnings;
use DBI;
require DBD::SQLite;

unlink 'track.db';
my $db = DBI->connect(
    'dbi:SQLite:dbname=track.db', '', '',
    {
        RaiseError => 1,
        AutoCommit => 1,
    },
);
$db->do(<<EOSQL);
CREATE TABLE tracking (
    fname varchar(255),
    fext varchar(255)
)
EOSQL

my $count_query = $db->prepare(<<EOSQL);
SELECT COUNT(*) FROM tracking WHERE fname=? AND fext=?
EOSQL
my $insert_query = $db->prepare(<<EOSQL);
INSERT INTO tracking (fname, fext) VALUES(?,?)
EOSQL

open(my $fh, "<", "input.txt") or die "cannot open < input.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($fname, $fext) = split(' ', $line);
    $count_query->execute($fname, $fext);
    my ($count) = $count_query->fetchrow_array;
    $count_query->finish;
    if (!$count) {
        print "$fname $fext\n";
        $insert_query->execute($fname, $fext);
    }
}
$db->disconnect;

    Yeah, it's longer, but you are going to have to do work on disk if you can't hold it in memory. Feel like implementing your own Merge sort instead?
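
    A possible follow-up (not part of the demo above, just a sketch): once the table is populated, SQLite itself can produce the per-name breakdown with a GROUP BY, so Perl never has to hold the whole data set. GROUP_CONCAT is a built-in SQLite aggregate; run this before the disconnect:

# Sketch: report distinct names with their extensions and extension counts.
# Because the loop above only inserts unseen (fname, fext) pairs, COUNT(*)
# per fname is the number of distinct extensions for that name.
my $report_query = $db->prepare(q{
    SELECT fname, GROUP_CONCAT(fext) AS exts, COUNT(*) AS n_ext
    FROM tracking
    GROUP BY fname
    ORDER BY fname
});
$report_query->execute;
while ( my ($fname, $exts, $n_ext) = $report_query->fetchrow_array ) {
    print "$fname: $n_ext extension(s) [$exts]\n";
}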


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: Large file, multi dimensional hash - out of memory
by davido (Archbishop) on May 15, 2013 at 14:39 UTC

    Your input data set appears to be sorted. Is that intentional? If that's the case you're in luck. You only need to keep track of a single name at a time. Walk through the input file line by line. Each time the name changes, write the previous name to an output file along with an extension summary.


    Dave

      Great idea. Yes, the data is sorted. How can I dump that data to a file and free up the hash so I don't run out of memory?

        Here's an example of what I was suggesting.

my $current_name = '';
my %extensions   = ();

while ( my $line = <DATA> ) {
    chomp $line;
    my( $name, $ext ) = split ' ', $line;
    if( $name ne $current_name ) {
        print "($current_name) => [", join( ', ', sort keys %extensions ), "]\n"
            if length $current_name;
        $current_name = $name;
        %extensions   = ();
    }
    $extensions{$ext}++;
}

print "($current_name) => [", join( ', ', sort keys %extensions ), "]\n";

__DATA__
/foo/bar/baz/123 aaa
/foo/bar/baz/123 aab
/foo/bar/baz/123 aac
/foo/bar/baz/124 aaa
/foo/bar/baz/124 aab
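
        A small variation (mine, not part of the reply above): since the question also asks for the numbers of each extension, and %extensions already holds those counts, the print statements could include them, for example:

# Hypothetical variation: show the count next to each extension.
print "($current_name) => [",
      join( ', ', map { "$_ x $extensions{$_}" } sort keys %extensions ),
      "]\n" if length $current_name;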

        The memory footprint will only be as big as the maximum number of extensions per name.


        Dave

Re: Large file, multi dimensional hash - out of memory
by BrowserUk (Pope) on May 15, 2013 at 14:41 UTC

    You could reduce the memory requirement to around 1/4 by not using 2 levels of hash. A single level will do the job:

use strict;
use warnings;

open(my $fh, "<", "input.txt") or die "cannot open < input.txt: $!";

my %duplicates;
while (my $line = <$fh>) {
    chomp $line;
    ++$duplicates{$line};
}

    But that will still require around 4GB to build the 50e6-key hash. Better than 16GB, but you will still run out of memory if you are using a 32-bit Perl (unless you have a very high proportion of duplicates, e.g. >50%).
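
    To turn that flat hash into the requested report, one possible sketch (mine, not from the post above) is to walk its keys in sorted order and group by name on the fly:

# Sketch: keys look like "name extension", as built above.  Note that
# sorting 50e6 keys costs extra time and a temporary list in memory.
my $current = '';
for my $key ( sort keys %duplicates ) {
    my ( $fname, $fext ) = split ' ', $key;
    if ( $fname ne $current ) {
        print "\n$fname:";
        $current = $fname;
    }
    print " $fext=$duplicates{$key}";
}
print "\n";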

    As you say the data is presorted, investigate the uniq command


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks, it is a 64-bit Perl, but not long after passing the 4GB mark it runs out of memory anyway. I guess we need a bigger boat.

        As the input file is sorted, uniq infile > outfile ought to do the job very quickly.

        If for some reason you don't have the uniq command available, try

#! perl -w
use strict;

my $last = <>;
while( <> ) {
    print $last if $_ ne $last;
    $last = $_;
}
print $last;    # print the final group's line as well

__END__
thisscript infile > outfile

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Large file, multi dimensional hash - out of memory
by hdb (Prior) on May 15, 2013 at 17:09 UTC

    Your question is somewhat ambiguous to me (not being a native English speaker). Do you want to count file extensions by filename, or overall? In the latter case, you would use two separate hashes, one for filenames and one for extensions, and they would have fewer entries.
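
    A minimal sketch of that second reading (my illustration, not hdb's code): two flat hashes counting names and extensions independently:

use strict;
use warnings;

open( my $fh, "<", "input.txt" ) or die "cannot open < input.txt: $!";

my ( %names, %extensions );
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $fname, $fext ) = split ' ', $line;
    $names{$fname}++;        # occurrences per distinct name
    $extensions{$fext}++;    # occurrences per distinct extension
}

printf "%d distinct names, %d distinct extensions\n",
    scalar keys %names, scalar keys %extensions;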

Re: Large file, multi dimensional hash - out of memory (swapping)
by LanX (Canon) on May 15, 2013 at 22:25 UTC
    Just for completeness, an approach for the unsorted case.

    /foo/bar/baz/123 aaa
    /foo/bar/baz/123 aab
    /foo/bar/baz/123 aac
    /foo/bar/baz/124 aaa
    /foo/bar/baz/124 aab

    You could still profit from the OS automatically swapping out unused data structures by using more levels of hashes and splitting the paths at '/'.

    For instance $duplicates{foo}{bar}{baz}{123}++ would only involve some comparatively small hashes kept in memory.
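
    A minimal sketch of how such a structure might be built (my reading of the idea, not LanX's code): split each path on '/' and count extensions at the deepest level:

use strict;
use warnings;

my %duplicates;
while ( my $line = <DATA> ) {
    chomp $line;
    my ( $fname, $fext ) = split ' ', $line;
    my @parts = grep { length } split '/', $fname;   # e.g. foo, bar, baz, 123
    my $node  = \%duplicates;
    $node = $node->{$_} ||= {} for @parts;           # walk/create the hash levels
    $node->{$fext}++;                                # count the extension at the leaf
}

__DATA__
/foo/bar/baz/123 aaa
/foo/bar/baz/123 aab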

    The drawback is that swapping to disk costs time! That's why you should try to minimize swapping by concentrating on only the hashes currently loaded.

    That means you would need multiple runs over your file, each run concentrating on a subset of the "unfinished" hashes (e.g. all paths starting with /foo/bar, and so on).

    I think always loading 1 million lines into memory (i.e. an array) and scanning them in multiple runs could lead to a balanced mix of time and memory complexity.

    Already-scanned elements could then be marked with delete $arr[$index].

    Cheers Rolf

    ( addicted to the Perl Programming Language)

    UPDATE

    I just realised that using multiple hash levels elegantly solves the sorted-input case as well. You won't need multiple runs, because the sorted input will effectively minimize the number of hashes that get swapped.

Re: Large file, multi dimensional hash - out of memory
by sundialsvc4 (Abbot) on May 16, 2013 at 00:15 UTC

    DavidO’s suggestion seems the most appropriate one to me ... with it, “out of memory” ought never to be an issue here.

    If the file is known to be sorted by filename, then the entire set of records for any given name is known to be adjacent, and you only need to remember “the last name seen” and a hash to accumulate and/or to count extensions.   When the name changes, the results for the previous current-name are output, the new current-name is captured, and the hash is reset ... taking proper care to ensure that nothing is lost from either group.

    As an added bonus, the output file will always emerge sorted the same way.   So, you can devise an entire processing sequence that relies upon sort-order, and every step of it might be “very fast and virtually RAM-free.”   You can furthermore rely upon the fact that, for any gaps that you detect in the sequence, there will be no records anywhere in the file that fall within that gap.

    This is a very powerful, and very old, technique that was used as far back as Herman Hollerith’s tabulation of the US Census using punched cards.   Especially when large amounts of data had to be processed on machines whose memory capacity was all but non-existent, so-called “unit-record processing” was the only way such jobs could be done.   Even today, assuming (and maintaining) a certain known sort-order is a powerful technique that is ideal for an application such as this.   If you know that the file is already sorted, you don’t need much memory at all.   (If you knew that it was sorted both by name and extension ... and you should check for this, because there’s a pretty good chance that it is! ... your problem might require almost no memory at all:   the records for each name are adjacent, and within each name, so are the extensions.)

    When you are dealing with data files of this magnitude, and especially if you need to do several things to the data in successive steps, it might (or might not) be justifiable to sort the data, just to pick up an accumulation of efficiencies downstream.   (The only way to establish that is to test it empirically, on your own hardware and network.)   In any case, if the producer of that file gave it to you with a guaranteed sort-order, they just dealt you a huge favor.

    In any case, your code must be defensive:   if your algorithm fundamentally relies upon the proposition that the data is sorted, you must not simply assume it.   You’re looking for the key-values to change, and for each new value to compare “greater than” the previous one.   If it does not, your code must detect it and die ... because the results won’t be reliable and no one else is in a position to know.
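
    A minimal sketch of that defensive check (my illustration, assuming a hypothetical input filehandle $fh): compare each name to the previous one and die the moment the order goes backwards:

# Sketch: enforce the sort-order assumption while streaming the file.
my $prev = '';
while ( my $line = <$fh> ) {
    my ($name) = split ' ', $line;
    die "Input not sorted at line $.: '$name' sorts before '$prev'\n"
        if $name lt $prev;
    $prev = $name;
    # ... accumulate the extension counts for $name here ...
}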

    Note: On a Unix-ish system, the shell command sort -c filename can be used to quickly discover, once and for all, whether the entire file is or is not sorted.   (It can also be used as a quick “sanity-check” in a script that invokes this Perl program, since it does understand space-delimited files and can be instructed to consider the entire file or only a particular key or keys.   “Sniff” the data at the start of the job, and sound the alarm before eating spoiled meat, because any human can make a mistake.)

Re: Large file, multi dimensional hash - out of memory
by Anonymous Monk on May 16, 2013 at 19:05 UTC
    What do you expect your output to look like? I'm unclear what you mean by "distinct names" -- i.e. full path? basename?

    Your code isn't doing any useful summarizing as it reads, and that's the key to staying within a reasonable memory footprint without using a database.
