
Memory utilization and hashes

by bfdi533 (Friar)
on Jan 17, 2018 at 20:53 UTC (#1207429=perlquestion)

bfdi533 has asked for the wisdom of the Perl Monks concerning the following question:

I have some code which reads from a file (sometimes 100+ GB) and has to combine rows to create a consolidated output. I used to process the entire file into a hash and then dump the hash at the end of the program.

The problem with that was, of course, with the very large files, the hash would grow humongous and the program would consume all memory in the system causing it to crash.

So, trying to solve this problem, I changed the code to output the data as it went, doing my best to make sure that I got all of the row data for consolidation, and then did a delete on the hash, thinking I was clearing up memory. But this does not appear to be the case. Example code:

my $l;
my @vals;
my $json;
while (<>) {
    $l = $_;
    chomp $l;
    @vals = split /;/, $l;
    if ($vals[0] =~ /Query/) {
        $pairs{$vals[1]}{$vals[2]} = $vals[3];
    } elsif {$vals[0] =~ /Answer/) {
        $pairs{$vals[1}{$vals[2]} = $vals[3];
        $json = encode_json $pairs{$vals[1]};
        print $json."\n";
        delete $pairs{$vals[1]};
    }
}

Example data:

Query;1;host;
Answer;1;ip;
Query;2;host;
Query;3;host;
Answer;2;ip;
Answer;3;ip;

Does delete actually remove the storage from the hash?

Does the memory the hash is using actually get reduced after delete?

Is there a better way to do this?

Code updated above per the first reply.

Replies are listed 'Best First'.
Re: Memory utilization and hashes
by Laurent_R (Canon) on Jan 17, 2018 at 23:08 UTC
    Given all that you've said so far, especially that it seems you can never be sure you have collected all the answers for a given query, I think I would probably go for a completely different approach.

    I would use the OS's sort utility to reorganize the input file, sorting on the id number (second field). I would then read all the records for a given id number (storing them in an array or a hash), collect the information from the query record and use it to process the answer records. Once I've finished processing an id number, clear the data structures and start again with the lines for the next id number.

    This way, the memory usage of your Perl program will be limited to the maximum number of lines there can be for one id number. (Of course, the sort phase will use a lot of memory, but the *nix sort utilities know well how to handle that, they write temporary data on disk to avoid memory overflow.)

    Sorting your large file will take quite a bit of time, but at least you're guaranteed never to exceed your system's available memory.

    An alternative would be to use a database, but I doubt it would be faster.
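
The sort-then-group idea described in this reply can be sketched roughly as follows. This is only an illustration under assumptions, not the code anyone in the thread actually ran: it assumes the input has already been sorted on the second field (for example with something like `sort -t';' -k2,2n -s` on a GNU system), that each id belongs to a single query, and that records look like `Type;id;key;value`. The name `consolidate_sorted`, the host names and the addresses are made up for the example, and JSON::PP is used only because it ships with Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;    # core module, used here so the sketch has no CPAN dependency

# Hypothetical helper: read records already sorted by id (second field) and
# hand one JSON line per id to the $emit callback. Only the records of the
# current id are held in memory at any time.
sub consolidate_sorted {
    my ($fh, $emit) = @_;
    my $json = JSON::PP->new->canonical;    # sorted keys, stable output
    my ($cur_id, %rec);

    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;          # skip blank lines
        my ($type, $id, $key, $val) = split /;/, $line;
        $val = '' unless defined $val;

        # A new id means the previous group is complete: flush it.
        if (!defined $cur_id or $id ne $cur_id) {
            $emit->($json->encode(\%rec)) if defined $cur_id;
            %rec    = (id => $id);
            $cur_id = $id;
        }

        if ($type eq 'Query') {
            $rec{$key} = $val;
        }
        elsif ($type eq 'Answer') {
            push @{ $rec{Answer} }, { $key => $val };
        }
    }

    # Flush the last group, if there was any input at all.
    $emit->($json->encode(\%rec)) if defined $cur_id;
    return;
}

# Made-up sample, already in id order.
my $sorted = <<'END';
Query;1;host;a.example
Answer;1;ip;192.0.2.1
Query;2;host;b.example
Answer;2;ip;192.0.2.2
Answer;2;ip;192.0.2.3
END

open my $in, '<', \$sorted or die "cannot open in-memory handle: $!";
my @out;
consolidate_sorted($in, sub { push @out, $_[0] });
print "$_\n" for @out;
```

In real use the in-memory handle would be replaced by `\*STDIN` fed from the sort, and the callback would print directly. If ids are reused by later queries, as mentioned elsewhere in the thread, the sort would need an additional key to keep those groups apart.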

      Turns out that the unix sort was exactly the prior step that was missing to help speed this up. With a correct choice of keys, the file now is in sequential order by "ID" and when a new Query comes in, it is now easy to check if the current "ID" = the prior "ID" and flush any accumulated hash entries and continue. This keeps the hash to, in testing, no more than 3-7 'extra' keys for each set of "ID"s in the file and then dumps the set.

      Memory usage has stayed small and the processing is now approx 1/4 the total time of the prior runs.

        What does this sample of data you provided look like after the *nix sort ?

        Query;1;host;
        Answer;1;ip;
        Query;2;host;
        Query;3;host;
        Answer;2;ip;
        Answer;2;ip;
        Query;4;host;
        Answer;4;ip;
        Answer;3;ip;
        Query;2;host;
        Answer;4;ip;
        Answer;2;ip;

        For what it is worth, and if anyone is interested, here are some stats from the processing after I introduced the *nix sort before my Perl script.

         elapsed time    | type      |rows after| rows before| pct   | rows/second 
                         |           |processing| processing |smaller| 
         00:03:05.98667  | dns       |  1791555 |    4614653 | 38.82 | 24811.7405403301
         00:03:50.106203 | dns       |  2262736 |    5822777 | 38.86 |  25304.737221708
         00:04:51.91195  | dns       |  2733705 |    7039758 | 38.83 | 24116.0322487654
         00:05:36.348691 | dns       |  3208365 |    8266995 | 38.81 | 24578.6447850335
         00:06:33.947878 | dns       |  3683419 |    9490938 | 38.81 | 24091.8622234589
         00:07:35.58667  | dns       |  4155971 |   10705249 | 38.82 | 23497.7221787459
         00:08:25.086565 | dns       |  4633553 |   11946401 | 38.79 | 23652.1852447214
         00:09:07.952743 | dns       |  5109618 |   13183845 | 38.76 | 24060.1861536808
         00:10:16.250404 | dns       |  5596902 |   14441405 | 38.76 | 23434.3132373833
         00:10:54.578348 | dns       |  6070888 |   15662586 | 38.76 | 23927.7483709253
         00:11:39.012952 | dns       |  6547181 |   16896184 | 38.75 | 24171.4891714911
         00:12:43.13814  | dns       |  7019314 |   18113219 | 38.75 | 23735.1772249255
         00:13:34.23578  | dns       |  7499659 |   19365386 | 38.73 | 23783.5114541392
         00:14:35.939246 | dns       |  7973633 |   20591767 | 38.72 | 23508.2137191967
         00:15:12.223167 | dns       |  8448494 |   21815382 | 38.73 | 23914.5231004641
         00:15:52.951662 | dns       |  8923786 |   23043433 | 38.73 | 24181.1142357817
         00:17:45.637116 | dns       |  9402613 |   24278649 | 38.73 | 22783.2238906363
         00:17:52.402055 | dns       |  9880079 |   25516948 | 38.72 | 23794.1990888856
Re: Memory utilization and hashes
by BrowserUk (Pope) on Jan 17, 2018 at 21:38 UTC

    Your posted code will not run. You have an error in a variable name here: chomp $; You have unbalanced [] here: %pairs{$l[1}{$l[2]} = $l[3]; And hash element references should start with $ not %.

    In addition, you assign $_ to $l, use it as a scalar: @vals = split /;/, $l; and then index it as an array: %pairs{$l[1]}{$l[2]} = $l[3];

    Use strict; use warnings; Only post code that compiles.

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
Re: Memory utilization and hashes
by Laurent_R (Canon) on Jan 17, 2018 at 21:27 UTC
    I don't understand your code.
    while (<>) {
        $l = $_;
        chomp $;                             # you probably want to chomp $l, or possibly $_ (but you no longer use $_), but not $
        @vals = split /;/, $l;               # you split your line into @vals, but no longer use that variable. Besides,
                                             # declaring @vals with my would be good practice
        if ($l =~ /Query/) {                 # you could use something like: if $vals[0] eq "Query"
            %pairs{$l[1]}{$l[2]} = $l[3];    # where are $l[1], $l[2] and $l[3] coming from? Also, %pairs{...} is probably a syntax error.
        } elsif {$l =~ /Answer/) {           # again, you could use: if $vals[0] eq "Answer". Also, "elsif {..." is a syntax error.
            %pairs{$l[1}{$l[2]} = $l[3];     # again, where are $l[1], $l[2] and $l[3] coming from? Also a syntax error.
            $json = encode_json $pairs{$l[1]};   # given the previous code, I doubt that you really want to encode $pairs{$l[1]}
            print $json."\n";                # is your intent to print to the screen?
            delete $pairs{$l[1]};            # not sure it's needed, since you just reuse the same variable in the next iteration
        }
    }
    Also, I don't understand what's going on when you have two queries or two answers in a row, as in your data example.

    With the code you're showing, the hash should not grow significantly, even without the call to delete. (Update: but this is no longer true with the updated code posted below.)

      Sorry for the typos in the code; fixing them.

      My actual data consists of data from several hundred MB to several hundred GB so that sample data set is just a sample of the sort of thing I am processing.

      The two queries and two answers in a row is what my real world data contains, specifically there can be anywhere from 1 to n answers for each query and the queries and answers occur in any order and the only guarantee is that the answer will follow (sometime later) the query it goes with.

      Max rows in files to process = 31291204, average lines in files 8707186.

        Just to keep track:
        my $l;       # all these three variables should probably better be declared within the
        my @vals;    # while loop. Only %pairs probably needs to be declared before the while
        my $json;
        while (<>) {
            $l = $_;
            chomp $l;
            @vals = split /;/, $l;
            if ($vals[0] =~ /Query/) {
                $pairs{$vals[1]}{$vals[2]} = $vals[3];    # %pairs isn't declared anywhere
            } elsif {$vals[0] =~ /Answer/) {              # syntax error: elsif { should be elsif (
                $pairs{$vals[1}{$vals[2]} = $vals[3];
                $json = encode_json $pairs{$vals[1]};     # what do you think is the content of $pairs{$vals[1]}? Probably not what you want to encode.
                print $json."\n";
                delete $pairs{$vals[1]};
            }
        }
        This will still not compile.

        Do yourself a favor. Use the following pragmas:

        use strict;
        use warnings;

        "specifically there can be anywhere from 1 to n answers for each query"
        Then you can't delete your hash entries as you go, because when a second answer comes of a given query, you no longer have the information from the query available.
        Even with the fixes that you did in the original post, you still have several syntax errors.
Re: Memory utilization and hashes
by poj (Abbot) on Jan 17, 2018 at 22:11 UTC

    Do you need to store the answers ?

    #!perl
    use strict;
    use warnings;
    use JSON;

    my %host = ();
    while (<DATA>) {
        chomp;
        my @f = split /;/, $_;
        if ($f[0] eq 'Query') {
            $host{$f[1]} = $f[3];
        } elsif ($f[0] eq 'Answer') {
            my $json = encode_json { host=>$host{$f[1]},$f[2]=>$f[3] };
            print $json."\n";
            delete $host{$f[1]};
        }
    }
    __DATA__
    Query;1;host;
    Answer;1;ip;
    Query;2;host;
    Query;3;host;
    Answer;2;ip;
    Answer;3;ip;

      Since I need all of the query and answers info on one line in the output, yes, I need to collect them up until I have all of the answers.

      Here is an example that is closer to my working code. I was trying to keep it simple and focus on the memory usage of the hash, but here we are.

      #!/usr/bin/perl
      use warnings;
      use strict;
      $|++;
      use JSON;

      my $l;
      my @vals;
      my $json;
      my %pairs;
      my %pind;
      my %flush;

      while (<DATA>) {
          $l = $_;
          chomp $l;
          @vals = split /;/, $l;
          if ($vals[0] =~ /Query/) {
              if (! $pairs{$vals[1]}) {
                  $pind{$vals[1]} = 0;
              }
              if (!defined $flush{$vals[1]}) {
                  $flush{$vals[1]} = " ";
              } elsif ($flush{$vals[1]} ne $vals[1]) {
                  $json = encode_json $pairs{$vals[1]};
                  print "DEBUG: Flushing \"complete\" answer\n";
                  print $json."\n";
                  delete $pairs{$vals[1]};
                  $flush{$vals[1]} = $vals[1];
                  $pind{$vals[1]} = 0;
              }
              $pairs{$vals[1]}{$vals[2]} = $vals[3];
              $pairs{$vals[1]}{id} = $vals[1];
          } elsif ($vals[0] =~ /Answer/) {
              $pairs{$vals[1]}{$vals[0]}[$pind{$vals[1]}++]{$vals[2]} = $vals[3];
          }
      }

      print "DEBUG: output remaining data ...\n";
      foreach my $key (keys %pairs) {
          $json = encode_json $pairs{$key};
          print $json."\n";
      }
      __DATA__
      Query;1;host;
      Answer;1;ip;
      Query;2;host;
      Query;3;host;
      Answer;2;ip;
      Answer;2;ip;
      Query;4;host;
      Answer;4;ip;
      Answer;3;ip;
      Query;2;host;
      Answer;4;ip;
      Answer;2;ip;

      Results in:

      DEBUG: Flushing "complete" answer
      {"Answer":[{"ip":""},{"ip":""}],"id":"2","host":"www.cnn"}
      DEBUG: output remaining data ...
      {"Answer":[{"ip":""},{"ip":""}],"id":"4","host":""}
      {"Answer":[{"ip":""}],"id":"1","host":""}
      {"Answer":[{"ip":""}],"id":"3","host":""}
      {"Answer":[{"ip":""}],"id":"2","host":""}

        Same idea using one hash.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use JSON;

        my %query = ();
        while (<DATA>) {
            chomp;
            next unless /\S/; # skip blank lines
            my ($s1,$n,$s2,$v2,undef) = split ';',$_,5;
            if ($s1 eq 'Query') {
                if (exists $query{$n}){
                    # print and reuse
                    output($n);
                }
                $query{$n} = [$v2];
            } elsif ($s1 eq 'Answer') {
                push @{$query{$n}},$v2;
            }
        }
        # remaining
        output($_) for keys %query;

        sub output {
            my $n = shift;
            my $host = shift @{$query{$n}};
            print encode_json { id=>$n,host=>$host,ip => $query{$n} };
            print "\n";
        }
Re: Memory utilization and hashes
by karlgoethebier (Abbot) on Jan 18, 2018 at 10:19 UTC
    "... 100+ GB ...combine rows...consolidated output..."

    Life is hard - so perhaps you better go with sqlite?

    See also Re: Reading HUGE file multiple times and Limits In SQLite.

    Best regards, Karl

    P.S.: And remember:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use feature qw(say);
    use Try::Tiny;

    # say $0;

    try {
        ...;
    }
    catch { say $_}

    __END__

    karls-mac-mini:playground karl$ ./
    Unimplemented at ./ line 10.

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

Re: Memory utilization and hashes
by bfdi533 (Friar) on Jan 17, 2018 at 21:44 UTC

    code updated and tested

    #!/usr/bin/perl
    use warnings;
    use strict;
    $|++;
    use JSON;

    my $l;
    my @vals;
    my $json;
    my %pairs;

    while (<>) {
        $l = $_;
        chomp $l;
        @vals = split /;/, $l;
        if ($vals[0] =~ /Query/) {
            $pairs{$vals[1]}{$vals[2]} = $vals[3];
        } elsif ($vals[0] =~ /Answer/) {
            $pairs{$vals[1]}{$vals[2]} = $vals[3];
            $json = encode_json $pairs{$vals[1]};
            print $json."\n";
            delete $pairs{$vals[1]};
        }
    }

    [root@hadron ~]# ./ t-1207429.txt
    {"ip":"","host":""}
    {"ip":"","host":""}
    {"ip":"","host":""}

    The real question is: when running this against a 100GB file with >500000 hash entries, will delete actually reduce the size of the hash or not?

    Or is there a leaner way to do this?

      delete will definitely reduce the size of the hash, because every time you get a first answer for a given query, it will delete the entire entry for that query. Of course, if there's a second answer for the query, it cannot find the entry for the query, so it creates it again, without the host key.

      You might want to expand your example data to include a sample with more than one response (out of order) for the same query (for example, query 2, with two or three rows of answers), and display the output. Then tell us what you want the real output to be, given that set of data. Something like:

      Query;1;host;
      Answer;1;ip;
      Query;2;host;
      Query;3;host;
      Answer;2;ip;
      Answer;3;ip;
      Answer;2;ip;
      Answer;2;ip;
      -----------------------
      {"host":"","ip":""}
      {"ip":"","host":""}
      {"ip":"","host":""}
      {"ip":""}
      {"ip":""}

      Also, for debugging, add print "DEBUG: ", encode_json \%pairs; just before the end of the while loop: that will let you watch the hash grow and shrink, and will tell you whether or not it's doing the right thing

        Right, so it is much more complicated in my real code. I create an array for the multiple answers as such and am doing some funky checks to print out the info because the index number can be reused. So, say index 2 has an answer provided, then 2 can be re-used in another query. I then dump what is left of the hash at the end of the code for those items that did not get re-used and replaced.

        Like I said, it is really messy in "real life".

        I will provide example code that is closer to my real code shortly, but my real question is, I suppose, whether a hash is the right way to do this after all, given the memory issues and such.

      I don't think delete shrinks the hash per se. Certain hash admin is performed to mark hash entries unused, etc. Some linked memory (references) may become free.

      But the only way to shrink the hash is to make a new hash, and copy over the "trimmed" old hash, and then throw away the old hash.

      You should be able to make a test case for this, showing the size of a hash does not shrink after deletes, and that total process memory doesn't shrink, but only grows. It is up to you and Perl to make efficient use of an ever growing pile of memory allocated by the OS.
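
A test case along the lines suggested here might look like the sketch below. It is an illustration under assumptions rather than a measurement of process memory: it only looks at the hash's bucket array, using `bucket_ratio` from the core Hash::Util module (assumed to be available, which means Perl 5.26 or later), and `total_buckets` is a helper name invented for the example. One key is kept after the deletes so that `bucket_ratio` still reports a "used/total" string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Hash::Util qw(bucket_ratio);    # core module; assumes Perl 5.26 or later

# Hypothetical helper: number of allocated buckets, parsed from the
# "used/total" string returned by bucket_ratio (0 if no ratio is reported).
sub total_buckets {
    my ($href) = @_;
    my $ratio = bucket_ratio(%$href);
    return 0 unless defined $ratio && $ratio =~ m{/(\d+)$};
    return $1;
}

my %h;
$h{$_} = 1 for 1 .. 100_000;
my $before = total_buckets(\%h);

# Delete everything except one key.
delete $h{$_} for 2 .. 100_000;
my $after_delete = total_buckets(\%h);

# Copy the surviving entry into a fresh hash, as the reply suggests.
my %fresh = %h;
my $after_copy = total_buckets(\%fresh);

printf "keys left: %d\n",            scalar keys %h;
printf "buckets before delete: %d\n", $before;
printf "buckets after delete:  %d\n", $after_delete;
printf "buckets in fresh copy: %d\n", $after_copy;
```

The deleted keys and values are released for reuse by Perl, so delete is still worthwhile; what this sketch is meant to show is only that the bucket array itself stays at its high-water mark until the hash is replaced or undefined.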

      Quantum Mechanics: The dreams stuff is made of

Re: Memory utilization and hashes
by pwagyi (Monk) on Jan 18, 2018 at 02:39 UTC
    I think it may be appropriate to use database.
Re: Memory utilization and hashes
by QM (Parson) on Jan 26, 2018 at 10:11 UTC
    I have used DBM::Deep to store native Perl hashes on disk persistently. And hashes of hashes, and hashes of arrays of ... you get the idea. It solves your problem.

    It will be some factor slower (say, 5-10x) because of disk writes. There is a maximum file size, so depending on your data, you may need multiple subhashes each mapped to its own file.

    But if your problem is easily solved another way, staying in memory, you'll probably be happier.

    Quantum Mechanics: The dreams stuff is made of

Re: Memory utilization and hashes
by ikegami (Pope) on Jan 18, 2018 at 19:41 UTC

    You could use a database (like SQLite).

    Upd: Woops, I just noticed someone already suggested this.

Re: Memory utilization and hashes
by Anonymous Monk on Jan 18, 2018 at 15:39 UTC
    SQLite is exactly what I would recommend in this case: "it's just a disk file," but it's ideally suited to this sort of thing. You can import data very rapidly into an SQLite table, and you can also use its ATTACH DATABASE feature to work with more than one database (file ...) at a time. It has a very fast indexer and a good query engine, and it won't blink at all when dealing with this number of rows. And, since you can easily use them with spreadsheets and so-forth, you might well find that your need for custom programming is severely reduced or even eliminated. Hands down, this is the way I would do this.

      Not a bad thought but you might notice that I had an array in my hash which I needed in the JSON output:


      This is certainly doable in a database (SQLite or PostgreSQL) but would involve another table and then a complicated query to get into the proper format to make it into JSON.

      Not as easy as it sounds in my specific use case, but certainly something I had considered at one point.

      Thanks for the pointer in this direction and the friendly reminder.

      2018-01-28 Athanasius changed pre to code tags

Node Type: perlquestion [id://1207429]
Approved by Discipulus
Front-paged by haukex