Re: Memory utilization and hashes

Code updated and tested:

#!/usr/bin/perl

use warnings;
use strict;
$|++;    # unbuffered output

use JSON;

my %pairs;    # pending records, keyed by query index

while (my $l = <>) {
    chomp $l;
    # Fields: type;index;key;value
    my @vals = split /;/, $l;
    if ($vals[0] =~ /Query/) {
        $pairs{$vals[1]}{$vals[2]} = $vals[3];
    } elsif ($vals[0] =~ /Answer/) {
        $pairs{$vals[1]}{$vals[2]} = $vals[3];
        # An answer completes the record: emit it, then drop it from the hash
        print encode_json($pairs{$vals[1]}), "\n";
        delete $pairs{$vals[1]};
    }
}

[root@hadron ~]# ./t-1207429.pl t-1207429.txt
{"ip":"1.2.3.4","host":"www.example.com"}
{"ip":"2.3.4.5","host":"www.cnn.com"}
{"ip":"3.4.5.6","host":"www.google.com"}

The real question is: when running this against a 100GB file with >500000 hash entries, will delete actually reduce the size of the hash or not?

Or is there a leaner way to do this?


Re^2: Memory utilization and hashes by pryrt (Abbot) on Jan 17, 2018 at 21:57 UTC
`delete` will definitely reduce the size of the hash, because every time you get a first answer for a given query, it will delete the entire entry for that query. Of course, if there's a second answer for the query, it cannot find the entry for the query, so it creates it again, without the `host` key.

You might want to expand your example data to include a sample with more than one response (out of order) for the same query (for example, query 2, with two or three rows of answers), and display the output. Then tell us what you want the real output to be, given that set of data. Something like:

Query;1;host;www.example.com
Answer;1;ip;1.2.3.4
Query;2;host;www.cnn.com
Query;3;host;www.google.com
Answer;2;ip;2.3.4.5
Answer;3;ip;3.4.5.6
Answer;2;ip;9.8.7.6
Answer;2;ip;5.4.3.2
-----------------------
{"host":"www.example.com","ip":"1.2.3.4"}
{"ip":"2.3.4.5","host":"www.cnn.com"}
{"ip":"3.4.5.6","host":"www.google.com"}
{"ip":"9.8.7.6"}
{"ip":"5.4.3.2"}

Also, for debugging, add `print "DEBUG: ", encode_json \%pairs;` just before the end of the while loop: that will let you watch the hash grow and shrink, and will tell you whether or not it's doing the right thing.
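To make the re-creation effect described above concrete, here is a minimal sketch that feeds the sample rows through the same logic as the posted script. It is an illustration only: the rows are inlined in a hypothetical `@lines` array instead of being read from a file, and the core module JSON::PP (with `canonical` for sorted, repeatable key order) is swapped in for `JSON`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;    # core module, swapped in for JSON (assumption, not the OP's choice)

my $json = JSON::PP->new->canonical;    # canonical: keys sorted, output repeatable

# Sample rows from the reply above, inlined for the sketch
my @lines = (
    'Query;1;host;www.example.com',
    'Answer;1;ip;1.2.3.4',
    'Query;2;host;www.cnn.com',
    'Query;3;host;www.google.com',
    'Answer;2;ip;2.3.4.5',
    'Answer;3;ip;3.4.5.6',
    'Answer;2;ip;9.8.7.6',
    'Answer;2;ip;5.4.3.2',
);

my (%pairs, @out);
for my $l (@lines) {
    my ($type, $idx, $key, $val) = split /;/, $l;
    $pairs{$idx}{$key} = $val;    # autovivifies $pairs{$idx} if it was deleted
    next unless $type =~ /Answer/;
    push @out, $json->encode($pairs{$idx});
    delete $pairs{$idx};          # drop the whole entry for this index
}

print "$_\n" for @out;
printf "entries left in %%pairs: %d\n", scalar keys %pairs;
```

The last two records come out as `{"ip":"9.8.7.6"}` and `{"ip":"5.4.3.2"}`: by the time those answers arrive, the entry for query 2 has already been deleted, so a fresh one is created without the `host` key.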
Re^3: Memory utilization and hashes by bfdi533 (Friar) on Jan 17, 2018 at 22:12 UTC
Right, so it is much more complicated in my real code. I create an array for the multiple answers and am doing some funky checks to print out the info, because the index number can be reused. So, say index 2 has an answer provided; then 2 can be re-used in another query. I then dump what is left of the hash at the end of the code for those items that did not get re-used and replaced. Like I said, it is really messy in "real life". I will provide example code that is closer to my real code shortly, but my real question is, I suppose, whether a hash is the right way to do this after all, given memory issues and such.
Re^4: Memory utilization and hashes by bfdi533 (Friar) on Jan 17, 2018 at 22:16 UTC
I did try to use Devel::Size to see if the memory actually goes down, so I am writing the size of the hash to a log file every time I "dump" a line, and the size has never decreased in all the time I have been testing it.

Here is an example. The first column is the line count into the file being processed, the second is the index (equivalent to $vals[1]), and the last is the size of the %pairs hash. Here the size is 122MB for the %pairs hash ...

...
424872: e5c651161 (122480629)
424875: 6d6148148 (122481928)
424886: 108038067 (122484667)
424890: 4db238067 (122487257)
424892: 502c57487 (122488556)
424895: c53c57539 (122489855)
424896: 578757487 (122489855)
424923: 300959147 (122495178)
424928: a9bb41168 (122496165)
424936: dfc243245 (122499555)
424937: 0a9534098 (122499555)
424944: 666b34098 (122501654)
424954: 494949982 (122504073)
424956: 182939296 (122505372)
424960: c1ad46207 (122507962)
424962: 3d1249982 (122507962)
424968: 3c1336561 (122512355)
424974: b24939296 (122514993)
424987: 3c7b36561 (122517700)
424998: eb1544993 (122520311)
425005: 818a49369 (122521727)
...
Re^5: Memory utilization and hashes by Laurent_R (Canon) on Jan 17, 2018 at 22:40 UTC
Re^6: Memory utilization and hashes by bfdi533 (Friar) on Jan 17, 2018 at 22:47 UTC
Re^2: Memory utilization and hashes by QM (Parson) on Jan 26, 2018 at 10:16 UTC
I don't think `delete` shrinks the hash per se. Certain hash admin is performed to mark hash entries unused, etc. Some linked memory (references) may become free. But the only way to shrink the hash is to make a new hash, and copy over the "trimmed" old hash, and then throw away the old hash.

You should be able to make a test case for this, showing the size of a hash does not shrink after deletes, and that total process memory doesn't shrink, but only grows. It is up to you and Perl to make efficient use of an ever growing pile of memory allocated by the OS.

-QM
--
Quantum Mechanics: The dreams stuff is made of
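A test case along those lines might look roughly like the sketch below. It only looks at the hash's bucket array, using `num_buckets` from the core module Hash::Util (assumed here to need Perl 5.26 or later), not at total process memory or the Devel::Size figures quoted earlier in the thread. The variable names and the 100,000-entry size are arbitrary choices for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Hash::Util qw(num_buckets);    # core module; num_buckets assumed available (Perl 5.26+)

# Fill a hash with many small records, similar in shape to %pairs
my %pairs;
$pairs{$_} = { host => "host$_" } for 1 .. 100_000;
my $before = num_buckets(%pairs);

# Delete all but ten entries: the values are released,
# but the bucket array itself is not resized downward
delete $pairs{$_} for 11 .. 100_000;
my $after = num_buckets(%pairs);

# Copy the survivors into a fresh hash, which is sized for what it holds
my %trimmed = %pairs;
my $copied  = num_buckets(%trimmed);

printf "buckets before delete: %d\n", $before;
printf "buckets after delete:  %d\n", $after;
printf "buckets in fresh copy: %d\n", $copied;
```

If the bucket count stays the same after the deletes and only the fresh copy is small, that matches the copy-and-discard approach described above. Memory held by the deleted values is a separate matter: Perl can reuse it for new entries even when it is not returned to the OS.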