PerlMonks
How to save memory, parsing a big file.

by idle (Friar)
on Mar 01, 2006 at 09:56 UTC [id://533663]


idle has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks.
I wrote a little script, for sorting out some traffic statistics. It parse log file putting all into big hash(I guess its gettin really big) and then print formatted output. The problem is when size of log files exceed about 200mb, it eats all the memory(512mb) and I got error: "Out of memory during request for 1012 bytes, total sbrk() is 536797184 bytes!"
What should I do with that?
Here is my code:
    while (<LOG>) {
        my ($source, $sport, $to, $dport, $proto, $packs, $bytes) = split;
        if ( $total{"$source$dport"}{'from'} ) {        # if source exists
            $connects{"$source$dport"}++;
            if ( $total{"$source$dport"}{'to'} ) {      # and destination exists - summarizing
                $total{"$source$dport"}{'bytes'} += $bytes;
            }
            else { print "oops"; }  # it seems we can't get here, but I don't get why...
        }
        else {
            $total{"$source$dport"} = {
                "from"  => $source,
                "to"    => "$to:$dport",
                "bytes" => $bytes,
            };
        }
        $total += $bytes;
    }
The log looks like this:
10.0.1.10 2484 10.10.10.5 445 6 1 48

Replies are listed 'Best First'.
Re: How to save memory, parsing a big file.
by dragonchild (Archbishop) on Mar 01, 2006 at 10:17 UTC
    use DBM::Deep;
    tie my %total, 'DBM::Deep', $filename;

    Now, you're limited to disk size, not RAM size.
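    A slightly fuller sketch of how the OP's lookup might work with a tied hash (my own illustration, not dragonchild's code; DBM::Deep is a CPAN module, and the database file name is arbitrary):

    ```perl
    use strict;
    use warnings;
    use DBM::Deep;

    # The hash now lives in traffic.db on disk instead of in RAM;
    # nested structures work transparently through the tie.
    tie my %total, 'DBM::Deep', 'traffic.db';

    my $key = '10.0.1.10' . '445';
    if ( exists $total{$key} ) {
        $total{$key}{bytes} += 48;
    }
    else {
        $total{$key} = { from => '10.0.1.10', to => '10.10.10.5:445', bytes => 48 };
    }
    ```

    The rest of the script is unchanged: the tie makes the disk file look like an ordinary hash.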


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      Is there a way to hold the entire log contents in a hash or array without using much memory, since the file is so large?
        The node you replied to is the answer to your question. DBM::Deep is exactly the way to have a hash use disk instead of RAM.

Re: How to save memory, parsing a big file.
by mirod (Canon) on Mar 01, 2006 at 10:21 UTC

    If it is too big to fit in memory... then you have to use the disk! The easiest to use is probably GDBM_File, which will let you tie the hash to a disk file (but you will need to serialize the values of the hash). You could also go for a full DBMS; DBD::SQLite is very convenient, as the DB is a single file. It is pretty fast too.

    In any case you will need to rewrite your code, and it will probably take much longer to run; there is no miracle here!
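    As a rough illustration of the DBD::SQLite route (my own sketch, not mirod's code; assumes DBD::SQLite is installed, an SQLite recent enough for UPSERT, and that LOG is opened as in the original script):

    ```perl
    use strict;
    use warnings;
    use DBI;

    # One file on disk holds the whole aggregation.
    my $dbh = DBI->connect('dbi:SQLite:dbname=traffic.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS total (
            source TEXT, dport INTEGER, dest TEXT,
            bytes  INTEGER, connects INTEGER,
            PRIMARY KEY (source, dport)
        )
    });

    my $upsert = $dbh->prepare(q{
        INSERT INTO total (source, dport, dest, bytes, connects)
        VALUES (?, ?, ?, ?, 1)
        ON CONFLICT (source, dport) DO UPDATE SET
            bytes    = bytes + excluded.bytes,
            connects = connects + 1
    });

    while (<LOG>) {
        my ($source, $sport, $to, $dport, $proto, $packs, $bytes) = split;
        $upsert->execute($source, $dport, "$to:$dport", $bytes);
    }
    $dbh->commit;
    ```

    Batching the whole run in one transaction (AutoCommit => 0, one commit at the end) is what keeps this reasonably fast.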

      DBM::Deep will avoid both those problems. See my reply above.

Re: How to save memory, parsing a big file.
by duff (Parson) on Mar 01, 2006 at 10:28 UTC

    I don't know if this is still the case, but it used to be that when perl grew data structures for you, it would double the amount of memory in use each time, even if you really only needed one more element. You can give your hash(es) a good number of buckets to start with by assigning to keys, thus:

    keys %total = 500;
    Where 500 is the number of buckets you think your hash is likely to have (you'll have to determine this empirically).
Re: How to save memory, parsing a big file.
by BrowserUk (Patriarch) on Mar 01, 2006 at 11:20 UTC

    BTW. You should probably be using exists in your conditions to prevent autovivifying keys that you don't use:

    if( exists $total{"$source$dport"}{'from'} ) {

    Without the exists, you will autovivify a key ("$source$dport") in %total that will have an anonymous hash assigned to it.

    print exists $h{a} ? 1 : 0;;
    0
    print $h{a}{b} ? 1 : 0;;
    0
    print exists $h{a} ? 1 : 0;;
    1
    print $h{a};;
    HASH(0x18eee74)

    I don't think it will make much, if any, difference to your memory consumption in this case, as you will go on to assign to the key anyway; but getting into the habit of using exists is a good thing in the long run, as even empty anonymous hashes consume a fair amount of space.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: How to save memory, parsing a big file.
by graff (Chancellor) on Mar 01, 2006 at 22:02 UTC
    Definitely go with a DBM approach as described above, to move the hash structure to disk. Apart from that, I'm wondering why you use two different hashes with identical keys (%total and %connects), and why you test a condition that would obviously never be false (if $total{foobar}{from} exists, there's no point testing whether $total{foobar}{to} doesn't exist, since "from" and "to" are both assigned at the same time).

    I think the following would be equivalent to the OP code in terms of what it does, but might take less memory and might run a bit faster:

    while (<LOG>) {
        my ($source, $sport, $to, $dport, $proto, $packs, $bytes) = split;
        my $key = "$source$dport";
        if ( exists( $total{$key}{from} ) ) {
            $total{$key}{connects}++;
            $total{$key}{bytes} += $bytes;
        }
        else {
            $total{$key} = {
                from  => $source,
                to    => "$to:$dport",
                bytes => $bytes,
            };   # maybe should set 'connects => 1' as well?
        }
        $total += $bytes;
    }

    Here are a few (potentially meaningless) benchmarks about the trade-off between more top-level (simple, flat) hashes vs. a single top-level hash with more sub-hash keys (I put a "sleep" in there so I could study the memory/time consumption once the hashes were filled):

    perl -e '$k="aaaaa"; for $i (1..1_000_000) { $h1{$k}={foo=>"bar",bar=>"foo",iter=>$i}; $h1{$k}{total}++; $k++ } sleep 20'
    ## consumes 344 MB in ~14.4 sec

    perl -e '$k="aaaaa"; for $i (1..1_000_000) { $h1{$k}={foo=>"bar",bar=>"foo",iter=>$i}; $h2{$k}++; $k++ } sleep 20'
    ## consumes 352 MB in ~15.0 sec

    perl -e '$k="aaaaa"; for $i (1..1_000_000) { $h1{$k}={foo=>"bar",bar=>"foo"}; $h2{$k}++; $h3{$k}=$i; $k++ } sleep 20'
    ## consumes 360 MB in ~16.5 sec
    So given that you are using one HoH already, there's a slight advantage in not creating a second (or third) hash with the same set of primary keys -- better to add another key to the sub-hash instead.
Re: How to save memory, parsing a big file.
by salva (Canon) on Mar 02, 2006 at 05:00 UTC
    Other monks have already suggested using an on-disk database, but there are other ways to reduce memory consumption:

    The first thing you can do is try to rewrite your algorithm to process the input as sequentially as possible. I can't help you much here because I don't fully understand what you want as the final result of the processing, but if you would tell us about it...

    You can also store the data in the hash in packed form (using pack), both for the keys and the values. The key will consume 6 bytes and the value 10 bytes, plus the scalar (SV) overhead (you don't need to store $source both in the key and in the value). That would probably reduce your memory requirements to 1/10.
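    A sketch of that idea (my own illustration, not salva's exact layout): pack the 4-byte source IP plus a 16-bit destination port into a 6-byte key, and the 4-byte target IP, 16-bit port, and 32-bit byte count into a 10-byte value:

    ```perl
    use strict;
    use warnings;
    use Socket qw(inet_aton inet_ntoa);

    my %total;

    # 6-byte key: 4-byte source IP followed by the 16-bit destination port.
    my $key = inet_aton('10.0.1.10') . pack('n', 445);

    # 10-byte value: 4-byte target IP, 16-bit port, 32-bit byte counter.
    $total{$key} = pack('a4 n N', inet_aton('10.10.10.5'), 445, 48);

    # To accumulate, unpack the value, adjust the counter, and repack:
    my ($ip, $port, $count) = unpack('a4 n N', $total{$key});
    $total{$key} = pack('a4 n N', $ip, $port, $count + 100);

    printf "%s:%d -> %d bytes\n", inet_ntoa($ip), $port, $count + 100;
    # prints "10.10.10.5:445 -> 148 bytes"
    ```

    Each entry is then one short string scalar instead of a nested anonymous hash, which is where most of the savings come from.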

    Regarding the code you have posted, making the keys "$source$dport" is ambiguous: for instance, the key 10.0.1.101445 could mean 10.0.1.10 with port 1445, 10.0.1.101 with port 445, or 10.0.1.1014 with port 45. Better to use $total{$source,$dport}, which is equivalent to $total{join($;, $source, $dport)} (the subscript separator $; defaults to "\034").
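    A quick demonstration of the multi-dimensional key syntax (my own example; since the separator $; defaults to "\034", which can never appear in an IP address or a port number, the ambiguity goes away):

    ```perl
    use strict;
    use warnings;

    my %total;
    my ($source, $dport) = ('10.0.1.10', 1445);

    # $total{$a,$b} is shorthand for $total{join($;, $a, $b)}.
    $total{$source, $dport} = 48;

    print $total{$source, $dport}, "\n";              # 48
    print $total{ join($;, $source, $dport) }, "\n";  # 48 - same slot

    # "10.0.1.101" with port 445 now gets a distinct key:
    $total{'10.0.1.101', 445} = 99;
    print scalar keys %total, "\n";                   # 2
    ```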

    Finally, you are using only the source IP and the destination port as keys for %total, and storing the target IP in the value. That doesn't make sense to me, as generally there could be several different target IPs. For instance, a user (source IP) browsing the web (port 80) would be accessing several servers (different target IPs).

Thanks everyone.
by idle (Friar) on Mar 03, 2006 at 03:16 UTC
    Thanks everyone for lighting my way.

    How simple things seem once you know them, and how hard it is to learn them.

Node Type: perlquestion [id://533663]
Approved by marto