PerlMonks  

Re: Memory utilization and hashes

by Laurent_R (Canon)
on Jan 17, 2018 at 23:08 UTC


in reply to Memory utilization and hashes

Given everything you've said so far, and especially since it seems you can never be sure that you have collected all the answers for a given query, I think I would go for a completely different approach.

I would use the OS's sort utility to reorganize the input file, sorting on the id number (second field). I would then read all the records for a given id number (storing them in an array or a hash), collect the information from the query record, and use it to process the answer records. Once I've finished processing an id number, I would clear the data structures and start again with the lines for the next id number.

This way, the memory usage of your Perl program is bounded by the maximum number of lines there can be for one id number. (Of course, the sort phase will use a lot of memory, but the *nix sort utilities handle that well: they write temporary data to disk to avoid running out of memory.)
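The read-one-id-at-a-time loop described above can be sketched roughly as follows. This is only an illustration in Python rather than Perl, and it assumes the file has already been sorted on the second field (for instance with something like `sort -t';' -k2,2n`). The function name `group_by_id` and the sample lines are hypothetical; the field layout is taken from the sample data in this thread.

```python
from itertools import groupby


def group_by_id(sorted_lines):
    """Yield (id, queries, answers) for each run of lines sharing an id."""
    records = (line.strip().split(";") for line in sorted_lines if line.strip())
    # groupby only merges *adjacent* equal keys, hence the pre-sort requirement.
    for rec_id, group in groupby(records, key=lambda rec: rec[1]):
        queries, answers = [], []
        for rec in group:
            (queries if rec[0] == "Query" else answers).append(rec)
        # Only the records of one id are held in memory at this point.
        yield rec_id, queries, answers


# Hypothetical input, already sorted by id; within an id the order is arbitrary.
sample = [
    "Answer;1;ip;1.2.3.4",
    "Query;1;host;www.example.com",
    "Query;3;host;www.google.com",
    "Answer;3;ip;3.4.5.6",
]

pairs = []
for rec_id, queries, answers in group_by_id(sample):
    host = queries[0][3] if queries else None
    pairs.extend((rec_id, host, ans[3]) for ans in answers)

print(pairs)
```

Because each group is fully consumed before the next one starts, it does not matter whether the query line sorts before or after its answers within one id.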

Sorting your large file will take quite a bit of time, but at least you're guaranteed never to exceed your system's available memory.

An alternative would be to use a database, but I doubt it would be faster.

Replies are listed 'Best First'.
Re^2: Memory utilization and hashes
by bfdi533 (Friar) on Jan 18, 2018 at 17:34 UTC

    Turns out that the unix sort was exactly the missing prior step needed to speed this up. With a correct choice of keys, the file is now in sequential order by "ID". When a new Query comes in, it is easy to check whether the current "ID" differs from the prior "ID" and, if so, flush any accumulated hash entries and continue. In testing, this keeps the hash to no more than 3-7 'extra' keys for each set of "ID"s in the file before the set is dumped.

    Memory usage has stayed small, and the processing now takes approximately 1/4 of the total time of the prior runs.

      What does the sample of data you provided look like after the *nix sort?

      Query;1;host;www.example.com
      Answer;1;ip;1.2.3.4
      Query;2;host;www.cnn.com
      Query;3;host;www.google.com
      Answer;2;ip;2.3.4.5
      Answer;2;ip;2.3.4.5
      Query;4;host;www.google.com
      Answer;4;ip;3.4.5.6
      Answer;3;ip;3.4.5.6
      Query;2;host;www.example2.com
      Answer;4;ip;1.2.4.5
      Answer;2;ip;2.3.4.5
      poj
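One way to explore this question without running the real pipeline is a small stand-in for the sort step. The sketch below is Python, not the poster's actual command. It sorts the sample numerically on the ID field and relies on `sorted()` being stable, which roughly corresponds to a stable sort such as GNU sort's `-s` option; a default `sort` may order ties differently.

```python
sample = """\
Query;1;host;www.example.com
Answer;1;ip;1.2.3.4
Query;2;host;www.cnn.com
Query;3;host;www.google.com
Answer;2;ip;2.3.4.5
Answer;2;ip;2.3.4.5
Query;4;host;www.google.com
Answer;4;ip;3.4.5.6
Answer;3;ip;3.4.5.6
Query;2;host;www.example2.com
Answer;4;ip;1.2.4.5
Answer;2;ip;2.3.4.5
""".splitlines()

# Sort numerically on the second field; ties keep their original file order.
by_id = sorted(sample, key=lambda line: int(line.split(";")[1]))
print("\n".join(by_id))
```

With this sample, the lines for ID 2 end up as two queries (www.cnn.com and www.example2.com) interleaved with three answers, so the ID alone is not enough to tell which answer belongs to which query.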

        The sample data is actually missing a field: in the real data file, each entry also includes its date and time.

        Once the file is sorted by date and ID, I can be sure that if the date changes and the ID changes as well, there are no more answers to be had, so I can dump the data, empty the hash and move on.

        The real file is more like this once sorted:

        2018-01-25 01:01:01;Query;1;host;www.example.com
        2018-01-25 01:01:01;Answer;1;ip;1.2.3.4
        2018-01-25 01:01:05;Query;2;host;www.cnn.com
        2018-01-25 01:01:05;Answer;2;ip;2.3.4.5
        2018-01-25 01:01:05;Answer;2;ip;2.3.4.5
        2018-01-25 01:01:06;Query;3;host;www.google.com
        2018-01-25 01:01:06;Answer;3;ip;3.4.5.6
        2018-01-25 01:01:08;Query;4;host;www.google.com
        2018-01-25 01:01:08;Answer;4;ip;3.4.5.6
        2018-01-25 01:01:08;Answer;4;ip;1.2.4.5
        2018-01-25 01:01:11;Query;2;host;www.example2.com
        2018-01-25 01:01:11;Answer;2;ip;2.3.4.5
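The dump-and-empty rule described above (flush once both the date and the ID have changed) might look roughly like this. It is a Python sketch rather than the actual Perl script, which is not shown in the thread. The function name `pair_with_timestamps` and the shape of the emitted entries are assumptions made for illustration.

```python
def pair_with_timestamps(sorted_lines):
    """Yield {'host': ..., 'ips': [...]} entries from date/ID-sorted lines."""
    pending = {}                      # id -> entry still collecting answers
    prior_stamp = prior_id = None
    for line in sorted_lines:
        line = line.strip()
        if not line:
            continue
        stamp, kind, rec_id, _field, value = line.split(";")
        # Both the date and the ID changed: nothing more can arrive for the
        # pending entries, so dump them and empty the hash.
        if prior_id is not None and stamp != prior_stamp and rec_id != prior_id:
            yield from pending.values()
            pending.clear()
        if kind == "Query":
            pending[rec_id] = {"host": value, "ips": []}
        elif rec_id in pending:
            pending[rec_id]["ips"].append(value)
        prior_stamp, prior_id = stamp, rec_id
    yield from pending.values()       # final flush at end of input


sorted_sample = """\
2018-01-25 01:01:01;Query;1;host;www.example.com
2018-01-25 01:01:01;Answer;1;ip;1.2.3.4
2018-01-25 01:01:05;Query;2;host;www.cnn.com
2018-01-25 01:01:05;Answer;2;ip;2.3.4.5
2018-01-25 01:01:05;Answer;2;ip;2.3.4.5
2018-01-25 01:01:06;Query;3;host;www.google.com
2018-01-25 01:01:06;Answer;3;ip;3.4.5.6
2018-01-25 01:01:08;Query;4;host;www.google.com
2018-01-25 01:01:08;Answer;4;ip;3.4.5.6
2018-01-25 01:01:08;Answer;4;ip;1.2.4.5
2018-01-25 01:01:11;Query;2;host;www.example2.com
2018-01-25 01:01:11;Answer;2;ip;2.3.4.5
""".splitlines()

results = list(pair_with_timestamps(sorted_sample))
for entry in results:
    print(entry["host"], entry["ips"])
```

In this sketch the reused ID 2 no longer causes confusion: the www.cnn.com entry is flushed long before the later www.example2.com query arrives, and the hash never holds more than one entry for this sample.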

      For what it is worth, and if anyone is interested, here are some stats from the processing after I introduced the *nix sort before my Perl script.

       elapsed time    | type      |rows after| rows before| pct of| rows/second
                       |           |processing| processing | before| (rows before / elapsed seconds)
       00:03:05.98667  | dns       |  1791555 |    4614653 | 38.82 | 24811.7405403301
       00:03:50.106203 | dns       |  2262736 |    5822777 | 38.86 |  25304.737221708
       00:04:51.91195  | dns       |  2733705 |    7039758 | 38.83 | 24116.0322487654
       00:05:36.348691 | dns       |  3208365 |    8266995 | 38.81 | 24578.6447850335
       00:06:33.947878 | dns       |  3683419 |    9490938 | 38.81 | 24091.8622234589
       00:07:35.58667  | dns       |  4155971 |   10705249 | 38.82 | 23497.7221787459
       00:08:25.086565 | dns       |  4633553 |   11946401 | 38.79 | 23652.1852447214
       00:09:07.952743 | dns       |  5109618 |   13183845 | 38.76 | 24060.1861536808
       00:10:16.250404 | dns       |  5596902 |   14441405 | 38.76 | 23434.3132373833
       00:10:54.578348 | dns       |  6070888 |   15662586 | 38.76 | 23927.7483709253
       00:11:39.012952 | dns       |  6547181 |   16896184 | 38.75 | 24171.4891714911
       00:12:43.13814  | dns       |  7019314 |   18113219 | 38.75 | 23735.1772249255
       00:13:34.23578  | dns       |  7499659 |   19365386 | 38.73 | 23783.5114541392
       00:14:35.939246 | dns       |  7973633 |   20591767 | 38.72 | 23508.2137191967
       00:15:12.223167 | dns       |  8448494 |   21815382 | 38.73 | 23914.5231004641
       00:15:52.951662 | dns       |  8923786 |   23043433 | 38.73 | 24181.1142357817
       00:17:45.637116 | dns       |  9402613 |   24278649 | 38.73 | 22783.2238906363
       00:17:52.402055 | dns       |  9880079 |   25516948 | 38.72 | 23794.1990888856
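For readers wondering how the last two columns relate to the others, the figures in the first row appear to follow from simple arithmetic: the percentage is rows after divided by rows before, and the rate is rows before divided by elapsed seconds. The short Python check below is inferred from the numbers themselves, not from the poster's script, and `elapsed_seconds` is a hypothetical helper name.

```python
def elapsed_seconds(hms):
    """Convert an 'HH:MM:SS.fraction' string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Values copied from the first row of the table.
rows_after, rows_before = 1791555, 4614653
secs = elapsed_seconds("00:03:05.98667")

pct = 100 * rows_after / rows_before      # output size as a share of input
rate = rows_before / secs                 # input rows handled per second

print(f"{pct:.2f}% of the input rows remain, {rate:.1f} rows/second")
```

Read this way, the output is about 39% of the size of the input, and throughput stays close to 24,000 input rows per second across all runs.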
      
      
