Re: Memory utilization and hashes

Given everything you've said so far, especially that you apparently can never be sure you have collected all the answers for a given query, I think I would go for a completely different approach.

I would use the OS's sort utility to reorganize the input file, sorting on the id number (second field). I would then read all the records for a given id number (storing them in an array or a hash), collect the information from the query record and use it to process the answer records. Once I've finished processing an id number, I would clear the data structures and start again with the lines for the next id number.
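A rough sketch of that read loop, assuming the `Type;id;key;value` layout from the sample data and an input that is already sorted on the id. The helper name `process_sorted`, the callback interface and the output format are made up for illustration; they are not taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: read records already sorted on the id (second field)
# and hand each id's records to a callback, one group at a time.
sub process_sorted {
    my ($fh, $callback) = @_;
    my ($current_id, @group);
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;
        my ($type, $id, $key, $value) = split /;/, $line, 4;
        if (defined $current_id && $id ne $current_id) {
            # The id changed: the previous group is complete.
            $callback->($current_id, \@group);
            @group = ();    # the callback must not keep the reference
        }
        $current_id = $id;
        push @group, { type => $type, key => $key, value => $value };
    }
    $callback->($current_id, \@group) if @group;    # flush the last group
    return;
}

# Small in-memory stand-in for the sorted file.
my $sorted = join "\n",
    'Query;1;host;www.example.com',
    'Answer;1;ip;1.2.3.4',
    'Query;2;host;www.cnn.com',
    'Answer;2;ip;2.3.4.5',
    'Query;3;host;www.google.com',
    'Answer;3;ip;3.4.5.6';
open my $fh, '<', \$sorted or die "Cannot open in-memory file: $!";

my @results;
process_sorted($fh, sub {
    my ($id, $records) = @_;
    my ($query) = grep { $_->{type} eq 'Query' } @$records;
    return unless $query;    # answers without a query are skipped here
    for my $answer (grep { $_->{type} eq 'Answer' } @$records) {
        push @results, join ';', $id, $query->{value}, $answer->{value};
    }
});
close $fh;
print "$_\n" for @results;
```

Only one id's records are held in `@group` at any time, which is what keeps the memory footprint small.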

This way, the memory usage of your Perl program will be limited to the maximum number of lines there can be for one id number. (Of course, the sort phase will use a lot of memory, but the *nix sort utilities handle that well: they write temporary data to disk to avoid running out of memory.)
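One possible way to wire the external sort into the Perl program is to read its output through a pipe, so that the sorted data never has to sit in Perl's memory. This is only a sketch under assumptions: the field separator `;` and the numeric id in field 2 come from the sample data, `-t` and `-k` are standard sort options, and `-s` (stable sort, which keeps a Query ahead of its Answers when ids tie) is available in GNU and BSD sort but may not exist everywhere. The temporary file merely stands in for the real input.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small unsorted sample (stand-in for the real input file).
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "$_\n" for
    'Query;2;host;www.cnn.com',
    'Query;1;host;www.example.com',
    'Answer;2;ip;2.3.4.5',
    'Answer;1;ip;1.2.3.4';
close $out or die "Cannot close $path: $!";

# Byte-wise collation makes the sort locale-independent.
$ENV{LC_ALL} = 'C';

# List-form pipe open: no shell is involved, so ';' needs no quoting.
open my $sorted, '-|', 'sort', '-s', '-t', ';', '-k', '2,2n', $path
    or die "Cannot run sort: $!";

my @ids;
while (my $line = <$sorted>) {
    chomp $line;
    my (undef, $id) = split /;/, $line;
    push @ids, $id;    # a real program would process the record here
}
close $sorted or die "sort failed, exit status $?";

print "ids in sorted order: @ids\n";
```

Sorting into a separate file first and then reading that file works just as well; the pipe only saves the intermediate file.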

Sorting your large file will take quite a bit of time, but at least you're guaranteed never to exceed your system's available memory.

An alternative would be to use a database, but I doubt it would be faster.

Re^2: Memory utilization and hashes by bfdi533 (Friar) on Jan 18, 2018 at 17:34 UTC
It turns out that the unix sort was exactly the missing prior step needed to speed this up. With a correct choice of keys, the file is now in sequential order by "ID", and when a new Query comes in, it is easy to check whether the current "ID" matches the prior "ID" and, if it does not, flush any accumulated hash entries and continue. In testing, this keeps the hash to no more than 3-7 'extra' keys for each set of "ID"s in the file before the set is dumped. Memory usage has stayed small and the processing now takes approximately 1/4 of the total time of the prior runs.
Re^3: Memory utilization and hashes by poj (Abbot) on Jan 19, 2018 at 13:34 UTC
What does this sample of data you provided look like after the *nix sort?

    Query;1;host;www.example.com
    Answer;1;ip;1.2.3.4
    Query;2;host;www.cnn.com
    Query;3;host;www.google.com
    Answer;2;ip;2.3.4.5
    Answer;2;ip;2.3.4.5
    Query;4;host;www.google.com
    Answer;4;ip;3.4.5.6
    Answer;3;ip;3.4.5.6
    Query;2;host;www.example2.com
    Answer;4;ip;1.2.4.5
    Answer;2;ip;2.3.4.5
Re^4: Memory utilization and hashes by bfdi533 (Friar) on Jan 25, 2018 at 20:30 UTC
The sample data is actually missing something: the real data file also includes the date and time of each entry. Once it is sorted by date and ID, I can be sure that if the date changes and the ID changes as well, there are no more answers to be had, so I can dump the data, empty the hash and move on. The real file looks more like this once sorted:

    2018-01-25 01:01:01;Query;1;host;www.example.com
    2018-01-25 01:01:01;Answer;1;ip;1.2.3.4
    2018-01-25 01:01:05;Query;2;host;www.cnn.com
    2018-01-25 01:01:05;Answer;2;ip;2.3.4.5
    2018-01-25 01:01:05;Answer;2;ip;2.3.4.5
    2018-01-25 01:01:06;Query;3;host;www.google.com
    2018-01-25 01:01:06;Answer;3;ip;3.4.5.6
    2018-01-25 01:01:08;Query;4;host;www.google.com
    2018-01-25 01:01:08;Answer;4;ip;3.4.5.6
    2018-01-25 01:01:08;Answer;4;ip;1.2.4.5
    2018-01-25 01:01:11;Query;2;host;www.example2.com
    2018-01-25 01:01:11;Answer;2;ip;2.3.4.5
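A minimal sketch of that flush-on-change logic, under one reading of the description: a group is identified by the (timestamp, ID) pair, and the hash is dumped whenever that pair differs from the previous line's. It assumes a query and its answers share the same timestamp, as they do in the sample; the names `%pending` and `flush_group` and the report format are invented for illustration.

```perl
use strict;
use warnings;

# A few sorted lines in the "timestamp;Type;id;key;value" layout.
my @sorted = (
    '2018-01-25 01:01:05;Query;2;host;www.cnn.com',
    '2018-01-25 01:01:05;Answer;2;ip;2.3.4.5',
    '2018-01-25 01:01:06;Query;3;host;www.google.com',
    '2018-01-25 01:01:06;Answer;3;ip;3.4.5.6',
    '2018-01-25 01:01:11;Query;2;host;www.example2.com',
    '2018-01-25 01:01:11;Answer;2;ip;2.3.4.5',
);

my (@report, %pending, $prev_key);

# Emit one line per answer for the group just finished, then empty the hash.
sub flush_group {
    if (defined $pending{query}) {
        push @report, "$pending{query} -> $_" for @{ $pending{answers} || [] };
    }
    %pending = ();
    return;
}

for my $line (@sorted) {
    my ($stamp, $type, $id, undef, $value) = split /;/, $line, 5;
    my $key = "$stamp;$id";    # a group is one (timestamp, id) pair

    # The key changed, so no more answers can follow for the old group.
    flush_group() if defined $prev_key && $key ne $prev_key;
    $prev_key = $key;

    if ($type eq 'Query') {
        $pending{query} = $value;
    }
    else {
        push @{ $pending{answers} }, $value;
    }
}
flush_group();    # do not forget the last group

print "$_\n" for @report;
```

Keying on the pair rather than on the ID alone keeps the two separate queries that reuse ID 2 from being merged.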
Re^3: Memory utilization and hashes by bfdi533 (Friar) on Jan 18, 2018 at 23:46 UTC
For what it is worth, and if anyone is interested, here are some stats from the processing after I introduced the *nix sort before my Perl script.

elapsed time    | type | rows after processing | rows before processing | pct smaller | rows/second
----------------+------+-----------------------+------------------------+-------------+------------------
00:03:05.98667  | dns  | 1791555               | 4614653                | 38.82       | 24811.7405403301
00:03:50.106203 | dns  | 2262736               | 5822777                | 38.86       | 25304.737221708
00:04:51.91195  | dns  | 2733705               | 7039758                | 38.83       | 24116.0322487654
00:05:36.348691 | dns  | 3208365               | 8266995                | 38.81       | 24578.6447850335
00:06:33.947878 | dns  | 3683419               | 9490938                | 38.81       | 24091.8622234589
00:07:35.58667  | dns  | 4155971               | 10705249               | 38.82       | 23497.7221787459
00:08:25.086565 | dns  | 4633553               | 11946401               | 38.79       | 23652.1852447214
00:09:07.952743 | dns  | 5109618               | 13183845               | 38.76       | 24060.1861536808
00:10:16.250404 | dns  | 5596902               | 14441405               | 38.76       | 23434.3132373833
00:10:54.578348 | dns  | 6070888               | 15662586               | 38.76       | 23927.7483709253
00:11:39.012952 | dns  | 6547181               | 16896184               | 38.75       | 24171.4891714911
00:12:43.13814  | dns  | 7019314               | 18113219               | 38.75       | 23735.1772249255
00:13:34.23578  | dns  | 7499659               | 19365386               | 38.73       | 23783.5114541392
00:14:35.939246 | dns  | 7973633               | 20591767               | 38.72       | 23508.2137191967
00:15:12.223167 | dns  | 8448494               | 21815382               | 38.73       | 23914.5231004641
00:15:52.951662 | dns  | 8923786               | 23043433               | 38.73       | 24181.1142357817
00:17:45.637116 | dns  | 9402613               | 24278649               | 38.73       | 22783.2238906363
00:17:52.402055 | dns  | 9880079               | 25516948               | 38.72       | 23794.1990888856
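For anyone wanting to reproduce the derived columns: judging from the numbers themselves, the pct column matches rows after divided by rows before (the output as a percentage of the input), and rows/second matches rows before divided by the elapsed time. That reading is inferred from the figures, not stated in the post. A small sketch using the first row, with a hypothetical `elapsed_seconds` helper:

```perl
use strict;
use warnings;

# Convert an "HH:MM:SS.fraction" elapsed time to seconds.
sub elapsed_seconds {
    my ($h, $m, $s) = split /:/, shift;
    return $h * 3600 + $m * 60 + $s;
}

# Raw values from the first row of the stats.
my ($elapsed, $rows_after, $rows_before) = ('00:03:05.98667', 1791555, 4614653);

my $pct  = 100 * $rows_after / $rows_before;            # output as % of input
my $rate = $rows_before / elapsed_seconds($elapsed);    # input rows per second

printf "pct=%.2f rows/second=%.2f\n", $pct, $rate;
```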

