|There's more than one way to do things|
search through hash for date in a rangeby bfdi533 (Friar)
|on Mar 06, 2018 at 16:51 UTC||Need Help??|
bfdi533 has asked for the
wisdom of the Perl Monks concerning the following question:
I have log files that I am processing every hour and they grow from the beginning of the day to the end of the day and typically are in the millions of lines per log by the end of the day. I have previously been doing a cmp to get just the new lines and keeping the previous file to check against. Still, I am running into duplicates and need to prevent that. As I am already keeping the start and end dates from the log in a DB, I am trying to switch to checking the dates in the log against what has already been loaded.
To do this, I am building a hash from the a DB query that stores the ID, start and end dates. The dates are stored in the hash by converting them to seconds since 1970 to make comparisons faster. However, I think the sub that checks for a date, though straight-forward, is way too slow to check millions of log lines against and run in a rapid manner. The sub in question is check_date_in_range. I thought that only calculating the array of sorted keys would speed things up so I am "caching" it but that does not appear to be the slowdown ...
Obviously, the DB code is left out and the building of the ranges_dt hash is somewhat separate and the logic to process the log file is rather complicated. But this is a working code set uses the sub so you can see what I am trying to do.
Here is the sample and complete working code.
Is there a better way to do the range check?
Something more efficient that will be very fast?
Update: I should note that the @vkeys array is just under 500 items that need to be checked for every date in question. Hence, the need to speed this up.