Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change

search through hash for date in a range

by bfdi533 (Friar)
on Mar 06, 2018 at 11:51 UTC ( #1210417=perlquestion: print w/replies, xml ) Need Help??
bfdi533 has asked for the wisdom of the Perl Monks concerning the following question:

I have log files that I am processing every hour and they grow from the beginning of the day to the end of the day and typically are in the millions of lines per log by the end of the day. I have previously been doing a cmp to get just the new lines and keeping the previous file to check against. Still, I am running into duplicates and need to prevent that. As I am already keeping the start and end dates from the log in a DB, I am trying to switch to checking the dates in the log against what has already been loaded.

To do this, I am building a hash from the a DB query that stores the ID, start and end dates. The dates are stored in the hash by converting them to seconds since 1970 to make comparisons faster. However, I think the sub that checks for a date, though straight-forward, is way too slow to check millions of log lines against and run in a rapid manner. The sub in question is check_date_in_range. I thought that only calculating the array of sorted keys would speed things up so I am "caching" it but that does not appear to be the slowdown ...

Obviously, the DB code is left out and the building of the ranges_dt hash is somewhat separate and the logic to process the log file is rather complicated. But this is a working code set uses the sub so you can see what I am trying to do.

Here is the sample and complete working code.

#!/usr/bin/env perl use Date::Manip::Date; use warnings; use strict; $|++; my @vkeys; my $dmd = new Date::Manip::Date; my $cmp_dt = new Date::Manip::Date; my %ranges_dt; # ================================================== sub check_date_in_range { my ($value, %h) = @_; $value =~ s/-/ /; $cmp_dt->parse($value); my $cvalue = $cmp_dt->secs_since_1970_GMT; if (!@vkeys) { @vkeys = sort keys %h; } foreach my $k (@vkeys) { if (($cvalue >= $h{$k}{start}) && ($cvalue <= $h{$k}{end})) { return $k; } } return; } while (<DATA>) { chomp; if (s/^(\w+)://) { my $cat = $1; if ($cat eq "R") { my ($i, $s, $e) = split ','; $dmd->parse($s); $ranges_dt{$i}{start} = $dmd->secs_since_1970_GMT; $dmd->parse($e); $ranges_dt{$i}{end} = $dmd->secs_since_1970_GMT; } else { my $rangeid = check_date_in_range($_, %ranges_dt); if (!defined $rangeid) { print "No range found for $_\n"; } else { print "Found range: $rangeid\n"; } } } } __DATA__ R:1,2018-03-06 14:20:00,2018-03-06 14:30:00 R:2,2018-03-06 13:00:00,2018-03-06 13:40:00 R:3,2018-03-06 13:45:00,2018-03-06 13:50:00 D:03/06/2018-14:29:41 D:03/06/2018-13:33:38 D:03/06/2018-13:54:47 D:03/06/2018-12:53:34 D:03/06/2018-13:29:19 D:03/06/2018-12:52:47 D:03/06/2018-14:21:51 D:03/06/2018-13:49:20 D:03/06/2018-13:36:18 D:03/06/2018-13:44:25

Is there a better way to do the range check?

Something more efficient that will be very fast?

Update: I should note that the @vkeys array is just under 500 items that need to be checked for every date in question. Hence, the need to speed this up.

Replies are listed 'Best First'.
Re: search through hash for date in a range
by choroba (Bishop) on Mar 06, 2018 at 12:09 UTC
    If the timestamp formats are always the same and governed by the category, you can use a simpler module to handle the conversions and specify the formats explicitly:
    use Time::Piece; my @vkeys; my %ranges_dt; # ================================================== sub check_date_in_range { my ($value, %h) = @_; my $cvalue = Time::Piece->strptime($value, '%m/%d/%Y-%H:%M:%S')->e +poch; if (!@vkeys) { @vkeys = sort keys %h; } foreach my $k (@vkeys) { if (($cvalue >= $h{$k}{start}) && ($cvalue <= $h{$k}{end})) { return $k; } } return; } while (<DATA>) { chomp; if (s/^(\w+)://) { my $cat = $1; if ($cat eq "R") { my ($i, $s, $e) = split ','; $ranges_dt{$i}{start} = Time::Piece->strptime($s, '%Y-%m-% +d %H:%M:%S')->epoch; $ranges_dt{$i}{end} = Time::Piece->strptime($e, '%Y-%m-%d +%H:%M:%S')->epoch; } else { my $rangeid = check_date_in_range($_, %ranges_dt); if (!defined $rangeid) { print "No range found for $_\n"; } else { print "Found range: $rangeid\n"; } } } }

    Another option would be not to store the timestamps in a hash and sort them every time a new one arrives, but use a data structure that can add a new element and quickly search for the closest members (binary search tree, for example).

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Boy, there is a huge difference in performance between Date::Manip::Date and Time::Piece ...

      [user@host newload]$ ./ 500 Benchmark: timing 500 iterations of Date::Manip::Date, Time::Piece... Date::Manip::Date: 124 wallclock secs (119.35 usr + 1.30 sys = 120.65 + CPU) @ 4.14/s (n=500) Time::Piece: 5 wallclock secs ( 3.88 usr + 0.03 sys = 3.91 CPU) @ 1 +27.88/s (n=500)

      Certainly a code change I will be making but that leaves the foreach loop through all available ranges as a massive performance hit still ...

      I assume your recommendation for Time::Piece is due to better performance?

      Otherwise, not sure why I would change libraries from something I know works.

      As to sorting the keys each time, I already reduced that workload to a cached array with the code:

      if (!@vkeys) { @vkeys = sort keys %h; }

      Only one sort is being done the first time the sub is run ...

Re: search through hash for date in a range
by hippo (Canon) on Mar 06, 2018 at 12:19 UTC

    Alternatively, why not just store the line number of the last line processed between runs? It will be much faster to just discard the first N lines on the next run than to do date-based processing on them.

      That is a good question.

      The reason is that the logs are uploaded to a central server for processing and are contained in a .tgz file. Once opened up, there is no guarantee that each hour being processed by the loader script is in sequential order and the processing of the log files is done in parallel in a queue with work handed out to child processing scripts. So, the DB is the only point of reference. Also, storing a byte count would not work when the files are rotated (supposedly once a day) and the file starts over at 0 (but on some systems the logrotate does not always happen so those files just keep growing).

      It could be that hour 3, 5 and 6 are processed and THEN hour 4 gets processed so again the byte count from hour 6 will not be applicable to hour 4. But with hour 4 having been included in the hour 5 and hour 6 upload, there is no need to actually pull out and load the data but without checking the dates for inclusion in the range, there is no way to guarantee this.

      Given all of this, dates are the only reliable means of knowing data has been processed before.

      But, even with being able to process only dates after a byte skip, I would still need to process the dates in the range. And there are a lot of them each hour on some of the busier systems.

      Hence, the original question, how to make the lookup faster/more efficient?

        Can the ranges overlap each other ? In the 3 shown as example they don't.

Re: search through hash for date in a range
by Laurent_R (Canon) on Mar 07, 2018 at 02:32 UTC
    I'm not sure if this applies in your specific case, but when dealing with ranges, I found that using a hash is usually not very efficient because you usually end up making sequential search through the hash keys, thereby annihilating the advantage of fast hash lookup. Using an array of sorted ranges and a binary search is often much faster in such cases.
Re: search through hash for date in a range
by Anonymous Monk on Mar 07, 2018 at 07:54 UTC
    Also, I can't really believe that this I/O bound procedure would in any way be impacted by the number of nanoseconds required to manipulate the dates.
      Also, I can't really believe that this I/O bound procedure would in any way be impacted by the number of nanoseconds required to manipulate the dates.

      Never underestimate the impact of a large number of subroutine calls.

      Given the OP's followup, it seems there's a lot going on in the original code.

      Quantum Mechanics: The dreams stuff is made of

        Yes, there is a lot going on. Details on the problem and discussion on solutions are in an earlier post from me:

        But, introducing the has range lookup was Terrible for the performance as it meant I had to call a sub, search through about 500 hash items, return if found; AND it had to do this for between 250,000 and 500,000 items each run. If you look at my stats output, you will see this was taking about (edited numbers to reflect actual data) an average of 120 seconds per 1000.

        Changing to the array lookup was a HUGE increase in speed knocking the per 1000 time to 5-7 seconds.

        With all of the other stuff going on, it is a performance nightmare ... Every little bit counts.

Re: search through hash for date in a range
by pwagyi (Scribe) on Mar 22, 2018 at 23:24 UTC

    Is %ranges_dt (reference data from DB) once loaded unchanged throughout processing log data? It doesn't seem to be in the code.

    A reply falls below the community's threshold of quality. You may see it by logging in.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1210417]
Front-paged by stevieb
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (4)
As of 2018-12-10 01:07 GMT
Find Nodes?
    Voting Booth?
    How many stories does it take before you've heard them all?

    Results (47 votes). Check out past polls.

    • (Sep 10, 2018 at 18:53 UTC) Welcome new users!