Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things

search through hash for date in a range

by bfdi533 (Friar)
on Mar 06, 2018 at 16:51 UTC ( #1210417=perlquestion: print w/replies, xml ) Need Help??
bfdi533 has asked for the wisdom of the Perl Monks concerning the following question:

I have log files that I am processing every hour and they grow from the beginning of the day to the end of the day and typically are in the millions of lines per log by the end of the day. I have previously been doing a cmp to get just the new lines and keeping the previous file to check against. Still, I am running into duplicates and need to prevent that. As I am already keeping the start and end dates from the log in a DB, I am trying to switch to checking the dates in the log against what has already been loaded.

To do this, I am building a hash from the a DB query that stores the ID, start and end dates. The dates are stored in the hash by converting them to seconds since 1970 to make comparisons faster. However, I think the sub that checks for a date, though straight-forward, is way too slow to check millions of log lines against and run in a rapid manner. The sub in question is check_date_in_range. I thought that only calculating the array of sorted keys would speed things up so I am "caching" it but that does not appear to be the slowdown ...

Obviously, the DB code is left out and the building of the ranges_dt hash is somewhat separate and the logic to process the log file is rather complicated. But this is a working code set uses the sub so you can see what I am trying to do.

Here is the sample and complete working code.

#!/usr/bin/env perl use Date::Manip::Date; use warnings; use strict; $|++; my @vkeys; my $dmd = new Date::Manip::Date; my $cmp_dt = new Date::Manip::Date; my %ranges_dt; # ================================================== sub check_date_in_range { my ($value, %h) = @_; $value =~ s/-/ /; $cmp_dt->parse($value); my $cvalue = $cmp_dt->secs_since_1970_GMT; if (!@vkeys) { @vkeys = sort keys %h; } foreach my $k (@vkeys) { if (($cvalue >= $h{$k}{start}) && ($cvalue <= $h{$k}{end})) { return $k; } } return; } while (<DATA>) { chomp; if (s/^(\w+)://) { my $cat = $1; if ($cat eq "R") { my ($i, $s, $e) = split ','; $dmd->parse($s); $ranges_dt{$i}{start} = $dmd->secs_since_1970_GMT; $dmd->parse($e); $ranges_dt{$i}{end} = $dmd->secs_since_1970_GMT; } else { my $rangeid = check_date_in_range($_, %ranges_dt); if (!defined $rangeid) { print "No range found for $_\n"; } else { print "Found range: $rangeid\n"; } } } } __DATA__ R:1,2018-03-06 14:20:00,2018-03-06 14:30:00 R:2,2018-03-06 13:00:00,2018-03-06 13:40:00 R:3,2018-03-06 13:45:00,2018-03-06 13:50:00 D:03/06/2018-14:29:41 D:03/06/2018-13:33:38 D:03/06/2018-13:54:47 D:03/06/2018-12:53:34 D:03/06/2018-13:29:19 D:03/06/2018-12:52:47 D:03/06/2018-14:21:51 D:03/06/2018-13:49:20 D:03/06/2018-13:36:18 D:03/06/2018-13:44:25

Is there a better way to do the range check?

Something more efficient that will be very fast?

Update: I should note that the @vkeys array is just under 500 items that need to be checked for every date in question. Hence, the need to speed this up.

Replies are listed 'Best First'.
Re: search through hash for date in a range
by choroba (Bishop) on Mar 06, 2018 at 17:09 UTC
    If the timestamp formats are always the same and governed by the category, you can use a simpler module to handle the conversions and specify the formats explicitly:
    use Time::Piece; my @vkeys; my %ranges_dt; # ================================================== sub check_date_in_range { my ($value, %h) = @_; my $cvalue = Time::Piece->strptime($value, '%m/%d/%Y-%H:%M:%S')->e +poch; if (!@vkeys) { @vkeys = sort keys %h; } foreach my $k (@vkeys) { if (($cvalue >= $h{$k}{start}) && ($cvalue <= $h{$k}{end})) { return $k; } } return; } while (<DATA>) { chomp; if (s/^(\w+)://) { my $cat = $1; if ($cat eq "R") { my ($i, $s, $e) = split ','; $ranges_dt{$i}{start} = Time::Piece->strptime($s, '%Y-%m-% +d %H:%M:%S')->epoch; $ranges_dt{$i}{end} = Time::Piece->strptime($e, '%Y-%m-%d +%H:%M:%S')->epoch; } else { my $rangeid = check_date_in_range($_, %ranges_dt); if (!defined $rangeid) { print "No range found for $_\n"; } else { print "Found range: $rangeid\n"; } } } }

    Another option would be not to store the timestamps in a hash and sort them every time a new one arrives, but use a data structure that can add a new element and quickly search for the closest members (binary search tree, for example).

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Boy, there is a huge difference in performance between Date::Manip::Date and Time::Piece ...

      [user@host newload]$ ./ 500 Benchmark: timing 500 iterations of Date::Manip::Date, Time::Piece... Date::Manip::Date: 124 wallclock secs (119.35 usr + 1.30 sys = 120.65 + CPU) @ 4.14/s (n=500) Time::Piece: 5 wallclock secs ( 3.88 usr + 0.03 sys = 3.91 CPU) @ 1 +27.88/s (n=500)

      Certainly a code change I will be making but that leaves the foreach loop through all available ranges as a massive performance hit still ...

      I assume your recommendation for Time::Piece is due to better performance?

      Otherwise, not sure why I would change libraries from something I know works.

      As to sorting the keys each time, I already reduced that workload to a cached array with the code:

      if (!@vkeys) { @vkeys = sort keys %h; }

      Only one sort is being done the first time the sub is run ...

Re: search through hash for date in a range
by hippo (Canon) on Mar 06, 2018 at 17:19 UTC

    Alternatively, why not just store the line number of the last line processed between runs? It will be much faster to just discard the first N lines on the next run than to do date-based processing on them.

      That is a good question.

      The reason is that the logs are uploaded to a central server for processing and are contained in a .tgz file. Once opened up, there is no guarantee that each hour being processed by the loader script is in sequential order and the processing of the log files is done in parallel in a queue with work handed out to child processing scripts. So, the DB is the only point of reference. Also, storing a byte count would not work when the files are rotated (supposedly once a day) and the file starts over at 0 (but on some systems the logrotate does not always happen so those files just keep growing).

      It could be that hour 3, 5 and 6 are processed and THEN hour 4 gets processed so again the byte count from hour 6 will not be applicable to hour 4. But with hour 4 having been included in the hour 5 and hour 6 upload, there is no need to actually pull out and load the data but without checking the dates for inclusion in the range, there is no way to guarantee this.

      Given all of this, dates are the only reliable means of knowing data has been processed before.

      But, even with being able to process only dates after a byte skip, I would still need to process the dates in the range. And there are a lot of them each hour on some of the busier systems.

      Hence, the original question, how to make the lookup faster/more efficient?

        Can the ranges overlap each other ? In the 3 shown as example they don't.

Re: search through hash for date in a range
by Laurent_R (Canon) on Mar 07, 2018 at 07:32 UTC
    I'm not sure if this applies in your specific case, but when dealing with ranges, I found that using a hash is usually not very efficient because you usually end up making sequential search through the hash keys, thereby annihilating the advantage of fast hash lookup. Using an array of sorted ranges and a binary search is often much faster in such cases.
Re: search through hash for date in a range
by Anonymous Monk on Mar 07, 2018 at 12:54 UTC
    Also, I can't really believe that this I/O bound procedure would in any way be impacted by the number of nanoseconds required to manipulate the dates.
      Also, I can't really believe that this I/O bound procedure would in any way be impacted by the number of nanoseconds required to manipulate the dates.

      Never underestimate the impact of a large number of subroutine calls.

      Given the OP's followup, it seems there's a lot going on in the original code.

      Quantum Mechanics: The dreams stuff is made of

        Yes, there is a lot going on. Details on the problem and discussion on solutions are in an earlier post from me:

        But, introducing the has range lookup was Terrible for the performance as it meant I had to call a sub, search through about 500 hash items, return if found; AND it had to do this for between 250,000 and 500,000 items each run. If you look at my stats output, you will see this was taking about (edited numbers to reflect actual data) an average of 120 seconds per 1000.

        Changing to the array lookup was a HUGE increase in speed knocking the per 1000 time to 5-7 seconds.

        With all of the other stuff going on, it is a performance nightmare ... Every little bit counts.

Re: search through hash for date in a range
by pwagyi (Beadle) on Mar 23, 2018 at 03:24 UTC

    Is %ranges_dt (reference data from DB) once loaded unchanged throughout processing log data? It doesn't seem to be in the code.

Re: search through hash for date in a range
by sundialsvc4 (Abbot) on Mar 07, 2018 at 17:39 UTC

    I’m sorry that I don’t right now have time to flesh this brainstorm out with a full code example, but it should be obvious ...

    A very good approach would use a data structure such as Tree::RB to store entries, keyed by start-time.   Each node is a 2-tuple hash which contains the start and the end time.   (As you initially populate the tree from the database, you specifically look for overlaps and remove them, replacing them with the longer ranges so that the tree finally contains only the longest possible ranges.)

    N.B. a general algorithm for “overlap” is simply:

     ($node->start_value <= $end_value_of_interest) && ($node->end_value >= $start_value_of_interest)

    (Use any sort of convenient “orderable-number representation of a time/date” that is helpfully provided by your Perl time-piece of choice.)

    I fingered this particular CPAN module because its lookup() method contains a mode parameter which supports inexact lookups.   The specific mode that I am looking for is called LULTEQ:

    Returns the node exactly matching the specified key.   If this is not found then the next node that is less than the specified key is returned.

    Retrieve this key, whatever it is, and then check to be sure that the time-range that you are looking at falls entirely within – or, does not intersect? – the time-range represented by that node.

    As a further simple optimization, then stash that object-reference that you last retrieved ... whether or not it was successful ... and check it first.   This will avoid tree-lookups in a majority of cases because log-file entries are consecutive.

      While considering this suggestion further, I would point out that the process for collapsing nodes so that the tree includes only the longest possible ranges needs to carefully consider both sides.   Specifically:

      • First, search for the node which contains the largest low-value that is less than or equal to the specified key – using the inexact-search as noted previously.
      • If the current date-range overlaps this one, add this key to a list of nodes that will momentarily need to be removed, and note what the new date-range will be ... so far.
      • Now, use iter(high_value) to create an iterator that will allow you to walk forward from that high-value key.   Check each node to see if it also overlaps the new range. If so, expand the range and (to avoid disturbing the iterator) push these keys onto the bucket-list as well.
      • Once the iterator no longer produces overlapping ranges, run the bucket-list to remove all of the now-obsolete keys (a step that we deferred to avoid impact to the iterator), and insert one node containing the expanded range.   The red/black tree algorithm will efficiently keep the tree balanced.   (Notice that you cannot update the nodes in-place.)

      Now, all of this is simply describing an implementation of a “range-set” type.   I surfed CPAN for that alternative first but did not readily find one.   It would, however, be an excellent candidate for CPAN inclusion.   Did anyone find the package that I was looking for?

      A second alternative worth considering in cases like this is MRU – maintain a simple array of ranges, search it sequentially, and then remove the range that you found from the list and shift it onto the front of the list so that the most-recently found entries will be found first.   But the OP indicated that his situation was extremely performance-critical, hence the suggestion of a slightly more exotic data structure to drive the search in O(log2) time.   The actual frequency-distribution of the date values in the log-data would drive this decision:   is the right-answer likely to be, simply, most-recently found?

        All these words but no code. This post could have been probably cut in half size-wise with even a commented pseudo code implementation of whatever you're going on about.

        Three thousand years of beautiful tradition, from Moses to Sandy Koufax, you're god damn right I'm living in the fucking past

        "While considering this suggestion further... alternative worth considering in cases..."

        Ja, ja. But where is some god damn working code (no pseudo, no wrong comments etc.)? Just real plain Perl. Something that compiles and works, you know...

        «The Crux of the Biscuit is the Apostrophe»

        perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1210417]
Front-paged by stevieb
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (8)
As of 2018-06-21 10:10 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (118 votes). Check out past polls.