Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re: Using binary search to get the last 15 minutes of httpd access log

by davido (Archbishop)
on Aug 03, 2012 at 18:44 UTC ( #985307=note: print w/ replies, xml ) Need Help??


in reply to Using binary search to get the last 15 minutes of httpd access log

I'm a fan of the Binary Search (my implementation also borrows heavily from Mastering Algorithms with Perl), but I'm not sure it's the best tool for this job. You're looking for everything from 15 minutes back, through the present (end of the file). File::ReadBackwards seems like a better algorithm. On the one hand it's a linear search from the end of file to the 15 minute mark. On the other hand, even if you're using a binary search to find that 15 minute mark, you still have the linear operation of retrieving everything from the EOF to that 15 minute mark for processing.

May as well just read the file backwards, and stop when you find the record that is more than 15 minutes old. That way instead of a O(log n) plus an O(n) operation (the search, and the retrieve), you have a single O(n) operation (the retrieve, stopping when you get to the mark). But more importantly, you eliminate a ton of complexity.

Quoting from Wikipedia:

When Jon Bentley assigned it as a problem in a course for professional programmers, he found that an astounding ninety percent failed to code a binary search correctly after several hours of working on it, and another study shows that accurate code for it is only found in five out of twenty textbooks. Furthermore, Bentley's own implementation of binary search, published in his 1986 book Programming Pearls, contains an error that remained undetected for over twenty years.


Dave


Comment on Re: Using binary search to get the last 15 minutes of httpd access log
Re^2: Using binary search to get the last 15 minutes of httpd access log
by mcdave (Beadle) on Aug 03, 2012 at 21:19 UTC
    Very minor quibble, and not about Perl. But Apache creates the the timestamp from when the request is received yet writes to the file when the request is served. So, if you've got requests that take a long time to return, or lots of requests that take a very short time, you logs won't be ordered by time.

    So, if it's 12:00, and A is a long-running request that started at 11:44 and B is a short-running request, the order of events could be

    11:44:00... A received 11:45:10... B receievd 11:45:11... B served 11:46:00... A served
    and the log will read
    11:45:11 B 11:44:00 A
    in that order, so your backward search would stop at A.

    If "last 15 minutes" means "approximately last 15 minutes" and your server is relatively zippy, you're fine. I have sometimes needed 15 minutes to mean "exactly 15 minutes", though, so I happen to know this trivial about Apache.

      That is a valid point which I considered too. But this script will run on a loghost. And if I remember correctly... the timestamp will be applied to the messages as they arrive to the loghost.
        Here is the code I ended up using to search the ordered apache logs for a specific time block.
        #!/usr/bin/env perl use strict; use Search::Dict; open my $fh, "/var/log/http/access_log"; my $start = 'Oct 2 10:21:'; my $end = 'Oct 2 10:2[1-2]:'; look $fh, $start; while (my $line = <$fh>) { last if ($line !~ /$end/); print $line; }
        Also, looks like I should have done a more thorough search

      That being the case there's virtually no way to assure a slow request isn't lurking in the past without scanning back 15 minutes plus some known timeout interval. Still reading backward, plus a timeout interval is probably more reliable than doing a binary search on a file that may not be in correct order. Binary searches aren't too good with fuzzy sortedness. ;)


      Dave

Re^2: Using binary search to get the last 15 minutes of httpd access log
by mhearse (Hermit) on Aug 03, 2012 at 22:01 UTC
    Thanks for your reply. I have taken your advice and am reading the access logs from the bottom up. But I am also going to write a script which can parse a given time interval from the access logs. And I believe a binary search is the perfect solution. For my own benefit I'm going to roll my own binary search for this script. I'll post it here when it's completed.

      As I see it 'seek( $file, $mid, 0 ) || die "Couldn't seek : $!\n";' won't consistently take you to the beginning of a record. But then I'm just a clueless newbie.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://985307]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (8)
As of 2014-12-25 23:52 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (163 votes), past polls