PerlMonks  

Using binary search to get the last 15 minutes of httpd access log

by mhearse (Chaplain)
on Aug 03, 2012 at 17:42 UTC ( [id://985296] )

mhearse has asked for the wisdom of the Perl Monks concerning the following question:

I'm attempting to get the last 15 minutes of entries from an httpd access log. I immediately think of a binary search. The problem is, I've never written one. I've attempted to use this code example from the book Mastering Algorithms with Perl, but it doesn't work for my purpose. Can someone offer help? I'm assuming the best way to search would be based on the date string: perl binary_search_file.pl "Aug 3 07:59:59"
#!/usr/bin/perl -w
use strict;
use integer;

my ( $word, $file ) = @ARGV;
open( FILE, $file ) or die "Can't open $file: $!";
my $position = binary_search_file( \*FILE, $word );
if ( defined $position ) { print "$word occurs at position $position\n" }
else                     { print "$word does not occur in $file.\n" }

sub binary_search_file {
    my ( $file, $word ) = @_;
    my ( $high, $low, $mid, $mid2, $line );
    $low  = 0;                     # Guaranteed to be the start of a line.
    $high = ( stat($file) )[7];    # Might not be the start of a line.
    $word =~ s/\W//g;              # Remove punctuation from $word.
    $word = lc($word);             # Convert $word to lower case.

    while ( $high != $low ) {
        $mid = ( $high + $low ) / 2;
        seek( $file, $mid, 0 ) || die "Couldn't seek : $!\n";

        # $mid is probably in the middle of a line, so read the rest
        # and set $mid2 to that new position.
        $line = <$file>;
        $mid2 = tell($file);
        if ( $mid2 < $high ) {    # We're not near file's end, so read on.
            $mid  = $mid2;
            $line = <$file>;
        }
        else {    # $mid plunked us in the last line, so linear search.
            seek( $file, $low, 0 ) || die "Couldn't seek: $!\n";
            while ( defined( $line = <$file> ) ) {
                last if compare( $line, $word ) >= 0;
                $low = tell($file);
            }
            last;
        }
        if ( compare( $line, $word ) < 0 ) { $low = $mid }
        else                               { $high = $mid }
    }
    return if compare( $line, $word );
    return $low;
}

sub compare {    # $word1 needs to be lowercased; $word2 doesn't.
    my ( $word1, $word2 ) = @_;
    $word1 =~ s/\W//g;
    $word1 = lc($word1);
    return $word1 cmp $word2;
}

Replies are listed 'Best First'.
Re: Using binary search to get the last 15 minutes of httpd access log
by davido (Cardinal) on Aug 03, 2012 at 18:44 UTC

    I'm a fan of the Binary Search (my implementation also borrows heavily from Mastering Algorithms with Perl), but I'm not sure it's the best tool for this job. You're looking for everything from 15 minutes back through the present (end of the file). File::ReadBackwards seems like a better fit. On the one hand, it's a linear search from the end of the file to the 15-minute mark. On the other hand, even if you use a binary search to find that 15-minute mark, you still have the linear operation of retrieving everything from the EOF back to that mark for processing.

    May as well just read the file backwards, and stop when you find the first record that is more than 15 minutes old. That way, instead of an O(log n) plus an O(n) operation (the search, and the retrieve), you have a single O(n) operation (the retrieve, stopping when you reach the mark). But more importantly, you eliminate a ton of complexity.
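    A minimal sketch of that approach using File::ReadBackwards. The helper name tail_since and the timestamp callback are hypothetical: a real access log would need its [03/Aug/2012:17:42:00 -0500] field parsed into epoch seconds (e.g. with Time::Local), which is why the parser is passed in as a parameter here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::ReadBackwards;

# Return the log lines whose timestamp is >= $cutoff (epoch seconds),
# reading from the end of the file and stopping at the first older line.
# $parse_time is a callback that extracts an epoch time from a log line;
# its implementation depends on your LogFormat, so it's a parameter here.
sub tail_since {
    my ( $logfile, $cutoff, $parse_time ) = @_;
    my $bw = File::ReadBackwards->new($logfile)
        or die "Can't open $logfile: $!";
    my @recent;
    while ( defined( my $line = $bw->readline ) ) {
        my $t = $parse_time->($line);
        last if defined $t && $t < $cutoff;    # older than the mark: stop
        unshift @recent, $line;                # keep chronological order
    }
    return @recent;
}
```

    For the actual task you'd call something like tail_since( $access_log, time() - 15 * 60, \&parse_apache_time ).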

    Quoting from Wikipedia:

    When Jon Bentley assigned it as a problem in a course for professional programmers, he found that an astounding ninety percent failed to code a binary search correctly after several hours of working on it, and another study shows that accurate code for it is only found in five out of twenty textbooks. Furthermore, Bentley's own implementation of binary search, published in his 1986 book Programming Pearls, contains an error that remained undetected for over twenty years.


    Dave

      Very minor quibble, and not about Perl. But Apache creates the timestamp when the request is received, yet writes to the file when the request is served. So, if you've got requests that take a long time to return, or lots of requests that take a very short time, your logs won't be ordered by time.

      So, if it's 12:00, and A is a long-running request that started at 11:44 and B is a short-running request, the order of events could be

      11:44:00 ... A received
      11:45:10 ... B received
      11:45:11 ... B served
      11:46:00 ... A served
      and the log will read
      11:45:11 B
      11:44:00 A
      in that order, so your backward search would stop at A.

      If "last 15 minutes" means "approximately the last 15 minutes" and your server is relatively zippy, you're fine. I have sometimes needed 15 minutes to mean "exactly 15 minutes", though, so I happen to know this bit of trivia about Apache.

        That being the case, there's virtually no way to ensure a slow request isn't lurking in the past without scanning back 15 minutes plus some known timeout interval. Still, reading backward through 15 minutes plus a timeout interval is probably more reliable than doing a binary search on a file that may not be in correct order. Binary searches aren't too good with fuzzy sortedness. ;)


        Dave

        That is a valid point, which I considered too. But this script will run on a loghost. And if I remember correctly, the timestamp is applied to the messages as they arrive at the loghost.
      Thanks for your reply. I have taken your advice and am reading the access logs from the bottom up. But I am also going to write a script which can parse a given time interval from the access logs. And I believe a binary search is the perfect solution. For my own benefit I'm going to roll my own binary search for this script. I'll post it here when it's completed.

        As I see it 'seek( $file, $mid, 0 ) || die "Couldn't seek : $!\n";' won't consistently take you to the beginning of a record. But then I'm just a clueless newbie.
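        For what it's worth, a sketch of such a timestamp-based binary search that also sidesteps the record-boundary worry above: after every seek, the code discards the partial line it landed in, so each comparison starts at a real record. The function name and the timestamp-extraction callback are hypothetical; the callback stands in for parsing whatever your LogFormat actually produces.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Find the byte offset of the first log line whose timestamp is >= $target.
# $get_time is a callback that extracts a comparable (epoch) time from a
# line; it is a placeholder for parsing your actual log format.
sub find_offset {
    my ( $fh, $target, $get_time ) = @_;
    my ( $low, $high ) = ( 0, -s $fh );
    while ( $high - $low > 1 ) {
        my $mid = int( ( $low + $high ) / 2 );
        seek $fh, $mid, 0 or die "seek: $!";
        <$fh>;                      # discard the partial line we landed in
        my $line = <$fh>;
        if ( !defined $line or $get_time->($line) >= $target ) {
            $high = $mid;           # answer is at or before this region
        }
        else {
            $low = $mid;            # this line is too old; look later
        }
    }

    # Finish with a short linear scan from a known-good position.
    seek $fh, $low, 0 or die "seek: $!";
    <$fh> if $low > 0;              # realign to a line boundary
    my $pos = tell $fh;
    while ( defined( my $line = <$fh> ) ) {
        return $pos if $get_time->($line) >= $target;
        $pos = tell $fh;
    }
    return $pos;                    # EOF: no line that recent
}
```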

Re: Using binary search to get the last 15 minutes of httpd access log
by flexvault (Monsignor) on Aug 04, 2012 at 09:04 UTC

    mhearse,

    If you are doing this once, then reading the whole file is fine, but if you want to run the script every 'nn' minutes, then saving the location of the end-of-file would be a good solution. This is similar to the 'pop-before-smtp' technique. (Example, not tested)

    my $File_Loc    = 0;
    my $SaveLocFile = "./FileLoc";
    my $log_file    = "/var/log/httpd/access_log";    # path to your log file

    if ( -e $SaveLocFile ) {
        open( my $IN, "<", $SaveLocFile ) or die "File Bad $!";
        $File_Loc = <$IN>;
        chomp($File_Loc);
        close $IN;
    }

    my $sz = ( -s $log_file );    # current size of the log file
    if ( $sz > 0 ) {
        open( my $log, "<", $log_file ) or die "Can't open $log_file: $!";
        if ( $sz < $File_Loc ) { $File_Loc = 0; }    # New (rotated) log file
        seek $log, $File_Loc, 0;                     # Move to old end of file
        while (<$log>) {
            chomp;
            my $line = $_;
            # do work
        }
        $File_Loc = tell $log;    # Save current end-of-file location
        open( my $OUT, ">", $SaveLocFile ) or die "Can't write $SaveLocFile: $!";
        print $OUT "$File_Loc\n";
        close $OUT;
        close $log;
    }

    Since you will be running on a log server, you should use the same locking mechanism as the log daemon. There are some race conditions in the code, but if you save where you ended, you shouldn't miss anything. I originally wrote this in 1997, so I tried to update the technique. Looking at the original code, it isn't the way I would write it today, but it has worked on some 40 - 50 mail servers since, and I never had to rewrite it (probably should now).

    Hopefully you see the idea and can adapt this to your situation.

    Good Luck

    "Well done is better than well said." - Benjamin Franklin

Re: Using binary search to get the last 15 minutes of httpd access log
by Anonymous Monk on Aug 03, 2012 at 22:50 UTC
    Aww, "to hell with it..." just read the file from the start. You'd spend more time futzing around with trying to find your place than you'd spend just reading every record and throwing out the ones you don't want. Sometimes, "brute force" is EXACTLY what the doctor ordered...
      I had to solve the same problem (for Apache logs, too) a few years back. Brute force is fine for a small log, but the logs I was parsing were growing at a gigabyte+ per minute. (We rolled logs every 100 GB or 30 minutes, whichever came first.)

      Pseudo code:

      set the current size of the log (end point)
      seek to the mid-position (size/2, begin point)
      read forward from the begin point until a timestamp is found
      if the timestamp is within 5 minutes of the current time,
          process sequentially to the end of the log and exit
      else
          reset the begin and end points and try again
      This gimmick ran (most of the time) in under 500 milliseconds, and gave us enough information. The Perl implementation was fast enough (most times) that we never got around to implementing it in C. You can run into problems with slow-growing logs (what happens if there is only one line in the file?), and humongous lines (again, only one line in the file, and it's 55 MB long!). We got around it by fiat -- if something goes sour, quit, and retry again in 30 seconds. (Yahoo, Instant Messenger, three to four terabytes of logs per day....)
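      A hedged Perl rendering of that pseudocode: probe by halving file offsets, realign to a line boundary after each seek, and once the cutoff point is bracketed, process sequentially to the end of the log. The function name, the numeric cutoff, and the timestamp callback are all placeholders for a real deployment's log format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return every line from the first one with timestamp >= $cutoff through
# EOF. $get_time is a callback extracting a comparable (epoch) time from
# a line -- a stand-in for parsing the real log format.
sub recent_tail {
    my ( $fh, $cutoff, $get_time ) = @_;
    my ( $low, $high ) = ( 0, -s $fh );
    while ( $high - $low > 1 ) {
        my $mid = int( ( $low + $high ) / 2 );
        seek $fh, $mid, 0 or die "seek: $!";
        <$fh>;                               # skip the partial line
        my $line = <$fh>;
        if ( !defined $line or $get_time->($line) >= $cutoff ) {
            $high = $mid;                    # recent enough: look earlier
        }
        else {
            $low = $mid;                     # too old: look later
        }
    }

    # Process sequentially from a known line start to the end of the log.
    seek $fh, $low, 0 or die "seek: $!";
    <$fh> if $low > 0;                       # realign to a line boundary
    my @out;
    while ( defined( my $line = <$fh> ) ) {
        push @out, $line if $get_time->($line) >= $cutoff;
    }
    return @out;
}
```

      The "by fiat" guard from the reply above would wrap this call: if it takes too long or returns nothing sensible, give up and retry in 30 seconds.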

      ----
      I Go Back to Sleep, Now.

      OGB

Re: Using binary search to get the last 15 minutes of httpd access log
by sundialsvc4 (Abbot) on Aug 04, 2012 at 17:54 UTC

    I think that it is categorically a good idea to keep your individual files to hundreds of megabytes, or maybe a couple of gigabytes, in size. Rotate them frequently on size and compress them; the logrotate tool is great for all that. A single, contiguous, gigantic file is an awkward thing for both you and the operating system to handle.

    When you process the file sequentially (and especially if you can “hint” to the OS that you intend to read the file from stem to stern), the operating system is automagically going to do a lot of buffering for you. It will take deep draughts of the file data each time it does a disk-read. In short, the operation will be quite a bit faster than you think.

    Now that, of course, assumes that the drive is local to the machine that is doing the reading. If data is flowing across any sort of network wire, then the situation is utterly and completely different ... such that you basically need to find a way to do the work somehow on the locally-attached machine.

Node Type: perlquestion [id://985296]
Front-paged by Arunbear