http://www.perlmonks.org?node_id=628003

dbmathis has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I have a 1.5 GB IIS logfile that I am parsing, pulling out any line that contains a certain date, and dumping those lines into an array.

I have written the following (very basic, for demo purposes) code, which takes 3 minutes to parse the file and display the array contents:

if ( ! open WEBLOG, "<$weblog" ) {
    die "$0: Cannot open weblog: $weblog\n";
}
while ( <WEBLOG> ) {
    if ( $_ =~ m/^2007-07-13/ ) {
        push @lines, $_;    # $_ already ends in "\n"
    }
}
print @lines;
Can someone tell me if there is a quicker way of getting these lines out of the file? I am trying to keep memory usage and disk read time as low as possible.

Maybe I am just outta luck and 3 minutes is what I will have to live with. :)

Re: How to quickly parse a huge web log file?
by BrowserUk (Patriarch) on Jul 21, 2007 at 17:18 UTC

    9 million lines in 3 minutes doesn't seem too bad to me. But you could save almost all of the memory by just printing the matching lines as you find them, rather than pushing them onto an array and printing them all at the end.
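
    For example, a minimal sketch of that approach (reusing the $weblog variable and the date from the question):

    if ( ! open WEBLOG, "<$weblog" ) {
        die "$0: Cannot open weblog: $weblog\n";
    }
    while ( <WEBLOG> ) {
        print if m/^2007-07-13/;    # print immediately, so nothing piles up in RAM
    }
    close WEBLOG;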


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: How to quickly parse a huge web log file?
by graff (Chancellor) on Jul 21, 2007 at 18:58 UTC
    You don't know about the "grep" command-line tool? (It's written in C and comes with every kind of unix / linux / unix-tools-for-windows -- it's been around for decades.) Whatever time it takes for this to run at the command line:
    grep 2007-07-13 big_iis.log
    is going to be pretty much as fast as it can ever be done.
      You can speed that up by using fgrep instead. :-)
        xgrep is faster still
Re: How to quickly parse a huge web log file?
by dsheroh (Monsignor) on Jul 22, 2007 at 04:44 UTC
    If you have to look at every line then, yeah, that's probably going to be about as fast as you can get it, as others have said. 3 minutes is just about how long it takes the hard drive to read that much data (1.5 GB in 180 seconds works out to roughly 8.5 MB/s), and your code isn't going to be able to outrun the disk.

    But... Do you have to look at every line?

    Logfiles are generally already sorted, after all. Assuming that's true of yours, you can stop as soon as you see the first entry from the 14th (or any date after the 13th). If you want even more of a speed boost, you can use seek to do a binary search in the file for the first entry on the 13th, instead of starting at the beginning and schlepping through all the older stuff. And you might even be able to optimize the search a little more by first checking the earliest and latest dates the file covers - if it spans 7/12 - 7/19, you'll probably do better to start looking somewhere around 14-15% of the way into the file instead of at the center.
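
    A rough sketch of that binary search (illustrative only: the sub name is mine, and it assumes every line starts with a sortable YYYY-MM-DD date, as IIS logs do):

    use strict;
    use warnings;

    # Return the byte offset of the first line for $date, or undef if
    # the file contains no such line.
    sub find_date_offset {
        my ( $path, $date ) = @_;
        open my $fh, '<', $path or die "$path: $!\n";
        my ( $lo, $hi ) = ( 0, -s $fh );

        # Halve the window until it is small, then finish with a linear scan.
        while ( $hi - $lo > 4096 ) {
            my $mid = int( ( $lo + $hi ) / 2 );
            seek $fh, $mid, 0 or die "seek: $!";
            <$fh>;    # discard the partial line we landed in
            my $line = <$fh>;
            if ( defined $line and substr( $line, 0, length $date ) lt $date ) {
                $lo = $mid;    # still too early, so search the right half
            }
            else {
                $hi = $mid;    # on or past the date, so search the left half
            }
        }

        seek $fh, $lo, 0 or die "seek: $!";
        <$fh> if $lo;    # resync to the next line boundary
        while ( defined( my $line = <$fh> ) ) {
            my $d = substr( $line, 0, length $date );
            return tell($fh) - length($line) if $d eq $date;
            return undef if $d gt $date;    # walked past it; the date is absent
        }
        return undef;
    }

    Once you have the offset, seek there and read lines until they stop matching the date.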

    But that's all assuming that the log is already sorted by date. If the entries are unsorted, then you pretty much have to look at every one of them and expect it to take at least 3 minutes.

      That's a great point++.

      To that end, if the file is sorted, setting $/ = 'the date'; and reading the first 'record' would move the comparison code from Perl into C, and locate the first record much more quickly. Potentially more quickly than coding a binary chop in Perl?

      After reading that first record, you just back up the seek position to the start of the line, reset $/ to "\n", and read on using normal line-by-line semantics until the lines for that day run out, then stop.
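
      In code, something like this sketch (it assumes, per the above, that the file is sorted and that the first occurrence of the date string in the file is at the start of a line; $weblog is borrowed from the question):

      open my $fh, '<', $weblog or die "$0: Cannot open weblog: $weblog\n";

      {
          local $/ = '2007-07-13';    # readline() now hunts for the date in C
          defined <$fh> or die "2007-07-13 not found\n";    # swallow everything before it
      }

      # The handle now sits just past the date text, mid-line; back up
      # to the start of that line.
      seek $fh, -length('2007-07-13'), 1 or die "seek: $!";

      while ( my $line = <$fh> ) {    # $/ is "\n" again outside the block
          last unless $line =~ /^2007-07-13/;    # first later date: stop
          print $line;
      }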


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      For some log files the time stamps aren't exactly in order, and this process might lose a few records.

      You should be able to find out whether your log files suffer from this problem in a unix shell with a combination of 'cut' to extract the date, piped to 'uniq' - if the log is properly sorted, each date will show up only once in the output.
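
      Something like this one-liner does the same check in Perl itself (assuming the timestamp is the first whitespace-separated field, and borrowing the big_iis.log name from the grep example above):

      perl -lane 'print "$F[0] recurs out of order at line $." if $F[0] ne $prev && $seen{$F[0]}++; $prev = $F[0]' big_iis.log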

      The issue is this -- there are a few webservers that will record the time that the request was received, but don't write the entry to the log until after the response has been sent. If you have long-running CGIs and small static documents being served from the same server, and the server is very busy, you can end up with records from one day being written out before ones from the previous day.

      ...

      In general, though, this is a great suggestion, and even if you do suffer from this issue, you'll likely lose only a handful of records.

      Hi superdoc,

      I see a few people have replied to my question :). I guess I should have looked back over here earlier, as I spent my entire weekend figuring out on my own what you guys have suggested here.

      The logfiles that I am dealing with are properly sorted. I am having to look line by line, but not necessarily at every line. What I have ended up doing is this:

      a) I get the first and last date of the logfile.
      b) I check to see if the date that I seek is closer to the beginning or end of the file.
      c) I start searching from the beginning or the end, based on which end of the file the date is closer to (a sketch of the scan-from-the-end case follows this list).
      d) Once the date I'm seeking starts appearing in the file, I start looking for the next date. Once the next date is encountered, I stop reading the rest of the file. This usually cuts processing time by 50% or more.
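
      A sketch of the scan-from-the-end case, using the CPAN module File::ReadBackwards (an illustration with names reused from the question, not the poster's actual code):

      use File::ReadBackwards;

      my $bw = File::ReadBackwards->new($weblog)
          or die "$0: Cannot open weblog: $weblog\n";

      my @lines;
      while ( defined( my $line = $bw->readline ) ) {
          if ( $line =~ /^2007-07-13/ ) {
              push @lines, $line;    # collected newest-first
          }
          elsif (@lines) {
              last;                  # we've walked past the start of that day
          }
      }
      print reverse @lines;          # restore chronological order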

      To answer others' questions about using grep: I have been using grep, awk, and sed for these tasks for years, and they don't appear to be any faster than Perl's regexes.

      What is this binary search that you speak of? This might help me out a lot.

      Thanks
Re: How to quickly parse a huge web log file?
by Errto (Vicar) on Jul 21, 2007 at 21:30 UTC
    I have heard rumors that using index can be faster than regexp matching when searching for fixed strings, but you'd have to test it to see for yourself.

    Update: Joost is right - I didn't see the ^ in the regexp earlier. Ignore this.