How to quickly parse a huge web log file?

dbmathis has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How to quickly parse a huge web log file? by BrowserUk (Patriarch) on Jul 21, 2007 at 17:18 UTC
9 million lines/second doesn't seem too bad to me. But you could save almost all of the memory, by just printing the matching lines as you find them, rather than pushing them to an array and printing them all at the end. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re: How to quickly parse a huge web log file? by graff (Chancellor) on Jul 21, 2007 at 18:58 UTC
You don't know about the "grep" command-line tool? (It's written in C and comes with every kind of unix / linux / unix-tools-for-windows -- it's been around for decades.) Whatever time it takes for this to run at the command line: `grep 2007-07-13 big_iis.log` [download] is going to be pretty much as fast as it can ever be done.	[reply] [d/l]
Re^2: How to quickly parse a huge web log file? by jwkrahn (Abbot) on Jul 21, 2007 at 19:32 UTC
You can speed that up by using fgrep instead. `:-)`	[reply] [d/l]
Re^3: How to quickly parse a huge web log file? by Anonymous Monk on Jul 22, 2007 at 04:15 UTC
xgrep is faster still	[reply]
Re: How to quickly parse a huge web log file? by dsheroh (Monsignor) on Jul 22, 2007 at 04:44 UTC
If you have to look at every line, then, yeah, that's probably going to be about as fast as you can get it, as others have said. 3 minutes is just how long it takes for the hard drive to read that much data and your code isn't going to be able to outrun the disk. But... Do you have to look at every line? Logfiles are generally already sorted, after all. Assuming that's true of yours, then you can stop as soon as you see the first entry from the 14th (or any date after the 13th). If you want even more of a speed boost, you can use `seek` to do a binary search in the file for the first entry on the 13th instead of starting at the beginning and schlepping through all the older stuff. And you might even be able to optimize the search a little more by first checking the earliest and latest dates it covers - if it's for 7/12 - 7/19, you'll probably do better to start looking somewhere around 14-15% into the file instead of at the center. But that's all assuming that the log is already sorted by date. If the entries are unsorted, then you pretty much have to look at every one of them and expect it to take at least 3 minutes.	[reply] [d/l]
Re^2: How to quickly parse a huge web log file? by BrowserUk (Patriarch) on Jul 22, 2007 at 05:02 UTC
That's a great point++. To that end, if the file is sorted, setting `$/ = \'the date';` and reading the first 'record' would move the comparison code from Perl into C, and locate the first record much more quickly. Potentially more quickly that coding a binary chop in Perl? After reading that first record, you just back up the seek position to the start of the line, reset `$/` to "\n", and read on using normal line-by-line semantics until the end of the lines for that day and stop. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re^2: How to quickly parse a huge web log file? by jhourcle (Prior) on Jul 22, 2007 at 06:58 UTC
For some log files the time stamps aren't exactly in order, and this process might lose a few records. You should be able to find out of your log files suffer from this problem in a unix shell with a combination of 'cut' to extract the date, piped to 'uniq', and see if each date only shows up once. The issue is this -- there are a few webservers that will record the time that the request was sent, but they write to the log until after it's been sent. If you have long-running CGIs and small static documents being served from the same server, and the server is very busy, you can end up with records from one day being written out before ones from the previous day. ... In general, though, this is a great suggestion, and even if you do suffer from this issue, you'll likely only lose minimal records.	[reply]
Re^2: How to quickly parse a huge web log file? by dbmathis (Scribe) on Jul 23, 2007 at 13:50 UTC
Hi superdoc, I see a few people have replied to my question :). I guess I should have looked back over here earlier as I spent my entire weekend figuring out on my own what you guys suggest here. The logfiles that I am dealing with are in the proper sort. I am having to look line by line, but not ness. every line. What I have ended up doing is this: a) I get the first and last date of the logfile. b) I check to see if the date that I seek is closer to the beginning or end of the file. c) I start search from the beginning or the end based on what end of the file the date is closer to. d) Once I start seeing the date seeked appear in the file I start looking for the next date. Once the next date is encountered I stop looking at the rest of the file. This usually cuts processing time by 50% or more. To answer others questions about using grep. I have been using grep, awk and sed to do these tasks for years and they don't appear to be any faster than perl regexp. What is this binary search that you speak of? This might help me out alot. Thanks	[reply]
Re^3: How to quickly parse a huge web log file? by Corion (Patriarch) on Jul 23, 2007 at 13:58 UTC
Binary Search	[reply]
Re: How to quickly parse a huge web log file? by Errto (Vicar) on Jul 21, 2007 at 21:30 UTC
I have heard rumors that using index can be faster than regexp matching when searching for fixed strings, but you'd have to test it to see for yourself. Update: Joost is right - I didn't see the ^ in the regexp earlier. Ignore this.	[reply]
Re^2: How to quickly parse a huge web log file? by Joost (Canon) on Jul 21, 2007 at 21:41 UTC
index() is probably not faster if you only want to test the very beginning of a string. doing a `substr($input,0,$length_of_match) eq $match` just might be a little faster, though. "What should it profit a man, if he should win a flame war, yet lose his cool?"	[reply] [d/l]


go ahead... be a heretic
	PerlMonks