Re: Working with a very large log file (parsing data out)

by mbethke (Hermit)
on Feb 20, 2013 at 17:20 UTC


in reply to Working with a very large log file (parsing data out)

++ to what MidLifeXis said. As logs tend to be sorted already, you can likely avoid the sort, which is the only part likely to be a problem memory-wise.

To add to that: for data this size it may be worth running a little preprocessor written in C, especially if your log format has fixed-size fields or other delimiters easily recognized with C string functions. That way you can both split the parsing over two CPU cores and avoid running slow regexen (or even substr(), which is fast for Perl but still doesn't come close to C). Something like this (largely untested, but you get the idea):

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char buf[10000];
    FILE *fh;

    if (2 != argc) {
        fputs("Usage: filter <log>\n", stderr);
        exit(1);
    }
    if (!(fh = fopen(argv[1], "r"))) {
        perror("Cannot open log");
        exit(1);
    }
    while (fgets(buf, sizeof(buf), fh)) {
        static const size_t START_OFFSET = 50;
        size_t len = strlen(buf);
        char *endp;

        if ('\n' != buf[len - 1]) {
            fputs("WARNING: line did not fit in buffer, skipped\n", stderr);
            continue;
        }
        /* Fixed-width field starting at a fixed column: */
        endp = buf + START_OFFSET;
        len  = 20;
        /*
         * To search for a blank after the field instead of using a fixed width:
         *   endp = strchr(buf + START_OFFSET, ' ');
         *   len  = endp ? endp - (buf + START_OFFSET) : len - START_OFFSET;
         * (careful with strchr() == NULL)
         */
        fwrite(buf + START_OFFSET, 1, len, stdout);
        putchar('\n');   /* one record per line for the Perl stage */
    }
    return 0;
}

Edit: jhourcle's post just reminded me of the part I missed initially, namely that it's an Apache log. So if you use the standard combined format you could just use START_OFFSET=9 and len=11 to print only the date, if you don't want to differentiate by result code. Then a simple

my %h; while (<>) { chomp; $h{$_}++ }
would get the requests-per-date counts; the only slightly trickier part is getting them sorted chronologically for output. Something like
use Date::Parse;
for (sort { $a->[0] <=> $b->[0] } map { [ str2time($_), $_ ] } keys %h) {
    print "$_->[1]: $h{$_->[1]}\n";
}
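
To pull the pieces together, here is a pure-Perl sketch of the same idea without the C preprocessor, using substr() to grab the date field. The offset and length values are assumptions you'd adjust to your actual log layout; it will be slower than the C filter, but it's a single self-contained script:

#!/usr/bin/perl
use strict;
use warnings;
use Date::Parse qw(str2time);

# Assumed position of the DD/Mmm/YYYY date in each log line --
# adjust START_OFFSET/DATE_LEN to match your actual format.
my $START_OFFSET = 9;
my $DATE_LEN     = 11;

my %count;
while (my $line = <>) {
    next if length($line) < $START_OFFSET + $DATE_LEN;
    my $date = substr($line, $START_OFFSET, $DATE_LEN);
    $count{$date}++;
}

# Sort the dates chronologically via their epoch value and print the counts.
for my $date (sort { str2time($a) <=> str2time($b) } keys %count) {
    print "$date: $count{$date}\n";
}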

Re^2: Working with a very large log file (parsing data out)
by jhourcle (Prior) on Feb 20, 2013 at 20:58 UTC
    As logs tend to be sorted already, you can likely avoid the sort, which is the only part likely to be a problem memory-wise

    They're sorted by the time the request finished, but the time logged is when the request was made ... so a long-running CGI or a large file transfer at the end of day N might appear after lines from day N+1.

    But you still don't have to sort the whole file: you can get everything mostly in order, then in a second pass sum up the values that got split up.

      True, I missed the part where he said it's an Apache log m(

      I'd try to avoid making several passes over 1.5TB in Perl, though. If you just accumulate request counts in a hash keyed by date, as I added above, you don't have to.

        The second pass is against the reduced data, not the full file. It's more complex than it needs to be so that it preserves the order in which dates were seen, whether or not they sort cleanly. (The standard date format in webserver logs is DD/Mmm/YYYY, so if the log crosses months you need a custom sort function -- see the sketch after the one-liner.)

        cut -d\  -f4 access_log | cut -b2-12 | uniq -c | perl -e 'my(%counts,@keys);while((my($count,$key)=(<STDIN>=~m#(\d+)\s(\d\d/\w\w\w/\d{4})#))==2){push(@keys,$key) if !$counts{$key}; $counts{$key}+=$count} print "$_\t$counts{$_}\n" foreach @keys'
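
        A minimal sketch of such a custom sort, assuming clean DD/Mmm/YYYY keys in a %counts hash like the one built above (the helper name log_date_key is made up for illustration):

        # Map English month abbreviations to numbers so DD/Mmm/YYYY keys
        # can be compared as YYYYMMDD integers.
        my %mon = do {
            my $n = 1;
            map { $_ => $n++ } qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
        };

        sub log_date_key {
            my ($d, $m, $y) = split m{/}, shift;   # e.g. "20/Feb/2013"
            return sprintf '%04d%02d%02d', $y, $mon{$m}, $d;
        }

        # Chronological order even when the log spans several months or years:
        my @sorted = sort { log_date_key($a) <=> log_date_key($b) } keys %counts;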

        Processes a 2.5M line / 330MB access log in 6.3 seconds. If it scales linearly (1.5TB / 330MB is roughly 4,800 chunks at 6.3s each) and I'm doing my math right, that'd be about 8.4 hrs for 1.5TB.

        If the file's compressed and you pipe it through gunzip -c or similar, you might get even better times, as you'll have reduced disk IO. I ran a 2.2M line / 420MB (uncompressed) / 40MB (compressed) file in 7 sec (est. 7.4 hrs for 1.5TB). If you have the processors, you could also break the file into chunks, do all of the non-Perl bits in parallel on each chunk, then recombine at the end ... but then you might actually need a sort to get the output in the right order.
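
        A rough sketch of the chunk-in-parallel idea in Perl, assuming the log has already been split into per-chunk files (e.g. with split(1)); the glob pattern and the date regex are placeholders to adapt to your setup:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical chunk files produced beforehand, e.g. with split(1).
        my @chunks = glob 'access_log.chunk.*';

        my %kid;    # pid => read end of that child's pipe
        for my $chunk (@chunks) {
            pipe my $rd, my $wr or die "pipe: $!";
            defined(my $pid = fork) or die "fork: $!";
            if ($pid == 0) {                  # child: count dates in one chunk
                close $rd;
                open my $fh, '<', $chunk or die "$chunk: $!";
                my %count;
                while (<$fh>) {
                    # crude date grab; adjust the regex to your log format
                    $count{$1}++ if m{\[(\d\d/\w{3}/\d{4})};
                }
                print {$wr} "$_\t$count{$_}\n" for keys %count;
                exit 0;
            }
            close $wr;                        # parent keeps only the read end
            $kid{$pid} = $rd;
        }

        # Parent: merge the per-chunk counts.
        my %total;
        for my $pid (keys %kid) {
            my $rd = $kid{$pid};
            while (<$rd>) {
                chomp;
                my ($date, $n) = split /\t/;
                $total{$date} += $n;
            }
            waitpid $pid, 0;
        }
        # Lexicographic here; reuse a month-aware sort for true chronological order.
        print "$_\t$total{$_}\n" for sort keys %total;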
