PerlMonks  

Re^3: Working with a very large log file (parsing data out)

by mbethke (Hermit)
on Feb 21, 2013 at 00:02 UTC ( [id://1019866]=note )


in reply to Re^2: Working with a very large log file (parsing data out)
in thread Working with a very large log file (parsing data out)

True, I missed the part where he said it's an Apache log m(

I'd try to avoid making several passes over 1.5TB in Perl, though. If you just accumulate request counts in a hash keyed by date, as I just added above, you don't have to.

Re^4: Working with a very large log file (parsing data out)
by jhourcle (Prior) on Feb 21, 2013 at 14:03 UTC

    The second pass is against the reduced data, not the full file. This is more complex than it needs to be because it tries to preserve the order in which dates were first seen, whether or not they would sort cleanly. (The standard date format in webserver logs is DD/Mmm/YYYY, so if the log crosses a month boundary, you need a custom sort function.)

    cut -d' ' -f4 access_log | cut -b2-12 | uniq -c | perl -ne 'if (my ($count, $key) = m#(\d+)\s(\d\d/\w\w\w/\d{4})#) { push @keys, $key unless exists $counts{$key}; $counts{$key} += $count } END { print "$_\t$counts{$_}\n" foreach @keys }'

    Processes a 2.5M line / 330MB access log in 6.3 seconds. If it scales linearly and I'm doing my math right, that'd be 8.4 hrs for 1.5TB.

    If the file's compressed, and you pipe through gunzip -c or similar, you might get even better times, as you'll have reduced disk IO. I ran a 2.2M line / 420MB (uncompressed) / 40MB (compressed) file in 7sec (est. 7.4 hrs for 1.5TB). If you have the processors, you could also break the file into chunks, do all of the non-perl bits in parallel on each chunk, then recombine at the end ... but then you might have to actually be able to sort to get the output in the right order.
