Re: Working with a very large log file (parsing data out)

by topher (Scribe)
on Feb 25, 2013 at 17:05 UTC (#1020544)

in reply to Working with a very large log file (parsing data out)

At my previous job, I did a *lot* of log processing. As much as I love Perl, for quick and dirty ad hoc log mangling, awk was frequently my go-to tool. For cases exactly as you describe, I used the following:

cat logfile.log | awk '{count[$4]++}; END {for (x in count) {print count[x], x}};' | sort -nr

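By using an associative array (hash) to count the unique values, you only sort one line per distinct value instead of every line in the log. Depending on how repetitive the data is, that can reduce the amount of data you have to sort by orders of magnitude.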
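To make the difference concrete, here is a small sketch that runs both approaches on a made-up six-line log (the file contents and the variable names are assumptions for illustration only). The first pipeline pushes every line through sort; the second counts in the hash first, so only three lines reach sort.

```shell
# Build a tiny sample log; field 4 stands in for the value being counted.
# (The contents are invented for this example.)
sample=$(mktemp)
cat > "$sample" <<'EOF'
2013-02-25 17:05:01 GET /index.html 200
2013-02-25 17:05:02 GET /about.html 200
2013-02-25 17:05:03 GET /index.html 200
2013-02-25 17:05:04 POST /login 302
2013-02-25 17:05:05 GET /index.html 304
2013-02-25 17:05:06 GET /about.html 200
EOF

# Sort-everything approach: field 4 of every line goes through sort.
# The trailing awk only strips the padding that uniq -c adds.
sorted_all=$(awk '{print $4}' "$sample" | sort | uniq -c | sort -nr | awk '{print $1, $2}')

# Hash approach: only one line per distinct value reaches sort.
hashed=$(awk '{count[$4]++} END {for (x in count) print count[x], x}' "$sample" | sort -nr)

echo "$hashed"
# 3 /index.html
# 2 /about.html
# 1 /login

rm -f "$sample"
```

Both variables end up holding the same report; the savings only show up once the log is large and the number of distinct values is small by comparison.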

Note: This is not a "max performance" solution. It is a "usually fast enough" solution. If you want maximum performance, there are lots of additional things you can do to make this faster. One of the easiest (and one that often pays quick dividends on modern multi-core/multi-CPU systems) is to compress your log files. This decreases the disk IO, and on many systems reading the compressed file and decompressing it on the fly will be faster than reading the whole uncompressed file from disk.

zcat logfile.log.gz | awk '{count[$4]++}; END {for (x in count) {print count[x], x}};' | sort -nr
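As a quick sanity check that the compressed pipeline gives the same report, here is a minimal sketch using a made-up three-line log (the data and variable names are assumptions). It uses `gzip -dc`, which behaves like `zcat` on GNU systems and is the more portable spelling.

```shell
# Made-up sample data; field 4 is the value being counted.
log=$(mktemp)
printf '%s\n' \
  'a b c /index.html' \
  'a b c /index.html' \
  'a b c /login' > "$log"

# Compress to a separate file, keeping the original for comparison.
gzip -c "$log" > "$log.gz"

# Same awk program for both runs.
count='{count[$4]++} END {for (x in count) print count[x], x}'
plain=$(awk "$count" "$log" | sort -nr)
packed=$(gzip -dc "$log.gz" | awk "$count" | sort -nr)

echo "$packed"
# 2 /index.html
# 1 /login

rm -f "$log" "$log.gz"
```

Whether the compressed version is actually faster depends on your disks and CPUs, so it is worth prefixing each pipeline with `time` on a real log before committing to it.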

Another possible speedup would be to do a Perl equivalent of the awk, but to stop your line split at the number of fields you care about (plus 1 for "the rest"). This will frequently be faster than the awk example, but is slightly less convenient to type by hand every time you run an ad hoc query against a log file. Although, looking at them side by side, it's really not much more difficult; I think it's just the hundreds of times I typed the awk version that make it pop quickly from my fingers.

zcat logfile.log.gz | perl -ne '@line = split " ", $_, 5; $count{$line[3]}++; END {print "$count{$_} $_\n" for keys %count}'
