http://www.perlmonks.org?node_id=1020544


in reply to Working with a very large log file (parsing data out)

At my previous job, I did a *lot* of log processing. As much as I love Perl, for quick and dirty ad hoc log mangling, awk was frequently my go-to tool. For exactly the sort of case you describe, I used the following:

cat logfile.log | awk '{count[$4]++}; END {for (x in count) {print count[x], x}};' | sort -nr

By using an associative array (a hash, in Perl terms) to track the counts, you only have to sort one line per unique value instead of every line in the log, which can (potentially) reduce the amount of data hitting sort by orders of magnitude.
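
In case a throwaway script is easier to read than the one-liners, here is a rough sketch of the same hash-counting idea in plain Perl (the script name count4.pl and the hard-coded field index are just placeholders for whatever your log actually needs):

#!/usr/bin/perl
use strict;
use warnings;

my %count;
while (<>) {
    my @fields = split ' ';                           # whitespace split, like awk's default
    $count{ $fields[3] }++ if defined $fields[3];     # field 4, i.e. awk's $4
}
print "$count{$_} $_\n" for keys %count;              # one line per unique value

Run it as perl count4.pl logfile.log | sort -nr (or feed it from zcat, as above); only the unique values ever reach sort.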

Note: This is not a "max performance" solution; it is a "usually fast enough" solution. If you want maximum performance, there are lots of additional things you can do. One of the easiest (and one that often pays quick dividends on modern multi-core/CPU systems) is to compress your log files. This decreases disk I/O, and because the decompression runs on a core that would otherwise sit idle, on many systems it is faster than reading the whole uncompressed file from disk.

zcat logfile.log.gz | awk '{count[$4]++}; END {for (x in count) {print count[x], x}};' | sort -nr
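
The compression itself is a one-off step; plain gzip is shown here purely as an example, and your log rotation may well already be doing it for you:

gzip logfile.log    # replaces logfile.log with logfile.log.gz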

Another possible speedup would be to use a Perl equivalent of the awk one-liner, but to stop your line split at the number of fields you care about (plus 1 for "the rest"). This will frequently be faster than the awk example, but is slightly less suited to typing in manually every time you're hitting a log file for ad hoc queries. Although, looking at them side by side, it's really not much more difficult; I think it's just the hundreds of times I typed the awk version that makes it pop quickly from my fingers.

zcat logfile.log.gz | perl -ne '@line = split " ", $_, 5; $count{$line[3]}++; END { print "$count{$_} $_\n" for keys %count }'
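
One caveat: as written, the Perl version prints in whatever order the hash hands back its keys, so to match the sorted output of the awk pipelines you would either append | sort -nr or, as a variation I am only sketching here, let Perl do the sorting in the END block:

zcat logfile.log.gz | perl -ne '@line = split " ", $_, 5; $count{$line[3]}++; END { print "$count{$_} $_\n" for sort { $count{$b} <=> $count{$a} } keys %count }'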