topher
<p>At my previous job, I did a <em>lot</em> of log processing. As much as I love Perl, for quick-and-dirty ad hoc log mangling, awk was frequently my go-to tool. For cases exactly like the one you describe, I used the following:</p>
<p><code>cat logfile.log | awk '{count[$4]++} END {for (x in count) print count[x], x}' | sort -nr</code></p>
<p>By using an associative array (hash) to count each unique value, only the unique values, not every log line, reach <code>sort</code>, which can cut the amount of data to be sorted by orders of magnitude.</p>
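<p><em>For anything beyond a throwaway query, the same program reads better spread across lines; this is just the one-liner above reformatted, with comments:</em></p>
<pre><code>awk '
    { count[$4]++ }            # tally each unique value of field 4
    END {
        for (x in count)
            print count[x], x  # emit "count value" pairs
    }' logfile.log | sort -nr  # highest counts first
</code></pre>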
<p><em><strong>Note:</strong> This is not a "max performance" solution. It is a "usually fast enough" solution. If you want maximum performance, there are <strong>lots</strong> of additional things you can do to make this faster. One of the easiest (and one that often pays quick dividends on modern multi-core/multi-CPU systems) is to compress your log files: decompressing on the fly trades some CPU for less disk I/O, and on many systems that is faster than reading the whole uncompressed file from disk.</em></p>
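<p><em>(If the logs aren't compressed yet, that's a one-time step; assuming GNU gzip, <code>-k</code> keeps the original file around:)</em></p>
<p><code>gzip -k logfile.log</code></p>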
<p><code>zcat logfile.log.gz | awk '{count[$4]++} END {for (x in count) print count[x], x}' | sort -nr</code></p>
<p><em>Another possible speedup is a Perl equivalent of the awk that stops splitting each line at the number of fields you care about, plus one for "the rest" (here a limit of 5 gives fields 0 through 3 plus everything left over in element 4, so <code>$line[3]</code> matches awk's <code>$4</code>). This is frequently faster than the awk example, but slightly less suited to typing by hand every time you hit a log file with an ad hoc query. Although, looking at them side by side, it's really not much harder; I think it's just the hundreds of times I've typed the awk version that make it pop quickly from my fingers.</em></p>
<p><code>zcat logfile.log.gz | perl -ne '@line = split " ", $_, 5; $count{$line[3]}++; END {print "$count{$_} $_\n" for keys %count}' | sort -nr</code></p>
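<p><em>One more squeeze, as a sketch: the counts already live in a hash, so Perl can sort them itself and the external <code>sort</code> stage disappears entirely:</em></p>
<p><code>zcat logfile.log.gz | perl -ne '@line = split " ", $_, 5; $count{$line[3]}++; END {print "$count{$_} $_\n" for sort {$count{$b} <=> $count{$a}} keys %count}'</code></p>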