relaxed137 has asked for the wisdom of the Perl Monks concerning the following question:

For some reporting, I need to parse a 500k text file. In this text file, I need to pull out the 10th field (sorted by the character | ) and then count the quantity of items in that field.

Here's my current code for doing this.

perl -n -a -F\\\| -e'END { print "$v\t$k\n" while($k,$v)=each%h} $h{$F +[9]}++' FILENAME
Output example:

( ... )

My most recent parse of this file came up with 350 unique IP addresses with anywhere from 1 to 57000 occurances.
The time to do this parse on a server I use was:

real 19m12.03s
user 17m38.42s
sys 0m7.87s

So, my question is - Can anyone out there help me make this even faster?
I know that I'm at the mercy of the CPU cycles, etc, but maybe there *is* a way to cut a few minutes off of this report.

As a note, I cannot use "cut -f10 -d| | sort | uniq -c" because the file is too big for sort to handle.

Replies are listed 'Best First'.
Re: speed up one-line "sort|uniq -c" perl code
by l2kashe (Deacon) on Apr 10, 2003 at 01:49 UTC
    In the testing I did, apparently an inline split is faster than using the -F command line arg... here's my results
    # File stats ;) code/monks> wc -l input.list 5202 input.list # my way code/monks> time perl -n -a -e'END { print "$v\t$k\n" while($k,$v)=eac +h%h} $h{(split(/\|/))[9]}++' input.list 846 943 3413 0.160u 0.000s 0:00.15 106.6% 0+0k 0+0io 339pf+0w # your way code/monks> time perl -n -a -F\\\| -e'END { print "$v\t$k\n" while($k, +$v)=each%h} $h{$F[9]}++' input.list 846 943 3413 0.310u 0.000s 0:00.30 103.3% 0+0k 0+0io 336pf+0w
    As the file got bigger the gap widened.. I started with maybe 100 lines or so, and the difference was non-existant, but once I hit 1000, my way was maybe 25% faster, at 5200 lines my way appears to be 50% faster...

    While I was posting I decided to go ahead and create a really big file just for fun.. heres the output
    # again stats... code/monks> ls -l input.list -rw-r--r-- 1 ericmc users 2477055 2003-04-09 21:42 input.list code/monks> wc -l input.list 100202 input.list # my way code/monks> time perl -n -a -e'END { print "$v\t$k\n" while($k,$v)=eac +h%h} $h{(split(/\|/))[9]}++' input.list 16046 17948 66208 2.720u 0.020s 0:02.73 100.3% 0+0k 0+0io 339pf+0w # your way.. code/monks> time perl -n -a -F\\\| -e'END { print "$v\t$k\n" while($k, +$v)=each%h} $h{$F[9]}++' input.list 16046 17948 66208 5.610u 0.010s 0:05.62 100.0% 0+0k 0+0io 336pf+0w
    So it doesnt look like it matters after a certain size, but at least it shaves that little bit extra off.. ;)

    Update: Im noticing that you aren't actually sorting anything in perl.. it could be that your time is getting chewed up however you are doing your sorting.. I would move from a one liner to say..
    #!/usr/bin/perl # update2: this is while (<>) { $f = ( split('\|') )[9]; # this needs to get optomized.. ($data{$f}->[0] = $f) =~ s/\.//g unless( $data{$f} ); $data{$f}->[1]++; } for ( sort { $data{$a}->[0] <=> $data{$b}->[0] } keys %data ) { print "$data{$_}->[1]\t$_\n"; } Sample run, Note after each run I go through and change one of the lin +es so that we arent running into OS caching. # file statscode/monks> wc -l input.list + 100202 input.list code/monks> ls -l input.list -rw-r--r-- 1 ericmc users 2410847 2003-04-09 22:11 input.list # and the first way code/monks> time perl -n -a -F\\\| -e'END { print "$v\t$k\n" while($k, +$v)=each%h} $h{$F[9]}++' input.list 16046 17948 66208 5.490u 0.000s 0:05.49 100.0% 0+0k 0+0io 336pf+0w # now for my first iteration without sorting.. code/monks> time perl -n -a -e'END { print "$v\t$k\n" while($k,$v)=eac +h%h} $h{(split(/\|/))[9]}++' input.list 17948 66208 16046 2.750u 0.000s 0:02.74 100.3% 0+0k 0+0io 339pf+0w # and now with the code from above... code/monks> time ./ input.list 66208 16046 17948 2.190u 0.010s 0:02.26 97.3% 0+0k 0+0io 349pf+0w
    YMMV and you may want to sort on the count of hits as opposed to the IP, but thats a simple replace in the sort..

    /* And the Creator, against his better judgement, wrote man.c */
Re: speed up one-line "sort|uniq -c" perl code
by tachyon (Chancellor) on Apr 10, 2003 at 01:44 UTC

    Essentially you have:

    #!/usr/bin/perl my %h; my @res; # declare outside the loop open F, $ARGV[0] or die $!; while(<F>) { @res = split /\|/; $h{$res[9]}++; } close F; print "$h{$_}\t$_\n" for keys %h;

    You may find this faster (maybe not) as we declare the @res array outside the loop which avoids creating an anonymous array to hold the split values, and then destroying it after each iteration of the loop. I suspect this may be happening in your one liner as even on a dog slow old PII 233 I have similar scripts that will parse megabyte squid/httpd log files in a few seconds - definitely not minutes.

    Alternatively slurping the file into memory and using a regex to process it in one pass may be faster it you have a major disk IO issue as the choke rather than CPU cycles. This is a spend more memory to get more speed approach. It is typically much faster to read in large chunks of data than iterate over a file linewise.

    #!/usr/bin/perl { local $/; open F, $ARGV[0] or die $!; $all = <F>; close F; @ips = $all =~ m/^(?:[^\|]+\|){9}([^\|]+)/gm; undef $all; # try to reclaim the memory } my %h; $h{$_}++ for @ips; print "$h{$_}\t$_\n" for keys %h;

    If neither of these work code it in C or get better hardware or just make a cup of coffee while it runs or nice it up to the max so it takes over the entire system.




Re: speed up one-line "sort|uniq -c" perl code
by perrin (Chancellor) on Apr 10, 2003 at 02:32 UTC
    I know this isn't really the answer you were looking for, but I think you should consider getting a better sort program. I don't think there are any limitations on the GNU one, and it is typically faster than doing the same thing in Perl.

      sort needs both time and space to perform the sort no matter how cleverly implemented. I find it hard to imagine a system that is so poorly configured that it can't handle sorting a paultry 500kB file. But I don't think that really matters in this particular case.

      There is a reason that "sort -u" came to be. It is much slower to sort all 57000 instances of several IPs and then throw all but one of each away. So I think "sort | uniq -c" would be much slower than using Perl.

      Unfortunately, it doesn't appear that even GNU sort has bothered to implement a -u option that counts the duplicates.

                      - tye
        Thanks for making me realize a typo.
        The file that I am parsing is 500MB, not 500kB....
        That's why sort freaks out.
Re: speed up one-line "sort|uniq -c" perl code
by Jasper (Chaplain) on Apr 10, 2003 at 12:57 UTC
    Just a silly optimisation, really, and I doubt one where your code is losing out greatly, but:
    print $_ , $i++ % 2 ? "\n" : "\t" for %h;

    Is a bit faster than eaching through the hash.

Re: speed up one-line "sort|uniq -c" perl code
by BrowserUk (Patriarch) on Apr 11, 2003 at 04:10 UTC

    In some fairly crude testing on a test file of 500,000 randomly generated lines (18MB) to fit the pattern of data you described, this seems to run about 400% quicker. 40 seconds rather 160.

    no warnings; $^W=0; open my $in, $ARGV[0] or die "Couldn't open $ARGV[0]:$!"; my ($buffer, %h) = ''; keys %h = 1024*500; while (sysread($in, $buffer, 16384, length $buffer)) { $h{$1}++ while $buffer =~ m[^(?:.+?\|){9}([^|]+)\|]mg; $buffer = substr($buffer, rindex($buffer, "\n")); } print scalar keys %h;

    You may need to adjust the regex, and you might squeeze some more out by playing with the read size and/or the hash preallocation. In my tests, I had variable results from both, but the differences were of within the bounds of run-to-run error. Especially the latter which I am not really sure how the number of buckets relates to keys.

    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
      this looks like it is going to end up being faster. I'm doing some testing with speed, but it looks like your numbers are going to be comparible with mine. THANKS!
Re: speed up one-line "sort|uniq -c" perl code
by maksl (Pilgrim) on Apr 14, 2003 at 19:46 UTC
    you would gain a speed by using lesser processes in your bash one liner,instead of "sort |uniq":
    sort -u
    try to use nl for the line number:
    sort -u | nl -ba
    Update: bad answer .. this question was not asked .. thx Aristotle
      Uhm, he's not at all interested in a line number, what he wants is an occurence count per word.

      Makeshifts last the longest.