
Re: speed up one-line "sort|uniq -c" perl code

by l2kashe (Deacon)
on Apr 10, 2003 at 01:49 UTC ( [id://249515] )


in reply to speed up one-line "sort|uniq -c" perl code

In my testing, an inline split appears to be faster than using the -F command-line switch. Here are my results:

    # File stats ;)
    code/monks> wc -l input.list
    5202 input.list

    # my way
    code/monks> time perl -n -a -e'END { print "$v\t$k\n" while($k,$v)=each%h} $h{(split(/\|/))[9]}++' input.list
    846     10.155.240.2
    943     10.155.240.3
    3413    10.155.240.1
    0.160u 0.000s 0:00.15 106.6% 0+0k 0+0io 339pf+0w

    # your way
    code/monks> time perl -n -a -F\\\| -e'END { print "$v\t$k\n" while($k,$v)=each%h} $h{$F[9]}++' input.list
    846     10.155.240.2
    943     10.155.240.3
    3413    10.155.240.1
    0.310u 0.000s 0:00.30 103.3% 0+0k 0+0io 336pf+0w

As the file got bigger, the gap widened. I started with maybe 100 lines or so, and the difference was nonexistent, but once I hit 1000 lines my way was maybe 25% faster, and at 5200 lines my way takes roughly half the time (0.16s vs 0.31s user).

While I was posting, I decided to go ahead and create a really big file just for fun. Here's the output:

    # again stats...
    code/monks> ls -l input.list
    -rw-r--r-- 1 ericmc users 2477055 2003-04-09 21:42 input.list
    code/monks> wc -l input.list
    100202 input.list

    # my way
    code/monks> time perl -n -a -e'END { print "$v\t$k\n" while($k,$v)=each%h} $h{(split(/\|/))[9]}++' input.list
    16046   10.155.240.2
    17948   10.155.240.3
    66208   10.155.240.1
    2.720u 0.020s 0:02.73 100.3% 0+0k 0+0io 339pf+0w

    # your way..
    code/monks> time perl -n -a -F\\\| -e'END { print "$v\t$k\n" while($k,$v)=each%h} $h{$F[9]}++' input.list
    16046   10.155.240.2
    17948   10.155.240.3
    66208   10.155.240.1
    5.610u 0.010s 0:05.62 100.0% 0+0k 0+0io 336pf+0w

So it doesn't look like the gap keeps widening after a certain size (it stays at roughly 2x here: 2.72s vs 5.61s user), but at least it shaves that little bit extra off.. ;)
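If you want to check this on your own box without the shell and file I/O in the way, here is a minimal sketch using the core Benchmark module. The sample record is made up (12 pipe-delimited fields with the IP at index 9), and the "array" variant only approximates what -a -F does when it fills @F, so treat the numbers as a rough indication rather than a reproduction of the timings above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample record: 12 pipe-delimited fields, IP at index 9.
my $line = join( '|', ('x') x 9, '10.155.240.1', 'y', 'z' ) . "\n";

# Inline split with a list slice, as in the "my way" one-liner.
sub slice_split { return ( split /\|/, $line )[9] }

# Split into a full array first, roughly what -a -F does to fill @F.
sub array_split { my @F = split /\|/, $line; return $F[9] }

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese( -1, { slice => \&slice_split, array => \&array_split } );
```

Results will vary with the perl version and the shape of your data, so it is worth rerunning with a line copied from your real input.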

Update: I'm noticing that you aren't actually sorting anything in Perl. It could be that your time is getting chewed up by however you are doing your sorting. I would move from a one-liner to, say:

    #!/usr/bin/perl
    # update2: this is parse.pl
    while (<>) {
        $f = ( split('\|') )[9];

        # this needs to get optimized..
        ($data{$f}->[0] = $f) =~ s/\.//g unless( $data{$f} );
        $data{$f}->[1]++;
    }

    for ( sort { $data{$a}->[0] <=> $data{$b}->[0] } keys %data ) {
        print "$data{$_}->[1]\t$_\n";
    }

Sample run. Note that after each run I go through and change one of the lines so that we aren't running into OS caching.

    # file stats
    code/monks> wc -l input.list
    100202 input.list
    code/monks> ls -l input.list
    -rw-r--r-- 1 ericmc users 2410847 2003-04-09 22:11 input.list

    # and the first way
    code/monks> time perl -n -a -F\\\| -e'END { print "$v\t$k\n" while($k,$v)=each%h} $h{$F[9]}++' input.list
    16046   10.155.240.2
    17948   10.155.240.3
    66208   39.39.39.39
    5.490u 0.000s 0:05.49 100.0% 0+0k 0+0io 336pf+0w

    # now for my first iteration without sorting..
    code/monks> time perl -n -a -e'END { print "$v\t$k\n" while($k,$v)=each%h} $h{(split(/\|/))[9]}++' input.list
    17948   10.155.240.3
    66208   39.39.39.39
    16046   40.40.40.40
    2.750u 0.000s 0:02.74 100.3% 0+0k 0+0io 339pf+0w

    # and now with the code from above...
    code/monks> time ./parse.pl input.list
    66208   39.39.39.39
    16046   40.40.40.40
    17948   41.41.41.41
    2.190u 0.010s 0:02.26 97.3% 0+0k 0+0io 349pf+0w

YMMV, and you may want to sort on the count of hits as opposed to the IP, but that's a simple change to the sort block.
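One caveat on the sort key in parse.pl: stripping the dots and comparing numerically works for the addresses in this file, but it can collide or misorder in general (for example, 12.2.3.4 and 1.22.3.4 both become 12234). A possible alternative, sketched here with made-up addresses and not part of the timings above, is to pack the four octets into a byte string and compare with cmp:

```perl
use strict;
use warnings;

# Hypothetical addresses, chosen so the dot-stripped key collides or misorders.
my @ips = ( '12.2.3.4', '1.22.3.4', '10.155.240.2', '9.255.255.255' );

# Pack the four octets into a 4-byte string. Plain string comparison (cmp)
# of these keys then follows the numeric order of the addresses.
my %key = map { $_ => pack( 'C4', split /\./ ) } @ips;

my @sorted = sort { $key{$a} cmp $key{$b} } @ips;
print "$_\n" for @sorted;
```

I have not timed this against the dot-stripping version, so whether it costs anything noticeable on a 100k-line file is an open question.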
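For example, sorting by hit count could look something like the sketch below. The %data hash is filled in by hand here (using the counts from the first 100202-line run) purely for illustration; in parse.pl it would come out of the while loop, with the same [ sort key, count ] layout.

```perl
use strict;
use warnings;

# Hand-filled sample in the same shape parse.pl builds:
# IP => [ dot-stripped sort key, hit count ]
my %data = (
    '10.155.240.1' => [ 101552401, 66208 ],
    '10.155.240.2' => [ 101552402, 16046 ],
    '10.155.240.3' => [ 101552403, 17948 ],
);

# Sort by hit count, highest first; break ties on the IP key.
my @by_hits = sort {
    $data{$b}->[1] <=> $data{$a}->[1]
        ||
    $data{$a}->[0] <=> $data{$b}->[0]
} keys %data;

print "$data{$_}->[1]\t$_\n" for @by_hits;
```

Swapping $a and $b in the first comparison gives ascending order instead.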

/* And the Creator, against his better judgement, wrote man.c */
