http://www.perlmonks.org?node_id=126837

orbital has asked for the wisdom of the Perl Monks concerning the following question:

After posting my example of a Schwartzian Transform in Complex Sorting Optimization? for suggestions on optimization I made a few more discoveries.

I need to Thank tye for his sorting alternative, by building a single sort key and then sort of that, instead of or`ing my sort priority. I instantly notice 4 times the performance working on a 3MB file.

This enlightnment from tye made me question what other ways can I sort by and how can I sort faster, that is when I found A Fresh Look at Efficient Perl Sorting by Uri Guttman and Larry Rosler. This paper is incredible, if you have any questions about your own sorting techniques go read this paper now!!!

Taking my new sense of enlightnment I used tye code example to try and sort a very large log file. My test logfile was 100MB, my first problem was memory, loading the whole log into memory eventually sucked all my resources (382 MB of ram), since I was also building sort arrays and then an index array. So I made a very slight modification by making an array with references to the matching line not the line itself, I did this by getting the offset of the line. This saved me quite a bit of memory overhead at least enough to run the script without hitting virtual memory. Here is the new version of the code snippet:

my $infile = $ARGV[0]; my $outfile = $infile; $outfile =~ s/.{3}$/out/i; open(IN, $infile ) || die "can't open logfile: $infile\n"; open(OUT, ">$outfile" ) || die "can't create datfile: $outfile\n"; my ($offset, @sorton); while (<IN>) { if( m/^(.+?)\((\d+)\)\s-\s\[(.+?)\].+?"(.*?)"\.$/ ) { push @sorton, $1.pack("N",$2).$3; push @lines, $offset; 1; } $offset = tell(IN); } my @idx= sort { $sorton[$a] cmp $sorton[$b] } 0..$#sorton; for( @lines[@idx] ) { seek IN, $_, 0; print OUT scalar(<IN>); }

Yes I know I should be putting an exclusive lock on my file esp.since I am keying off the line offsets.

What I found from running this script on the large file was horrific. It took over 12 hours to sort this file! With a few very crued checks I found that the bottleneck was the line where I sort into @idx. All the steps above are completed within minutes, and the write step is done in less time then that. I guess my frusration is that this code sample was ported into C++, using a lot of the same techinques and the whole 100MB file ran in 2 minutes!!!! Thats a huge difference.

Does perl built in sort have that much overhead? Would I be better off not using the built in sort at all?