Slow at sorting?

orbital has asked for the wisdom of the Perl Monks concerning the following question:

After posting my example of a Schwartzian Transform in Complex Sorting Optimization? for suggestions on optimization I made a few more discoveries.

I need to Thank tye for his sorting alternative, by building a single sort key and then sort of that, instead of or`ing my sort priority. I instantly notice 4 times the performance working on a 3MB file.

This enlightnment from tye made me question what other ways can I sort by and how can I sort faster, that is when I found A Fresh Look at Efficient Perl Sorting by Uri Guttman and Larry Rosler. This paper is incredible, if you have any questions about your own sorting techniques go read this paper now!!!

Taking my new sense of enlightnment I used tye code example to try and sort a very large log file. My test logfile was 100MB, my first problem was memory, loading the whole log into memory eventually sucked all my resources (382 MB of ram), since I was also building sort arrays and then an index array. So I made a very slight modification by making an array with references to the matching line not the line itself, I did this by getting the offset of the line. This saved me quite a bit of memory overhead at least enough to run the script without hitting virtual memory. Here is the new version of the code snippet:

my $infile = $ARGV[0];
my $outfile = $infile;
$outfile =~ s/.{3}$/out/i;

open(IN, $infile ) || die "can't open logfile: $infile\n";
open(OUT, ">$outfile" ) || die "can't create datfile: $outfile\n";

my ($offset, @sorton);
while (<IN>) {
    if(  m/^(.+?)\((\d+)\)\s-\s\[(.+?)\].+?"(.*?)"\.$/  ) {
        push @sorton, $1.pack("N",$2).$3;
        push @lines, $offset;
        1;
    } 
    $offset = tell(IN);
}

my @idx= sort { $sorton[$a] cmp $sorton[$b] } 0..$#sorton;

for( @lines[@idx] ) {
 seek IN, $_, 0;
 print OUT scalar(<IN>);
}
[download]

Yes I know I should be putting an exclusive lock on my file esp.since I am keying off the line offsets.

What I found from running this script on the large file was horrific. It took over 12 hours to sort this file! With a few very crued checks I found that the bottleneck was the line where I sort into @idx. All the steps above are completed within minutes, and the write step is done in less time then that. I guess my frusration is that this code sample was ported into C++, using a lot of the same techinques and the whole 100MB file ran in 2 minutes!!!! Thats a huge difference.

Does perl built in sort have that much overhead? Would I be better off not using the built in sort at all?

Comment on Slow at sorting? Download Code

Replies are listed 'Best First'.

Re (tilly) 1: Slow at sorting?
by tilly (Archbishop) on Nov 22, 2001 at 02:06 UTC

DB_File

tilly's scratchpad

use strict;
use DB_File;

my %sorted;
$DB_BTREE->{compare} = sub {$_[1] cmp $_[0]};
tie (%sorted, 'DB_File', undef, O_RDWR|O_CREAT, 0640, $DB_BTREE) or di
+e $!;

%sorted = 1..30;
print map "$_\n", keys %sorted;
[download]

tye

each

keys

If your OS has large file support, your limit is available disk space. If it does not have large file support, your limit is about 2 GB for the temp file, which is a data structure for (considering the BTREE overhead) somewhat over a GB of data. (I would guestimate about 1.3 GB.)

There is also a File::Sort out there. I prefer the arbitrary text of DB_File rather than having the assumption of lines imposed on me. And if you want to sort data structures rather than text, be sure to look at MLDBM (and expect a big slowdown).

[reply]
[d/l]

Re: Slow at sorting?
by dws (Chancellor) on Nov 22, 2001 at 00:09 UTC

My test logfile was 100MB, my first problem was memory, loading the whole log into memory eventually sucked all my resources (382 MB of ram)

Whoa! Even after you've made one pass to build your sort key, chances are quite good that you're falling prey to virtual memory thrashing. If that's the case, then the C++ code might well run faster on the 100MB sample (depending on how you code, C++ is going to have a lower memory footprint), but at some point you're going to throw more data at the C++ code, and it'll fall over, too. Do you have any visibility onto virtual memory behavior (e.g., can you get a running page-fault count?)

As I see it, you have a few choices if you want to stay with Perl:

Add more physical memory (which these days is an inexpensive proposition), or
Break the file into smaller pieces, sort the pieces, then merge them later, or
Do both

[reply]

Re: Slow at sorting?
by shotgunefx (Parson) on Nov 22, 2001 at 01:13 UTC

Another possibility. When faced with sorting an 800 Meg Search index, I gave up using pure Perl and just used system to punt it off to GNU sort. It went from over 8 hours to around 30 minutes. I had no desire to recreate the functionality of handling files this size. I just needed it to work.

You also might want to check out Mastering Algorithims in Perl by O'Reilly. It has a really good chapter on sorting.

-Lee

"To be civilized is to deny one's nature."

[reply]

Re (tilly) 2: Slow at sorting?

by tilly (Archbishop) on Nov 22, 2001 at 01:56 UTC

There are slow patterns, which is why Perl 5.8 will default to a mergesort. But the slow patterns are a lot harder to hit than just sorting the data.

[reply]

Re(3): Slow at sorting?

by FoxtrotUniform (Prior) on Nov 22, 2001 at 02:39 UTC

Out of curiosity, why mergesort over (for example) heapsort?

-- :wq

[reply]

Re (tilly) 4: Slow at sorting?

by tilly (Archbishop) on Nov 22, 2001 at 03:21 UTC

Re: Re (tilly) 2: Slow at sorting?

by shotgunefx (Parson) on Nov 22, 2001 at 02:15 UTC

[reply]

Re: Slow at sorting?
by mirod (Canon) on Nov 22, 2001 at 02:54 UTC

Just one comment on sort efficiency: contrary to what a couple of posters seem to think, Perl sorts strings faster than numbers. The whole point of Uri and Larry's paper is that as string sort is the default, hard-coded sort, it does not have to call any Perl subroutine, and thus works _real_ fast. So the trick to fast sorting is to be able to do just a sort @array, not a sort {$a <=> $b}, @array.

And BTW if you are looking for a module to do this automagically, well it looks like File::Records is looking for a new owner.

[reply]
[d/l]

Re: Slow at sorting?
by clintp (Curate) on Nov 22, 2001 at 01:11 UTC

Your sample data set wasn't very large. Would it be possible for you to make even smaller sort keys? Getting the memory footprint of @sorton a bit smaller might help. Since the sort seems to be the killer, and not the preparation time, you might want to spend more time prepping the data.
Can you do that by making a completely numeric sort key? Since, except for bracketed field ([]), everything else seems to be numeric. Then you wouldn't need an alphabetic comparison here. Packing things down into a smaller value might prove useful. Watch for overflows.
At the very least, can you get that bracketed field down to a couple of characters instead of a long string?
How many records _are_ we talking about? Not "100MB" but, how many lines of data?
(tiny optimization)Can you pack the offset into the sort key as well? Your sort would be sort @sorton instead of needing the indirection there.

[reply]

Re: Slow at sorting?
by orbital (Scribe) on Nov 22, 2001 at 00:18 UTC

I know C++ code is going to be faster but this seems to be way way too fast.

[reply]

Re: Slow at sorting?
by orbital (Scribe) on Nov 22, 2001 at 01:31 UTC

clintp

I gave a sample of the matching lines in my previous post, but incase you missed it here you go.

CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654".
CD2\01100809.pdf(1) - [Invoice Date] Indexed key "10/08/2001".
CD1\01100809.pdf(1) - [Customer Name] Indexed key "FOOBAR".
CD2\01100809.pdf(1) - [Contact Name] Indexed key "Dr. FOO".
CD4\01100809.pdf(20) - [Account Number] Indexed key "54356564".
[download]

If you really want the full logfile I could strip out all sensitive data and throw it out somewhere where you can grab it, if this is something you really want to play with let me know...

[reply]
[d/l]

Re: Re: Slow at sorting?

by dws (Chancellor) on Nov 22, 2001 at 01:47 UTC

CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654".

According to an earlier post, the bold fields are the ones you're sorting on. If that's truly the case, and if the text within brackets comes from a limited dictionary, then this sort can be done entirely with numbers.

First, combine the CD number with the file number, yielding (in this case) 101100809 Then map "Account Number" into its pre-determined sequence number. You're then left sorting something like [ 101100809, 47, seek-address ] on the first two fields, which are now numbers.

[reply]
[d/l]
[select]

Re: Slow at sorting?

by guha (Priest) on Nov 22, 2001 at 02:42 UTC

pack("n N n", $1, $2, $3).$4

thereby we would sort CD# > 9 correctly, decrease the memory footprint and allow for the fast intrinsic ASCIIbetical sort.
Perhaps even to the point where fast hash lookups can be used.

It's a pity the data is in practice unavailable

Here's what I would like to time ...

.
.
my (%sort_hash);
my $offset = 0;
while (<IN>) {
    if(  m/^CD(\d+)\\(\d+)\.pdf\((\d+)\)\s-\s\[(.+?)\]/  ) {
        $sort_hash{pack("n N n", $1, $2, $3).$4} = $offset;
    } 
    $offset = tell(IN);
}

foreach my $k (sort keys %sort_hash) {
   seek IN, $sort_hash{$k}, 0;
   print OUT scalar(<IN>);
}
[download]

[reply]
[d/l]

Re: Re: Slow at sorting?

by petral (Curate) on Nov 22, 2001 at 03:56 UTC

Re: Re: Re: Slow at sorting?

by guha (Priest) on Nov 22, 2001 at 04:08 UTC

Re: Slow at sorting?
by guha (Priest) on Nov 22, 2001 at 00:29 UTC

Are for instance $1 and $3 of constant length ?

Update

Oops, just checked the previous post and found it! :-\

[reply]

Re: Slow at sorting?
by kwoff (Friar) on Nov 22, 2001 at 02:17 UTC

Is it possible to put the log into a database table? In that case, put an index on whatever columns need sorted and forget about perl sorting. Unless that kind of thing amuses you. :)

[reply]

Re: Slow at sorting?
by guha (Priest) on Nov 23, 2001 at 20:06 UTC

So I made myself some data with a script approximating something like orbitals example, using the rand()-funktion on CD#, pdf-name, unknown digits in parenthesis and the different strings in brackets.
I know it's not the real thing but for what I trying to find out it seems OK to me.

The two approaches I've tested is either using a hash, straightforward and easy on the programmer, see my previous post. Or using an array and tucking the filepos to the right side, see petrals post.

The tests were run on a measly Pentium 133 with 64MB RAM and the results are as of the table below(YMWV).

The empty cells indicate heavy use of virtual memory.
Impatience being a virtue I only gave it double the expected time before I hit Ctrl-C.

kLines Filesize Hash Array

MB sec sec

100 6,16 11 12

150 9,24 16 18

175 10,78 19 22

190 11,704 22 24

200 12,32 23 25

210 12,936 24 26

220 13,552 25 27

230 14,168 62 29

250 15,4 31

300 18,48 37

350 21,56 44

400 24,64 107

500 30,8

My interpretation is that if you got the memory then the hash method is slightly faster, but using the array method will take you about twice as far in term of possibel file sizes.

Thinking about it, it seems rather logical considering that the hash is both key and value whilst array is value only.

It's also nice to see that both variations behave linearly with increasing volume until VM sets in, just as it's written in A Fresh Look at Efficient Perl Sorting by Uri Guttman and Larry Rosler.

[reply]


laziness, impatience, and hubris
	PerlMonks