Fast/Efficient Sort for Large Files

by mjmaresca (Sexton)
on Dec 19, 2002 at 07:25 UTC ( [id://221059] )

mjmaresca has asked for the wisdom of the Perl Monks concerning the following question:

I need to be able to sort large files of the following variety. Example:
18000 40000600000 1.2 2.3 2.2 5.6 7.2
20000 40000600001 2.2 3.5 3.4 5.6 3.5
15000 40000600003 2.2 3.5 3.4 5.6 3.5
20000 40000600006 2.2 3.5 3.4 5.6 3.5
20000 40000600010 2.2 3.5 3.4 5.6 3.5
20000 40000600015 2.2 3.5 3.4 5.6 3.5
20000 40000600021 2.2 3.5 3.4 5.6 3.5
20000 40000600028 2.2 3.5 3.4 5.6 3.5
20000 40000600036 2.2 3.5 3.4 5.6 3.5
20000 40000600045 1.2 2.3 2.2 5.6 7.2
19000 40000600000 1.2 2.3 2.2 5.6 7.2
19000 40000600001 1.2 2.3 2.2 5.6 7.2
19000 40000600003 1.2 2.3 2.2 5.6 7.2
19000 40000600006 1.2 2.3 2.2 5.6 7.2
20000 40000600010 2.2 3.5 3.4 5.6 3.5
19000 40000600015 2.2 3.5 3.4 5.6 3.5
20000 40000600021 2.2 3.5 3.4 5.6 3.5
19000 40000600028 2.2 3.5 3.4 5.6 3.5
20000 40000600036 2.2 3.5 3.4 5.6 3.5
19000 40000600045 2.2 3.5 3.4 5.6 3.5
The sort needs to be in ascending order based on the first two fields of the record. I will need to sort 100 million records. Currently I'm using the File::Sort module and getting through about 4 million records in about 8 minutes, but it's single-threaded. It seems like I need to implement some kind of merge sort, where I sort the smaller files and then merge them. Has anybody implemented a Perl solution that can handle this many records per file? I'm running on a Sun F4800 (8 CPUs, 2GB RAM) under Solaris 8.

Edit: Added <code> tags around the data. larsen

Replies are listed 'Best First'.
Re: Fast/Efficient Sort for Large Files (use the sort(1) program)
by grinder (Bishop) on Dec 19, 2002 at 08:38 UTC

    So, you want fast and efficient. Given the sample you have provided, the first two fields are fixed length. This simplifies the problem tremendously:

    if ( system( '/usr/bin/sort', '-o', $filename, $filename ) ) {
        my $exit = $? >> 8;      # exit status of sort(1)
        my $sig  = $? & 127;     # signal it died from, if any
        my $core = $? & 128;     # whether it dumped core
        die "failed to sort $filename: exit code $exit, signal $sig, core? $core\n";
    }

    Were this not the case, I would have pored over the manpage until I could figure out which switches were needed to specify a trickier combination of field specifications, but that would not change the point of the argument that follows.

    If it's performance you are after, you have simply no choice but to use the sort(1) utility. It has years of optimisations and bug-fixes. There is nothing you can do in Perl which will approach its speed or reliability, so any custom solution you develop should use it as the basic building block.

    What you could do, if sort does not take multiple CPUs into account [1] (my Solaris box is a monoprocessor, so I can't tell), would be to take the file and split out every n million lines into separate files, so that each file is held on a different physical device (and not merely on different slices of the same disk), and you have x files in total, where x is the number of CPUs you want to sort with. Then spawn enough copies of sort to sort them all in parallel. This will spread the IO load across separate devices. And pray that the OS is smart enough to bind each instance to a separate CPU... When they're all done, mergesort them back together (with sort -m).
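
    A rough sketch of what that might look like from Perl, assuming the file has already been split into per-device chunk files (the chunk names below are made up): each child execs sort(1) on its own chunk, the parent waits for them all, and sort -m does the final merge.

    #!/usr/bin/perl
    # Sketch only: sort pre-split chunks in parallel, then merge with sort -m.
    use strict;
    use warnings;

    my @chunks = ('/disk1/chunk.00', '/disk2/chunk.01', '/disk3/chunk.02');  # hypothetical

    my @pids;
    for my $chunk (@chunks) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                    # child: sort one chunk in place
            exec '/usr/bin/sort', '-o', $chunk, $chunk
                or die "exec sort: $!";
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;                # wait for every child sort to finish

    # Merge the already-sorted chunks into the final output file.
    system('/usr/bin/sort', '-m', '-o', 'sorted.out', @chunks) == 0
        or die "merge failed: $?";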

    Update: Having read grantm's reply elsewhere in this thread, it should be noted that sort has a command-line switch (-y) that can be employed to control its memory consumption.

    Another solution would be to split the file into pieces in the same directory, using the split(1) command. I would predict that this will be much faster than using Perl to break the file up, although your IO contention goes up (during the sorting phase). You'll have to try all three approaches to see which one is fastest. I'd go with the simplest solution first myself. The other solutions I have described sound like a lot of effort for what will probably be fairly meagre returns. If you do get something interesting out of it, though, I'd be very interested to hear the results.

    Update: [1] To clarify my statement "if sort does not take multiple CPUs into account," what I meant is that if the file to be sorted exceeds a certain size, sort(1) will perform a disk-based mergesort anyway, which is what the subsequent suggestions spell out in more detail. If you know you are going to go to disk, you may be able to beat its default behaviour, but I wouldn't like to lay money on it.


    print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'
Re: Fast/Efficient Sort for Large Files
by grantm (Parson) on Dec 19, 2002 at 08:14 UTC

    Just to reiterate the chatterbox conversation:

    If you're looking for speed and efficiency, then Perl is probably not going to figure in the solution.

    The data looks fairly compatible with the Solaris sort command. If it chokes on the volume of data, try the GNU sort.

    If you still have problems, try using:

    • The 'split' command to subdivide the main file into temp files
    • The 'sort' command to sort each temp file (perhaps in parallel, but probably not due to memory demands)
    • The 'sort -m' command to merge the sorted temp files
    • The 'rm' command to clean up the temp files :-)
Re: Fast/Efficient Sort for Large Files
by shotgunefx (Parson) on Dec 19, 2002 at 07:44 UTC
    In a similar situation, I piped out to GNU sort. If for some reason you don't want to do this, perhaps the source would be a good guide to implementing it in Perl (Inline::C, perhaps?); of course, there are GPL issues.

    I don't know if having more threads is going to speed things up. I would think that you're mainly IO bound.

    -Lee

    "To be civilized is to deny one's nature."
Re: Fast/Efficient Sort for Large Files
by Aristotle (Chancellor) on Dec 19, 2002 at 08:46 UTC
    It might be worth investigating a solution that inserts the data into a DBD::SQLite database and then just sorts using a SELECT. I've seen this applied with very good results to munging huge logfiles - although nowhere near the size you mention, so I'm not sure how well it will work for you. In any case, I'd give it a shot - if it works, it'll be hardly more than two dozen lines of code (mostly DBI red tape) and pretty efficient too.
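
    A minimal sketch of that approach, assuming made-up table, column and file names (and untested at anything like 100 million rows):

    #!/usr/bin/perl
    # Load the records into SQLite, then let the database do the ORDER BY.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=sort.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do('CREATE TABLE recs (f1 INTEGER, f2 INTEGER, rest TEXT)');

    my $ins = $dbh->prepare('INSERT INTO recs VALUES (?, ?, ?)');
    open my $in, '<', 'input.txt' or die $!;
    while (my $line = <$in>) {
        chomp $line;
        my ($f1, $f2, @rest) = split ' ', $line;
        $ins->execute($f1, $f2, "@rest");
    }
    close $in;
    $dbh->commit;

    # Stream the rows back out in sorted order.
    my $sel = $dbh->prepare('SELECT f1, f2, rest FROM recs ORDER BY f1, f2');
    $sel->execute;
    open my $out, '>', 'sorted.txt' or die $!;
    while (my ($f1, $f2, $rest) = $sel->fetchrow_array) {
        print {$out} "$f1 $f2 $rest\n";
    }
    close $out;
    $dbh->disconnect;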

    Makeshifts last the longest.

Re: Fast/Efficient Sort for Large Files
by BrowserUk (Patriarch) on Dec 19, 2002 at 17:51 UTC

    FWIW: the code below, a split/sort/merge sort (possibly a Merge sort, but I don't have Knuth handy), processes a 1-million-record simulation of your data on a 233MHz/256MB machine with a badly fragged disc in a reasonable time.

    With 8 CPUs, this algorithm begs for forking to be added at the split-file sort stage. Not hard to do, but an entirely worthless exercise for me with my hardware.

    On the kind of hardware you have it should fly.

    It takes two parameters, infile and outfile, and one option, -N=n, which is the number of characters at the front of the line used to control the number of temp files. I used 2, for approx. 100 files of around 400KB each (I first wrote 4MB - should've used that commify routine on the numbers :), but with your hardware the default of 1, giving 10 files, should do.

    You'll need to check the sort parameters, as I made a goof when generating my test data. I think they are OK as is, but check. Maybe I'll get around to making the keys specifiable from the command line.

    There is a lot that could be done to make this stronger and more efficient, but it's somewhere for you to start should you feel so inclined to write your own.

    #! perl -sw
    use vars qw/$N/;
    use strict;
    no strict 'refs';
    $|++;

    my $reclen = 37;    #! Adjust to suit your records/line ends.
    $N = $N || 1;

    warn "Usage: $0 [-N=n] file\n" and exit(-1) unless @ARGV;

    warn "Reading input file $ARGV[0] ", -s $ARGV[0], "\n";
    if ( not defined $ARGV[1] ) {
        warn "Output file not specified. Continue [N|y]?";
        exit -1 if <STDIN> !~ /^Y/i;
    }

    $/ = \$reclen;
    open INPUT, '<', $ARGV[0] or die $!, $ARGV[0];
    binmode(INPUT);

    my (@fhs);

    while ( <INPUT> ) {
        my $key = substr($_, 0, $::N);
        if (not defined $fhs[$key]) {
            $fhs[$key] = "temp.$key";
            warn( "\rCreating file: $fhs[$key] ");
            open( $fhs[$key], ">$fhs[$key]")
                or die( "Could create $fhs[$key]: $!");
            binmode($fhs[$key]);
        }
        print {$fhs[$key]} $_;
    }

    #! Get rid of unused filehandles or those that reference zero-length files.
    @fhs = grep{ $_ and ! -z $_ } @fhs;
    close $_ for @fhs;
    close INPUT;

    warn "Split made to: ", scalar @fhs, " files\n";

    #! Sort the split files on the first & second field
    for my $fh (@fhs) {
        warn "$fh: reading;...";
        open $fh, "<$fh" or die $!;
        binmode($fh);
        my @recs = <$fh>;
        close $fh;

        warn " sorting: ", scalar @recs, " recs;...";
        @recs = sort {
            substr($b, 0, 5)  <=> substr($a, 0, 5)
                ||
            substr($b, 6, 11) <=> substr($a, 6, 11)
        } @recs;

        warn " writing;...";
        open $fh, ">$fh" or die $!;
        binmode($fh);
        print $fh @recs;
        close $fh;
        warn "done;\n";
    }

    warn "Merging files: ";

    *SORTED = *STDOUT;
    open SORTED, '>', $ARGV[1] and binmode(SORTED) or die $! if $ARGV[1];

    for my $fh (reverse @fhs) {
        warn " $fh;";
        open $fh, "<$fh" and binmode($fh) or die $!;
        print SORTED <$fh>;
        close $fh;
    }

    warn "\nClosing sorted file: sorted\n";
    close SORTED;

    warn "Deleting temp files\n";
    unlink $_ or warn "Couldn't unlink $_\n" for @fhs;

    warn "Done.\n";
    exit (0);

    Examine what is said, not who speaks.

Re: Fast/Efficient Sort for Large Files
by perrin (Chancellor) on Dec 19, 2002 at 14:29 UTC
    Perl 5.8 uses merge sort instead of qsort. If you aren't already running 5.8, try it. You should see an improvement, unless your program is totally I/O bound.

      By default 5.8 uses a merge sort, but you can revert to quicksort if that'd work better for your data. See perldoc -f sort and perldoc sort for more explanation.
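
      For reference, the sort pragma that perldoc sort documents looks like this in Perl 5.8 (a minimal illustration; only one algorithm choice is in effect at a time):

      use sort '_mergesort';   # explicitly request the 5.8 default
      use sort '_quicksort';   # revert to the pre-5.8 quicksort
      use sort 'stable';       # ask for a stable sort, whichever algorithm is used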

        I wrote a program in C++ to sort web server cache logs by IP address. I discovered that a quicksort took forever to sort the data by the IP address, but a heapsort routine that I had in a programming manual was more efficient. If possible, see if you can utilize a binsort routine. If I recall my algorithms class correctly, binsort runs in O(n) time.
Re: Fast/Efficient Sort for Large Files
by waswas-fng (Curate) on Dec 19, 2002 at 16:55 UTC
    I have done this type of thing a lot on Solaris boxes. I need to know a few more bits of information from you to give you a fast, working solution:

    1. What type of sort are you using? Are you sorting on multiple "keys"? If so, are all the sorts in the same direction (all ascending, or mix and match)? Give me an example from your posted dataset.

    2. I need to know what your system usage looks like. Can you dump the second display of mpstat 1 2 while your Perl version is running?

    Let me know and I will be able to help.

    Edit: I missed some info in your post. Here is what I got just running a test on my devel Sun 420R (2 proc, 2GB RAM) box: 24 million records in 20.25 minutes. Here is how I did it:
    /usr/bin/sort -k1 -k2 -T /data2 -o t.sorted t
    this says "sort on the first field then if there are duplicates sort on the second, use /data2 as a temp file area and output to t.sorted instead of stdout. You will need 4x <size of inputfile> space free on the temp working dir. Also your sortrate may be better than mine cause I had a lot of duplicate records which forced the secondary key sort to happen way more frequently. If you need to sort on 40000600045 first and then break ties with the first field use -k2 -k1. Let me know if you have questions.

    -Waswas
      Just to add a few more points: the sort command above automatically does a split/sort/merge; it is still single-threaded, but I don't know that you will be able to get much faster by forking or threading it in Perl. Also, unlike the posts above, I would say do _NOT_ use the memory-limiter flag on the sort command, as it makes sort way slower. Solaris's sort is smart enough to use as much memory as it can to make the sort faster -- and not to try to do _everything_ in memory unless it can. Overall, Solaris sort has proved to be very fast compared to Perl's sort, except for the occasions where you must do more than 5 or 6 compound comparisons per row or do inline modifications to the data.

      -Waswas
Re: Fast/Efficient Sort for Large Files
by waswas-fng (Curate) on Dec 19, 2002 at 18:38 UTC
    One more note: if you have some budget, nsort is very fast for this type of work (http://www.ordinal.com/performance.html); for example, it does that same test sort (24 million records) in under 5 minutes on my 420R 2-proc Sun. The only catch is that it is pretty spendy, I think.

    -Waswas
Re: Fast/Efficient Sort for Large Files
by PhiRatE (Monk) on Dec 19, 2002 at 21:43 UTC
    Everyone else has offered helpful suggestions, so I'll ask the big question.

    How is it that you have 100,000,000 records that need to be sorted, and yet it isn't in a database?

      I suppose there are many reasons why you could have 100 million records in a flat file; as a sysadmin, some that come to mind are log files for busy web servers, name servers, or terse logging for Samba. Also, there are programs in the wild (believe it or not) that we don't have the source for and that have hardcoded outputs. =) Oh, and those pesky 10-year-old proprietary mainframe apps with their goofy output. /me wishes he could wave his magic wand and have all of this data magically be in relational DBs -- ACK! But then what would we really need a Practical Extraction and Report Language for? =)

      -Waswas
        For DBI? ;)

        Makeshifts last the longest.

Re: Fast/Efficient Sort for Large Files
by Anonymous Monk on Dec 20, 2002 at 00:10 UTC
    I faced something similar. The bottleneck wasn't precisely file size, but the fact that it was unknown in advance. (That's also why it wasn't in a database - it came from elsewhere.) I didn't end up using Chris Nandor's File::Sort as I first intended, among other things because I didn't want to put the time into tuning and because I wanted to hook in a progress indicator. Instead I wrote a bespoke analogue of a tape sort. This has recursion in the logic, but the actual code isn't recursive. I provide some snippets below; they tune their setup with the built-in sort, and they are slightly tuned by doing the final merge straight to a file, but there is room for more improvement. PML.
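
    (The snippets referred to above aren't reproduced here. As a rough illustration of the final merge phase such a tape-style sort ends with, here is a minimal k-way merge of already-sorted temp files; the file names and the two-field key layout are assumptions, not the poster's code.)

    #!/usr/bin/perl
    # Sketch: merge already-sorted chunk files into one output, smallest key first.
    use strict;
    use warnings;

    my @files = ('temp.00', 'temp.01', 'temp.02');   # hypothetical sorted chunks
    my $out   = 'sorted.txt';

    my @fh;
    for my $f (@files) {
        open my $h, '<', $f or die "open $f: $!";
        push @fh, $h;
    }

    # One "current" line per file; undef once that file is exhausted.
    my @cur = map { scalar readline $_ } @fh;

    # The first two fields are fixed width (5 + 1 + 11 chars), so the leading
    # 17 characters of each line compare correctly as a string key.
    sub keyof { substr $_[0], 0, 17 }

    open my $sorted, '>', $out or die "open $out: $!";
    while (grep { defined } @cur) {
        # Pick the file whose current line has the smallest key.
        my $min;
        for my $i (0 .. $#cur) {
            next unless defined $cur[$i];
            $min = $i if !defined $min or keyof($cur[$i]) lt keyof($cur[$min]);
        }
        print {$sorted} $cur[$min];
        $cur[$min] = readline $fh[$min];             # advance that file
    }
    close $sorted;
    close $_ for @fh;
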
Re: Fast/Efficient Sort for Large Files
by mce (Curate) on Dec 19, 2002 at 10:36 UTC
    Hi,

    Have a look at the Schwartzian Transform. Google or Super Search on it, and you'll get lots of hits.

    What it does is: it takes a subset of your strings and sorts these, and then uses the whole array and rearranges it based on the sorted subset.
    ---------------------------
    Dr. Mark Ceulemans
    Senior Consultant
    IT Masters, Belgium

      No it doesn't. It pre-computes the sort keys, sorts the entire array based on the keys, and discards them again, rather than using a naive sort function which recomputes the keys every time it needs them.

      And it's not going to help in this case, as the sort key is simply the initial substring of the data.
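
      For reference, a generic sketch of the transform being described (expensive_key() stands in for whatever costly key computation would justify it; as noted above, it wouldn't buy anything here):

      # Schwartzian Transform: compute each costly key once, sort on it, then discard it.
      my @sorted =
          map  { $_->[1] }                     # 3. keep only the original element
          sort { $a->[0] <=> $b->[0] }         # 2. sort on the precomputed key
          map  { [ expensive_key($_), $_ ] }   # 1. pair each element with its key
          @records;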

      --
      Tommy
      Too stupid to live.
      Too stubborn to die.

      In addition to tommyw's comments, a Schwartzian Transform also consumes extra memory, so even if it made sense here, it wouldn't exactly be efficient.

      Makeshifts last the longest.
