Fast/Efficient Sort for Large Files

by mjmaresca (Sexton)
on Dec 19, 2002 at 07:25 UTC ( [id://221059] )

mjmaresca has asked for the wisdom of the Perl Monks concerning the following question:

I need to be able to sort large files of the following variety. Example:
18000 40000600000 1.2 2.3 2.2 5.6 7.2
20000 40000600001 2.2 3.5 3.4 5.6 3.5
15000 40000600003 2.2 3.5 3.4 5.6 3.5
20000 40000600006 2.2 3.5 3.4 5.6 3.5
20000 40000600010 2.2 3.5 3.4 5.6 3.5
20000 40000600015 2.2 3.5 3.4 5.6 3.5
20000 40000600021 2.2 3.5 3.4 5.6 3.5
20000 40000600028 2.2 3.5 3.4 5.6 3.5
20000 40000600036 2.2 3.5 3.4 5.6 3.5
20000 40000600045 1.2 2.3 2.2 5.6 7.2
19000 40000600000 1.2 2.3 2.2 5.6 7.2
19000 40000600001 1.2 2.3 2.2 5.6 7.2
19000 40000600003 1.2 2.3 2.2 5.6 7.2
19000 40000600006 1.2 2.3 2.2 5.6 7.2
20000 40000600010 2.2 3.5 3.4 5.6 3.5
19000 40000600015 2.2 3.5 3.4 5.6 3.5
20000 40000600021 2.2 3.5 3.4 5.6 3.5
19000 40000600028 2.2 3.5 3.4 5.6 3.5
20000 40000600036 2.2 3.5 3.4 5.6 3.5
19000 40000600045 2.2 3.5 3.4 5.6 3.5
The sort needs to be in ascending order based on the first two fields of the record. I will need to sort 100 million records. Currently I'm using the File::Sort module and getting through about 4 million records in about 8 minutes, but it's single-threaded. It seems like I need to implement some kind of merge sort, where I sort the smaller files and then merge them. Has anybody implemented a Perl solution that can handle this many records per file? I'm running on a Sun F4800 (8 CPUs, 2GB RAM) under Solaris 8.

Edit: Added <code> tags around the data. larsen

Replies are listed 'Best First'.
Re: Fast/Efficient Sort for Large Files (use the sort(1) program)
by grinder (Bishop) on Dec 19, 2002 at 08:38 UTC

    So, you want fast and efficient. Given the sample you have provided, the first two fields are fixed length. This simplifies the problem tremendously:

    if ( system( '/usr/bin/sort', '-o', $filename, $filename ) ) {
        my $exit = $? >> 8;      # exit status of sort(1)
        my $sig  = $? & 127;     # signal it died from, if any
        my $core = $? & 128;     # whether it dumped core
        die "failed to sort $filename: exit code $exit, signal $sig, core? $core\n";
    }

    Were this not the case, I would have pored over the manpage until I could figure out which switches were needed to specify a trickier combination of field specifications, but that would not change the point of the argument that follows.

    If it's performance you are after, you have simply no choice but to use the sort(1) utility. It has years of optimisations and bug-fixes. There is nothing you can do in Perl which will approach its speed or reliability, so any custom solution you develop should use it as the basic building block.

    What you could do, if sort does not take multiple CPUs into account [1] (my Solaris box is a monoprocessor, so I can't tell), would be to take the file and split out every n million lines into separate files, so that each file is held on a different physical device (and not merely on different slices of the same disk), and you have x files in total, where x is the number of CPUs you want to sort with. Then spawn enough copies of sort to sort them all in parallel. This will spread the IO load across separate devices. And pray that the OS is smart enough to bind each instance to a separate CPU... When they're all done, mergesort them back together (with sort -m).
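
    A rough sketch of what that might look like from Perl, assuming the file has already been split into per-device chunk files (the chunk names below are made up): each child execs sort(1) on its own chunk, the parent waits for them all, and sort -m does the final merge.

    #!/usr/bin/perl
    # Sketch only: sort pre-split chunks in parallel, then merge with sort -m.
    use strict;
    use warnings;

    my @chunks = ('/disk1/chunk.00', '/disk2/chunk.01', '/disk3/chunk.02');  # hypothetical

    my @pids;
    for my $chunk (@chunks) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                    # child: sort one chunk in place
            exec '/usr/bin/sort', '-o', $chunk, $chunk
                or die "exec sort: $!";
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;                # wait for every child sort to finish

    # Merge the already-sorted chunks into the final output file.
    system('/usr/bin/sort', '-m', '-o', 'sorted.out', @chunks) == 0
        or die "merge failed: $?";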

    Update: Having read grantm's reply elsewhere in this thread, it should be noted that sort has a command-line switch (-y) that can be employed to control its memory consumption.

    Another solution would be to split the file into pieces in the same directory, using the split(1) command. I would predict that this will be much faster than using Perl to break the file up, although your IO contention goes up (during the sorting phase). You'll have to try all three approaches to see which one is fastest. I'd go with the simplest solution first myself. The other solutions I have described sound like a lot of effort for what will probably be fairly meagre returns. If you do get something interesting out of it, though, I'd be very interested to hear the results.

    Update: [1] To clarify my statement "if sort does not take multiple CPUs into account," what I meant is that if the file to be sorted exceeds a certain size, sort(1) will perform a disk-based mergesort anyway, which is what the subsequent suggestions spell out in more detail. If you know you are going to go to disk, you may be able to beat its default behaviour, but I wouldn't like to lay money on it.


    print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'
Re: Fast/Efficient Sort for Large Files
by grantm (Parson) on Dec 19, 2002 at 08:14 UTC

    Just to reiterate the chatterbox conversation:

    If you're looking for speed and efficiency, then Perl is probably not going to figure in the solution.

    The data looks fairly compatible with the Solaris sort command. If it chokes on the volume of data, try the GNU sort.

    If you still have problems, try using:

    • The 'split' command to subdivide the main file into temp files
    • The 'sort' command to sort each temp file (perhaps in parallel, but probably not due to memory demands)
    • The 'sort -m' command to merge the sorted temp files
    • The 'rm' command to clean up the temp files :-)
Re: Fast/Efficient Sort for Large Files
by shotgunefx (Parson) on Dec 19, 2002 at 07:44 UTC
    In a similar situation, I piped out to GNU sort. If for some reason you don't want to do this, perhaps the source would be a good guide to implementing it in Perl (Inline::C, perhaps?); of course, there are GPL issues.

    I don't know if having more threads is going to speed things up. I would think that you're mainly IO bound.

    -Lee

    "To be civilized is to deny one's nature."
Re: Fast/Efficient Sort for Large Files
by Aristotle (Chancellor) on Dec 19, 2002 at 08:46 UTC
    It might be worth investigating a solution that inserts the data into a DBD::SQLite database and then just sorts using a SELECT. I've seen this applied with very good results to munging huge logfiles - although nowhere near the size you mention, so I'm not sure how well it will work for you. In any case, I'd give it a shot - if it works, it'll be hardly more than two dozen lines of code (mostly DBI red tape) and pretty efficient too.
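
    A minimal sketch of that approach, assuming made-up table, column and file names (and untested at anything like 100 million rows):

    #!/usr/bin/perl
    # Load the records into SQLite, then let the database do the ORDER BY.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=sort.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do('CREATE TABLE recs (f1 INTEGER, f2 INTEGER, rest TEXT)');

    my $ins = $dbh->prepare('INSERT INTO recs VALUES (?, ?, ?)');
    open my $in, '<', 'input.txt' or die $!;
    while (my $line = <$in>) {
        chomp $line;
        my ($f1, $f2, @rest) = split ' ', $line;
        $ins->execute($f1, $f2, "@rest");
    }
    close $in;
    $dbh->commit;

    # Stream the rows back out in sorted order.
    my $sel = $dbh->prepare('SELECT f1, f2, rest FROM recs ORDER BY f1, f2');
    $sel->execute;
    open my $out, '>', 'sorted.txt' or die $!;
    while (my ($f1, $f2, $rest) = $sel->fetchrow_array) {
        print {$out} "$f1 $f2 $rest\n";
    }
    close $out;
    $dbh->disconnect;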

    Makeshifts last the longest.

Re: Fast/Efficient Sort for Large Files
by BrowserUk (Patriarch) on Dec 19, 2002 at 17:51 UTC

    FWIW: the code below, a split/sort/merge sort (possibly a Merge sort, but I don't have Knuth handy), processes a 1-million-record simulation of your data on a 233MHz/256MB machine with a badly fragged disc in a reasonable time.

    With 8 CPUs, this algorithm begs for forking to be added at the split-file sort stage. Not hard to do, but an entirely worthless exercise for me with my hardware.

    On the kind of hardware you have it should fly.

    It takes two parameters, infile and outfile, and one option, -N=n, which is the number of characters at the front of the line used to control the number of temp files. I used 2, for approx. 100 files of around 400KB each (I first wrote 4MB - should've used that commify routine on the numbers :), but with your hardware the default of 1, giving 10 files, should do.

    You'll need to check the sort parameters, as I made a goof when generating my test data. I think they are OK as is, but check. Maybe I'll get around to making the keys specifiable from the command line.

    There is a lot that could be done to make this stronger and more efficient, but it's somewhere for you to start should you feel so inclined to write your own.

    #! perl -sw
    use vars qw/$N/;
    use strict;
    no strict 'refs';
    $|++;

    my $reclen = 37;    #! Adjust to suit your records/line ends.
    $N = $N || 1;

    warn "Usage: $0 [-N=n] file\n" and exit(-1) unless @ARGV;

    warn "Reading input file $ARGV[0] ", -s $ARGV[0], "\n";
    if ( not defined $ARGV[1] ) {
        warn "Output file not specified. Continue [N|y]?";
        exit -1 if <STDIN> !~ /^Y/i;
    }

    $/ = \$reclen;
    open INPUT, '<', $ARGV[0] or die $!, $ARGV[0];
    binmode(INPUT);

    my (@fhs);

    while ( <INPUT> ) {
        my $key = substr($_, 0, $::N);
        if (not defined $fhs[$key]) {
            $fhs[$key] = "temp.$key";
            warn( "\rCreating file: $fhs[$key] ");
            open( $fhs[$key], ">$fhs[$key]")
                or die( "Could create $fhs[$key]: $!");
            binmode($fhs[$key]);
        }
        print {$fhs[$key]} $_;
    }

    #! Get rid of unused filehandles or those that reference zero-length files.
    @fhs = grep{ $_ and ! -z $_ } @fhs;
    close $_ for @fhs;
    close INPUT;

    warn "Split made to: ", scalar @fhs, " files\n";

    #! Sort the split files on the first & second field
    for my $fh (@fhs) {
        warn "$fh: reading;...";
        open $fh, "<$fh" or die $!;
        binmode($fh);
        my @recs = <$fh>;
        close $fh;

        warn " sorting: ", scalar @recs, " recs;...";
        @recs = sort {
            substr($b, 0, 5)  <=> substr($a, 0, 5)
                ||
            substr($b, 6, 11) <=> substr($a, 6, 11)
        } @recs;

        warn " writing;...";
        open $fh, ">$fh" or die $!;
        binmode($fh);
        print $fh @recs;
        close $fh;
        warn "done;\n";
    }

    warn "Merging files: ";

    *SORTED = *STDOUT;
    open SORTED, '>', $ARGV[1] and binmode(SORTED) or die $! if $ARGV[1];

    for my $fh (reverse @fhs) {
        warn " $fh;";
        open $fh, "<$fh" and binmode($fh) or die $!;
        print SORTED <$fh>;
        close $fh;
    }

    warn "\nClosing sorted file: sorted\n";
    close SORTED;

    warn "Deleting temp files\n";
    unlink $_ or warn "Couldn't unlink $_\n" for @fhs;

    warn "Done.\n";
    exit (0);

    Examine what is said, not who speaks.

Re: Fast/Efficient Sort for Large Files
by perrin (Chancellor) on Dec 19, 2002 at 14:29 UTC
    Perl 5.8 uses merge sort instead of qsort. If you aren't already running 5.8, try it. You should see an improvement, unless your program is totally I/O bound.

      By default 5.8 uses a merge sort, but you can revert to quicksort if that'd work better for your data. See perldoc -f sort and perldoc sort for more explanation.
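
      For reference, the sort pragma that perldoc sort documents looks like this in Perl 5.8 (a minimal illustration; only one algorithm choice is in effect at a time):

      use sort '_mergesort';   # explicitly request the 5.8 default
      use sort '_quicksort';   # revert to the pre-5.8 quicksort
      use sort 'stable';       # ask for a stable sort, whichever algorithm is used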

        I wrote a program in C++ to sort web server cache logs by IP address. I discovered that a quicksort took forever to sort the data by the IP address, but a heapsort routine that I had in a programming manual was more efficient. If possible, see if you can utilize a binsort routine. If I recall my algorithms class correctly, binsort runs in O(n) time.
Re: Fast/Efficient Sort for Large Files
by waswas-fng (Curate) on Dec 19, 2002 at 16:55 UTC
    I have done this type of thing a lot on Solaris boxes. I need to know a few more bits of information from you to give you a fast, working solution:

    1. What type of sort are you using? Are you sorting on multiple "keys"? If so, are all the sorts in the same direction (all ascending, or mix and match)? Give me an example from your posted dataset.

    2. I need to know what your system usage looks like. Can you dump the second display of mpstat 1 2 while your Perl version is running?

    Let me know and I will be able to help.

    Edit: I missed some info in your post. Here is what I got just running a test on my devel Sun 420R (2 proc, 2GB RAM) box: 24 million records in 20.25 minutes. Here is how I did it:
    /usr/bin/sort -k1 -k2 -T /data2 -o t.sorted t
    this says "sort on the first field then if there are duplicates sort on the second, use /data2 as a temp file area and output to t.sorted instead of stdout. You will need 4x <size of inputfile> space free on the temp working dir. Also your sortrate may be better than mine cause I had a lot of duplicate records which forced the secondary key sort to happen way more frequently. If you need to sort on 40000600045 first and then break ties with the first field use -k2 -k1. Let me know if you have questions.

    -Waswas
      Just to add a few more points: the sort command above automatically does a split/sort/merge; it is still single-threaded, but I don't know that you will be able to get much faster by forking or threading it in Perl. Also, unlike the posts above, I would say do _NOT_ use the memory-limiter flag on the sort command, as it makes sort way slower. Solaris's sort is smart enough to use as much memory as it can to make the sort faster -- and not to try to do _everything_ in memory unless it can. Overall, Solaris sort has proved to be very fast compared to Perl's sort, except for the occasions where you must do more than 5 or 6 compound comparisons per row or do inline modifications to the data.

      -Waswas
Re: Fast/Efficient Sort for Large Files
by waswas-fng (Curate) on Dec 19, 2002 at 18:38 UTC
    One more note: if you have some budget, nsort is very fast for this type of work (http://www.ordinal.com/performance.html); for example, it does that same test sort (24 million records) in under 5 minutes on my 420R 2-proc Sun. The only catch is that it is pretty spendy, I think.

    -Waswas
Re: Fast/Efficient Sort for Large Files
by PhiRatE (Monk) on Dec 19, 2002 at 21:43 UTC
    Everyone else has offered helpful suggestions, so I'll ask the big question.

    How is it that you have 100,000,000 records that need to be sorted, and yet it isn't in a database?

      I suppose there are many reasons why you could have 100 million records in a flat file; as a sysadmin, some that come to mind are log files for busy web servers, name servers, or terse logging for Samba. Also, there are programs in the wild (believe it or not) that we don't have the source for and that have hardcoded outputs. =) Oh, and those pesky 10-year-old proprietary mainframe apps with their goofy output. /me wishes he could wave his magic wand and have all of this data magically be in relational DBs -- ACK! But then what would we really need a Practical Extraction and Report Language for? =)

      -Waswas
        For DBI? ;)

        Makeshifts last the longest.

Re: Fast/Efficient Sort for Large Files
by Anonymous Monk on Dec 20, 2002 at 00:10 UTC
    I faced something similar. The bottleneck wasn't precisely file size, but the fact that it was unknown in advance. (That's also why it wasn't in a database - it came from elsewhere.) I didn't end up using Chris Nandor's File::Sort as I first intended, among other things because I didn't want to put the time into tuning and because I wanted to hook in a progress indicator. Instead I wrote a bespoke analogue of a tape sort. This has recursion in the logic, but the actual code isn't recursive. I provide some snippets below; they tune their setup with the built-in sort, and they are slightly tuned by doing the final merge straight to a file, but there is room for more improvement. PML.
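
    (The snippets referred to above aren't reproduced here. As a rough illustration of the final merge phase such a tape-style sort ends with, here is a minimal k-way merge of already-sorted temp files; the file names and the two-field key layout are assumptions, not the poster's code.)

    #!/usr/bin/perl
    # Sketch: merge already-sorted chunk files into one output, smallest key first.
    use strict;
    use warnings;

    my @files = ('temp.00', 'temp.01', 'temp.02');   # hypothetical sorted chunks
    my $out   = 'sorted.txt';

    my @fh;
    for my $f (@files) {
        open my $h, '<', $f or die "open $f: $!";
        push @fh, $h;
    }

    # One "current" line per file; undef once that file is exhausted.
    my @cur = map { scalar readline $_ } @fh;

    # The first two fields are fixed width (5 + 1 + 11 chars), so the leading
    # 17 characters of each line compare correctly as a string key.
    sub keyof { substr $_[0], 0, 17 }

    open my $sorted, '>', $out or die "open $out: $!";
    while (grep { defined } @cur) {
        # Pick the file whose current line has the smallest key.
        my $min;
        for my $i (0 .. $#cur) {
            next unless defined $cur[$i];
            $min = $i if !defined $min or keyof($cur[$i]) lt keyof($cur[$min]);
        }
        print {$sorted} $cur[$min];
        $cur[$min] = readline $fh[$min];             # advance that file
    }
    close $sorted;
    close $_ for @fh;
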
Re: Fast/Efficient Sort for Large Files
by mce (Curate) on Dec 19, 2002 at 10:36 UTC
    Hi,

    Have a look at the Schwartzian Transform. Google or Super Search on it, and you'll get lots of hits.

    What it does is: it takes a subset of your strings and sorts these, and then uses the whole array and rearranges it based on the sorted subset.
    ---------------------------
    Dr. Mark Ceulemans
    Senior Consultant
    IT Masters, Belgium

      No it doesn't. It pre-computes the sort keys, sorts the entire array based on the keys, and discards them again, rather than using a naive sort function which recomputes the keys every time it needs them.

      And it's not going to help in this case, as the sort key is simply the initial substring of the data.
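
      For reference, a generic sketch of the transform being described (expensive_key() stands in for whatever costly key computation would justify it; as noted above, it wouldn't buy anything here):

      # Schwartzian Transform: compute each costly key once, sort on it, then discard it.
      my @sorted =
          map  { $_->[1] }                     # 3. keep only the original element
          sort { $a->[0] <=> $b->[0] }         # 2. sort on the precomputed key
          map  { [ expensive_key($_), $_ ] }   # 1. pair each element with its key
          @records;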

      --
      Tommy
      Too stupid to live.
      Too stubborn to die.

      In addition to tommyw's comments, a Schwartzian Transform also consumes extra memory, so even if it made sense here, it wouldn't exactly be efficient.

      Makeshifts last the longest.
