Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Re: sorting very large text files

by gam3 (Curate)
on Jan 06, 2010 at 03:40 UTC ( #815874=note: print w/ replies, xml ) Need Help??


in reply to sorting very large text files

If the file is not so big that the keys will not fit into memory you can do this:

#!/usr/bin/perl use strict; use warnings; use vars qw( @data $prev $IN @sorted $OUT); $prev = 0; open($IN, '<', 'data'); while (<$IN>) { chomp; my $next = tell($IN); my ($key) = /^([^\s]*)/; push(@data, [ $key, $prev, $next - $prev ]); $prev = $next; } close($IN); @sorted = sort( { $a->[0] cmp $b->[0] } @data); open($IN, '<', 'data'); open($OUT, '>', 'sorteddata'); for (@sorted) { seek($IN, $_->[1], 0); sysread($IN, my $line, $_->[2]); print $OUT $line; } close($IN); close($OUT);
On the cooked data I tested, I got the following timings:
Gnu Sort:
# time sort --temporary-directory=/opt data > sort1

real    0m24.698s
user    0m22.539s
sys     0m1.950s

Perl Sort:
# time perl sort.pl 

real    0m55.900s
user    0m39.897s
sys     0m6.430s
The data file I used had a wc of:
#wc data
  4915200  34406400 383385600 data
I am surprised that this Perl script is only half the speed of Gnu sort on this data. I think that on a bigger data set, with long lines, it might even be able to sort faster that Gnu Sort.

UPDATE: Most of the time seems to be being spent in the output loop. All of the seeks seems to really slow things down.

-- gam3
A picture is worth a thousand words, but takes 200K.


Comment on Re: sorting very large text files
Download Code
Re^2: sorting very large text files
by BrowserUk (Pope) on Jan 06, 2010 at 05:12 UTC
    I am surprised that this Perl script is only half the speed of Gnu sort ... Most of the time seems to be being spent in the output loop.

    Your file consists of 5 million 78-byte lines. If you assume that the basic block size of your disk is 4k, then:

    • when reading sequentially, even if the file was totally fragmented and required a seek and/or disk rotation between each read, then it requires 93,000 seeks & reads.

    • but when reading randomly, each block will need to be re-sought & read 52 times--because each 4k block contains 52 lines.

      So that's close to 5 million seeks & reads.

    Of course, it won't be quite that bad because your OS will attempt to keep blocks read in it's cache and thereby fulfill subsequent line read requests from previously read blocks from cache. But as the file gets bigger and the contents more disordered, then the odds of it being able to retain blocks long enough for all 52 line requests to be satisfied from 1 read drop off rapidly.

    This is further compounded by the fact that you are writing your records to the output file as you read them, forcing the disk to interleave seeks on the input file with seeks on the output file--unless your /opt directory is on another drive.

    You could probably improve the performance of your code on single disk systems by allocating a large buffer for the output, accumulate the records in that and then write it in one blast. If you try that, make sure that you overwrite the buffer not free and reallocate it.

    Better still would be to allocate two large buffers, accumulate in one, then switch to the other whilst the first is being written. It would require the use of threads or asynchronous IO to make best use of that.

    Ultimately, the fastest external sorts avoid random IO by reading sequential chunks of the file, sorting them in memory, and writing the results to temporary files. These are then merge sorted into the output file. Although the merge sort phase isn't strictly sequential because it requires seeks between the different temporary files, all the records from any one temporary file--and any given 4k block--will be accessed sequentially. That means the system only needs to cache one 4k block per temporary file to avoid having to re-seek & read any given block. A far more likely scenario than being able to cache all the active blocks in a huge file being read randomly.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://815874]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (10)
As of 2014-11-26 23:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My preferred Perl binaries come from:














    Results (176 votes), past polls