PerlMonks  

Re: sorting very large text files

by Ea (Chaplain)
on Dec 21, 2009 at 09:47 UTC ([id://813675])


in reply to sorting very large text files

As salva just said, I'd be tempted to change the way you access the data. Naively (because this is not my bread and butter), I'd either create an index file mapping field -> file + line number and sort that, or pre-sort the records into files of similar keys so that there's less work left to do on each one.

Not that these are particularly brilliant ideas; I'm just trying to illustrate that there's more than one way to crack this nut and that it's worth looking in different directions.
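To make the index idea concrete, here is a minimal sketch. It assumes tab-separated records with the sort key in the first column (the thread does not state the record layout), and `build_index` and `sort_index` are hypothetical names. It stores the byte offset of each line rather than the line number, because an offset can be handed straight to `seek` later. The in-memory string only stands in for the real file.

```perl
use strict;
use warnings;

# Build an index of [key, byte offset] pairs, one per line of the file.
# $key_col is the zero-based tab-separated column to sort on (an assumption;
# the original thread does not specify the record layout).
sub build_index {
    my ($fh, $key_col) = @_;
    my @index;
    while (1) {
        my $offset = tell($fh);      # where the next line starts
        my $line   = <$fh>;
        last unless defined $line;
        chomp $line;
        my @fields = split /\t/, $line, -1;
        push @index, [ $fields[$key_col], $offset ];
    }
    return \@index;
}

# Sort the index by key only; the records themselves are not moved.
sub sort_index {
    my ($index) = @_;
    return [ sort { $a->[0] cmp $b->[0] } @$index ];
}

# Tiny in-memory stand-in for the real, very large file.
my $data = "pear\t3\napple\t1\nmango\t2\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my $sorted = sort_index(build_index($fh, 0));
close $fh;

print join(' ', map { $_->[0] . '@' . $_->[1] } @$sorted), "\n";
# prints: apple@7 mango@15 pear@0
```

The index is much smaller than the data, so it may fit in memory even when the records do not, but it still has to be turned back into a sorted file afterwards.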

best of luck,

perl -e 'print qq(Just another Perl Hacker\n)' # where's the irony switch?

Re^2: sorting very large text files
by salva (Canon) on Dec 21, 2009 at 11:36 UTC
    I'd either create an index file of field -> file+line number and sort on that

    Unfortunately it is not as easy as that. If you sort an index containing just the sorting keys and the offsets, you will need a last step where you combine the sorted index with the original file to create the final full sorted output file.

    Doing this step the straightforward way, just following the index and seeking into the original file to read every line, would be very, very inefficient. Roughly (as estimated by BrowserUk): 165e6 records * 10 ms per seek = about 19 days!
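The arithmetic behind that estimate can be checked in a few lines. The record count and the 10 ms seek time are the figures quoted above; the variable names are only for illustration, and real seek times depend on the hardware.

```perl
use strict;
use warnings;

my $records      = 165e6;    # record count quoted in the thread
my $seek_seconds = 0.010;    # 10 ms per random seek, as quoted

# One random seek per record when following the sorted index naively.
my $total_seconds = $records * $seek_seconds;
my $days          = $total_seconds / 86_400;   # seconds in a day

printf "%.0f seconds = %.1f days\n", $total_seconds, $days;
# prints: 1650000 seconds = 19.1 days
```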

    A workaround is to create the final file in several passes, reading the original file sequentially and generating one slice of the output in every pass... not so easy!
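One possible reading of that multi-pass workaround, as a rough sketch only: take the sorted index a slice at a time, scan the original file from start to end, keep just the lines belonging to the current slice in memory, then write them out in index order. The function name `write_sorted_in_slices`, the slice size and the in-memory test data are all assumptions for illustration, not something taken from the thread.

```perl
use strict;
use warnings;

# Copy the records of $in_fh to $out_fh in the order given by @$sorted_offsets,
# holding at most $slice_size records in memory at any time.
# Returns the number of sequential passes made over the input.
sub write_sorted_in_slices {
    my ($in_fh, $out_fh, $sorted_offsets, $slice_size) = @_;
    my $passes = 0;
    for (my $start = 0; $start < @$sorted_offsets; $start += $slice_size) {
        my $end = $start + $slice_size - 1;
        $end = $#$sorted_offsets if $end > $#$sorted_offsets;
        my @slice  = @{$sorted_offsets}[ $start .. $end ];
        my %wanted = map { $_ => undef } @slice;   # offsets needed in this pass

        # Sequential scan: no random seeks, only one rewind per pass.
        seek($in_fh, 0, 0) or die "seek failed: $!";
        while (1) {
            my $offset = tell($in_fh);
            my $line   = <$in_fh>;
            last unless defined $line;
            $wanted{$offset} = $line if exists $wanted{$offset};
        }

        # Emit this slice in sorted-index order.
        print {$out_fh} $wanted{$_} for @slice;
        $passes++;
    }
    return $passes;
}

# Tiny in-memory stand-in: offsets listed in key order (apple, mango, pear).
my $data           = "pear\t3\napple\t1\nmango\t2\n";
my @sorted_offsets = (7, 15, 0);
my $out_text       = '';
open my $in,  '<', \$data     or die "cannot open input: $!";
open my $out, '>', \$out_text or die "cannot open output: $!";
my $passes = write_sorted_in_slices($in, $out, \@sorted_offsets, 2);
close $out;
close $in;

print $out_text;
```

With N records and room for S records in memory this takes ceil(N / S) full reads of the original file, so it trades random seeks for repeated sequential scans.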
