Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Sorting a (very) large file

by condar (Initiate)
on Nov 30, 2007 at 19:14 UTC ( #654173=perlquestion: print w/ replies, xml ) Need Help??
condar has asked for the wisdom of the Perl Monks concerning the following question:

Greetings all,

I'm still a relative Perl newbie, and most of my experience/past projects have been simple parsers. Right now, I have a file listing contained in a file in the following format:

Date (MM-DD-YYYY) \t Time (HH:MM PM) \t Size (in bytes, no formatting) \t Full file path (Directory\filename)

So a sample record would be like:

12-25-2005     12:30 PM     350234     C:\someplace\somefile.txt

What I want to do is sort the list by size, preferably in descending order. I've tried the sort function, and a ST sort, but I get an error on my system that I've run out of memory. This may be because the file I'm using is approximately 400 MB, and I only have 1 GB of memory available.

Any help will be most appreciated. Thank you!

For reference, this is what was used for the last attempt:

@in = <INPUTFILE>; @out = map $_->[0] => sort { $a->[1] <=> $b->[1] } map [$_, (split(/\t/))[2]] => @in;

Comment on Sorting a (very) large file
Select or Download Code
Re: Sorting a (very) large file
by moritz (Cardinal) on Nov 30, 2007 at 19:31 UTC
    There are three things you could try.:

    1) The one that will always work is to split up the file in smaller chunks, sort the separately, and then merge them again.

    2) If you work on a unixish system, you could try to abuse you system's sort program, at least GNU sort handles memory limitations automatically by creating temporary files.

    3) Search on CPAN. Maybe there's a module that does what you want? File::Sort looks promising at a first glance (actually CPAN should be your first place to look ;-)

      3) Search on CPAN. Maybe there's a module that does what you want?

      Sort::External

        Y'know, Sort::External is designed for exactly what the original poster wants to do... but few people seem to find it on CPAN.

        Sort::External was my first CPAN distro, and the people on the modules mailing list prevailed upon me to release it it under the academically correct name. I occasionally regret having taken that advice.

        FWIW, here's the definition of "external sort" from nist.gov:

        Any sort algorithm that uses external memory, such as tape or disk, during the sort. Since most common sort algorithms assume high-speed random access to all intermediate memory, they are unsuitable if the values to be sorted don't fit in main memory.
        --
        Marvin Humphrey
        Rectangular Research ― http://www.rectangular.com
Re: Sorting a (very) large file
by jrsimmon (Hermit) on Nov 30, 2007 at 19:42 UTC
    The example you posted appears to be the output of a "dir" command from a dos-like system. Have you considered sorting the data as you capture it ("dir /OS" on winXP)?
Re: Sorting a (very) large file
by samtregar (Abbot) on Nov 30, 2007 at 19:47 UTC
    You probably do have enough memory to sort the file, but you need to be careful about how you handle it. Doing a Schwartzian transform on it like that is creating huge temporary copies, which is probably what's blowing through your memory. You might try something slower that uses less memory:

    @lines = <INPUTFILE>; @lines = sort { (split($a, /\t/))[2] <=> (split($b, /\t/))[2] } @li +nes;

    I believe the most recent Perls have a special in-place sort() optimization when you're sorting an array and assigning it back to itself. (UPDATE: Yup, it was added in v5.8.4)

    -sam

      Or you could try something twice as fast that uses much less memory. In this case I'd likely sort parallel arrays (though it isn't too complicated to do something even faster that uses even less memory, such as fast, flexible, stable sort).

      my @size= map { ( split /\t/, $_ )[2] } @in; my @idx= sort { $size[$a] <=> $size[$b] }, 0..$#in; @in= @in[@idx];

      And I'd check how much paging space is configured. It sounds like that could be increased.

      - tye        

        I really doubt that's going to work in 1GB of RAM on a 400MB input (and don't worry about paging space - if you start paging then any speedup you gained just went bye-bye). I guess it's possible, but this line looks pretty bad to me:

           @in = @in[@idx];

        Or do you happen to know that Perl does that in-place on @in? I guess the only way to really know would be to try it, but unfortunately I do have some real work to do...

        -sam

Re: Sorting a (very) large file
by Limbic~Region (Chancellor) on Nov 30, 2007 at 19:50 UTC
    condar,
    There have been some ideas already from the chatterbox. Here is an idea that I had while I was thinking about it:
    #!/usr/bin/perl use strict; use warnings; my %data; my $file = $ARGV[0] or die "Usage: $0 <input_file>"; open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $ +!"; my $pos = tell $fh; while (<$fh>) { my $size = (split /\t/)[2]; $data{$size} = exists $data{$size} ? $data{$size} . ":$pos" : $pos; $pos = tell $fh; } for my $size (sort {$b <=> $a} keys %date) { for my $pos (split /:/, $data{$size}) { seek($fh, $pos, 0); my $line = <$fh>; print $line; } }

    Note: This code is untested an is not optimal. It was just an idea I had that you might want to try. It will output lines of files with the same size in the same order they appeared in the original file.

    Cheers - L~R

    Inlined get_size() sub which was just a call to split. Also specified keys %date
Re: Sorting a (very) large file
by metaperl (Curate) on Nov 30, 2007 at 19:51 UTC
    I would toss the data in a SQLITE database and let it do the dirty work.
    I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes. How can you say that there is no God in the Church of Emacs? -- David Kastrup [tag://sort,etl,data]
Re: Sorting a (very) large file
by runrig (Abbot) on Nov 30, 2007 at 20:32 UTC
    moritz already mentioned command line sort...here's how to do it (if you are on windows( as jrsimmon believes), go get MSYS):
    > # The file is tab delimited (your time field having spaces shouldn't + matter) > cat tmp.txt 3 1a 2 b 2 2c 3 b 1 3b 1 b > # The char between the quotes is a tab ("\t" works in ksh) > sort -t" " -k3n tmp.txt 1 3b 1 b 3 1a 2 b 2 2c 3 b >
    Update: I do not claim that the above quotes are properly escaped for Windows. Actually I'm not even sure how to escape them in the cmd shell, but this works from perl:
    system('sort',"-t", "\t","-k3n","tmp.txt");
Re: Sorting a (very) large file
by sundialsvc4 (Abbot) on Nov 30, 2007 at 21:22 UTC

    400MB is not a “very large” file. It is “not a very large file at all.”

    What you need to do is use a disk-based sort. There are many of these in CPAN, such as File::Sort. Or, how aboutSort::External? As it happens, “sorting” is one of the most extensively-studied and extensively-implemented algorithms in all of computer science. (The other is “searching,” saith the august Dr. Knuth.) My CPAN-search found 1,929 choices under “Sort.”

    Although virtual-memory sorting is simple and easy to conceptualize, it is not intended for large volumes of data. You're seeing it choke, and fail, trying to do a task that is much larger than it was designed for.

    Disk-oriented tools, of which there are a great many available, will automatically handle the job by splitting the total job into a group of disk-based spill files, then merging those files together for you. An arbitrary amount of data can be sorted in this way and it will be quite fast.

    (Heck, you can even sort data using nothing more than magnetic tape! When all those campy old sci-fi movies needed to film “a computer,” the operators would obligingly start a tape-sort.)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://654173]
Approved by moritz
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (8)
As of 2014-12-22 02:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (110 votes), past polls