PerlMonks  

Sort command equivalent in perl

by sandeepau (Initiate)
on Dec 15, 2011 at 16:17 UTC (#943767=perlquestion)
sandeepau has asked for the wisdom of the Perl Monks concerning the following question:

What is the equivalent of the following Unix sort command in Perl?

    sort -T '\temp' input.txt > output.txt

Can we store temporary files in another directory in Perl while performing a sort?

Replies are listed 'Best First'.
Re: Sort command equivalent in perl
by MidLifeXis (Monsignor) on Dec 15, 2011 at 16:33 UTC

    Why do you think that perl's sort uses temporary files? If you need this type of sort, you will have to break the source data into parts, sort each part, and then merge them manually.

    --MidLifeXis

      Or use Sort::External
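
      For readers who want to try that module, here is a minimal sketch of how it might be used with a chosen temporary directory. The external_sort wrapper and its arguments are made up for illustration, the 64 MB threshold is an arbitrary value, and the option names (mem_threshold, working_dir, outfile, flags) are given as I recall them from the module's documentation, so check them against the version you install.

```perl
use strict;
use warnings;
use Fcntl qw(O_CREAT O_WRONLY O_TRUNC);

# Sort the lines of $in into $out, asking Sort::External to keep its
# temporary files in $tmpdir. The sub name and arguments are hypothetical.
sub external_sort {
    my ($in, $out, $tmpdir) = @_;
    require Sort::External;    # CPAN module, not part of core Perl

    my $sortex = Sort::External->new(
        mem_threshold => 64 * 1024**2,    # spill to disk after ~64 MB (arbitrary)
        working_dir   => $tmpdir,         # where the temporary files go
    );

    open(my $fh, '<', $in) or die "Can't open $in for read: $!\n";
    $sortex->feed($_) while <$fh>;
    close $fh;

    # Write the sorted result straight to $out, overwriting any old copy
    $sortex->finish(outfile => $out, flags => O_CREAT | O_WRONLY | O_TRUNC);
    return;
}
```

      A call such as external_sort('input.txt', 'output.txt', '/some/big/disk/tmp') would then play the role of sort -T from the original question, assuming the temporary directory already exists.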


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

      I would like to sort a very large file, around 5-10 GB in size. However, when I use Perl's sort, the script runs for a very long time and then aborts with an out-of-memory error. My assumption was that it stores intermediate temporary files in the same directory, so storing those temporary files in some other location might resolve the issue. I am just checking whether we can specify a temporary directory for sort, as we can with the Unix command.

        Perl's default sort routine is an in-memory sort routine. The OS sort program is a file-based sort routine that can handle larger data sets. If you want Perl to do the same as the Unix sort program, you will need to implement it yourself (my comment above), or find someone else's implementation and use it (BrowserUK's suggestion above). There are no temp files used by the out-of-the-box sort routine in Perl.

        --MidLifeXis

Re: Sort command equivalent in perl
by TJPride (Pilgrim) on Dec 15, 2011 at 22:24 UTC
    Essentially, you'd code something that (a) splits your file into chunks small enough for Perl to load into memory and sort relatively efficiently and then (b) merges the files using a line-by-line method. Since I'm a glutton for punishment, here's a fully-working script I just spent the last half hour writing.

    perl process.pl in.txt out.txt ./temp 200

    use strict;
    use warnings;
    use File::Copy qw(move);

    die "Arguments = in, out, temp dir, sort max in MB.\n" if @ARGV != 4;
    my ($in, $out, $temp, $max) = @ARGV;

    die "$in does not exist.\n" if !-e $in;
    die "$temp does not exist, or is not a directory.\n" if !-d $temp;
    open(my $fh, '<', $in) or die "Can't open $in for read: $!\n";

    $temp =~ s|/$||;
    $max *= 1024 * 1024;

    my (@b1, @t);
    my $size = 0;
    my $n    = 0;

    while (<$fh>) {
        ### Make sure an unterminated last line cannot fuse with its neighbour
        $_ .= "\n" if !/\n\z/;
        push @b1, $_;
        $size += length $_;
        if ($size >= $max) {
            ### Over limit, write chunk
            writeTemp();
            @b1   = ();
            $size = 0;
        }
    }
    close $fh;

    ### Write whatever's left in buffer (or one empty chunk for an empty input)
    writeTemp() if @b1 || !$n;

    ### Using this so I don't have to write it twice in the code
    sub writeTemp {
        $n++;
        open(my $fho, '>', "$temp/$n.txt")
            or die "Unable to open $temp/$n.txt for write: $!\n";
        @b1 = sort @b1;
        print $fho @b1;
        ### Close explicitly so the chunk is flushed before it is merged or moved
        close $fho or die "Unable to close $temp/$n.txt: $!\n";
        print "$in => $temp/$n.txt ($size)\n";
    }

    ### Merge chunks pairwise until a single file remains
    @t = (1 .. $n);
    while (@t > 1) {
        my $t1 = shift @t;
        my $t2 = shift @t;
        $n++;
        mergeFiles("$temp/$t1.txt", "$temp/$t2.txt", "$temp/$n.txt");
        print "$temp/$t1.txt + $temp/$t2.txt => $temp/$n.txt\n";
        unlink "$temp/$t1.txt", "$temp/$t2.txt";
        push @t, $n;
    }

    ### File::Copy's move avoids shelling out to mv
    move("$temp/$n.txt", $out)
        or die "Unable to move $temp/$n.txt to $out: $!\n";
    print "$temp/$n.txt => $out\n";

    sub mergeFiles {
        my ($f1, $f2, $fo) = @_;
        open(my $fh1, '<', $f1) or die "Unable to open $f1 for read: $!\n";
        open(my $fh2, '<', $f2) or die "Unable to open $f2 for read: $!\n";
        open(my $fho, '>', $fo) or die "Unable to open $fo for write: $!\n";

        my $l1 = <$fh1>;
        my $l2 = <$fh2>;

        ### Test with defined() so a line such as "0" is never taken for EOF
        while (defined $l1 && defined $l2) {
            if ($l1 lt $l2) {
                print $fho $l1;
                $l1 = <$fh1>;
            }
            else {
                print $fho $l2;
                $l2 = <$fh2>;
            }
        }

        ### Copy the remaining tail line by line instead of slurping it,
        ### since it can be as large as a whole merged file
        while (defined $l1) { print $fho $l1; $l1 = <$fh1>; }
        while (defined $l2) { print $fho $l2; $l2 = <$fh2>; }

        close $fho or die "Unable to close $fo: $!\n";
    }

    Could probably be improved at a few points on read speed, but that shouldn't be the primary time sink.

    If we assume 1 GB of spare RAM and a 20 GB input file, you might want to limit it to as little as 200 MB per chunk if the records are relatively short (one number each, for instance). That gives you about 100 chunks, and each chunk might take up to a few hundred seconds to sort initially and merge later on, depending on how slow your computer is. That's a lot of time, but you can just leave it running for an hour or two and come back to it later.
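
    One possible refinement, which is not what the script above does: because chunks are merged two at a time, with about 100 chunks each line gets rewritten several times before it reaches the output. A k-way merge reads all sorted chunks at once and writes each line only once. Below is a minimal sketch of that idea; kway_merge is a hypothetical helper name, and it assumes every line ends in a newline and that the OS allows one open file handle per chunk.

```perl
use strict;
use warnings;

# Merge any number of already-sorted files into $out in a single pass.
# kway_merge is a hypothetical helper, not part of the script above.
sub kway_merge {
    my ($out, @files) = @_;
    open(my $fho, '>', $out) or die "Unable to open $out for write: $!\n";

    # Keep one open handle and one pending line per input file
    my (@fh, @line);
    for my $i (0 .. $#files) {
        open($fh[$i], '<', $files[$i])
            or die "Unable to open $files[$i] for read: $!\n";
        $line[$i] = readline($fh[$i]);
    }

    while (1) {
        # Linear scan for the smallest pending line
        my $min;
        for my $i (0 .. $#line) {
            next if !defined $line[$i];
            $min = $i if !defined $min || $line[$i] lt $line[$min];
        }
        last if !defined $min;    # every input is exhausted

        print $fho $line[$min];
        $line[$min] = readline($fh[$min]);
    }

    close $fho or die "Unable to close $out: $!\n";
    return;
}
```

    The linear scan costs one comparison per chunk for every output line, which is fine for a hundred or so chunks; with many more, a heap keyed on the pending lines would likely be the better choice.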

Re: Sort command equivalent in perl
by davido (Cardinal) on Dec 16, 2011 at 16:43 UTC

    How about this?

    use strict;
    use warnings;
    use File::Copy;
    use Tie::File;

    copy( $ARGV[0], $ARGV[1] ) or die $!;
    my @file;
    tie @file, 'Tie::File', $ARGV[1] or die $!;
    @file = sort @file;
    untie @file;

    Tie::File has low memory requirements: The file's lines are treated as a disk-based array. File::Copy concisely copies the source file to a target filename. Then we tie the target file and sort it.

    I can't vouch for it being particularly fast; I suspect the OS version of sort is better optimized for disk-based sorting.

    Update: Ooh I hate when first instinct is misguided. Well, second instinct really, first instinct is to use the tools that are already written and proven (ie, the Linux sort utility). This muse fails as MidLife pointed out: The sort itself isn't done 'inline', and thus the memory usage will soar. And of course not even Evel Knievel could jump the valley separating the speed of the established tool versus this hack. :)
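
    For completeness, the "use the proven tool" route can be driven from Perl itself. This is a minimal sketch, assuming a Unix-like system whose sort program accepts -T (temporary directory) and -o (output file), as GNU and BSD sort do; system_sort is a made-up wrapper name.

```perl
use strict;
use warnings;

# Run the system's sort program with its temporary files in $tmpdir.
# system_sort is a hypothetical wrapper name.
sub system_sort {
    my ($in, $out, $tmpdir) = @_;

    # Force byte-wise collation so the order matches Perl's default sort
    local $ENV{LC_ALL} = 'C';

    # The list form of system() bypasses the shell, so paths need no quoting
    system('sort', '-T', $tmpdir, '-o', $out, $in) == 0
        or die "sort failed with status $?\n";
    return;
}
```

    Calling system_sort('input.txt', 'output.txt', '/some/big/disk/tmp') is then roughly the original sort -T command line, with the redirect replaced by -o.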


    Dave

      Will the sort still happen in memory? If it is not in memory, I do not see how this can do the @file = sort @file without data loss. If it is in memory, we are back to the original problem.

      +1 for the unique approach.

      --MidLifeXis

        You're right. It wouldn't work for files greater than memory.

        I thought it might for a while because of the in-place sort optimisation that came in somewhere in 5.8.x, which means that:

        @ar = sort @ar;

        gets converted to sort \@ar; and sorts in place rather than copying the array to the stack and back.

        But looking at the code, it doesn't work for tied arrays:

        /* optimiser converts "@a = sort @a" to "sort \@a";
         * in case of tied @a, pessimise: push (@a) onto stack, then assign
         * result back to @a at the end of this function */

      I can't vouch for it being particularly fast;

      That is really, really, really, really(*) S-L-O-W-!-!-!-!.

      (*)And I do mean re-ea-al-ll-ly sl-lo-ow! Like pre-climate change glacial man.

