PerlMonks
Sorting Gigabytes of Strings Without Storing Them

by neversaint (Deacon)
on Dec 22, 2008 at 14:59 UTC ( #732102=perlquestion )

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
I have a set of strings with up to 2GB lines.
Is there a memory-efficient way to sort these strings?
# Strings below, up to > 2GB lines
__DATA__
AAACGAGAAGTAATATCAGTATCGTATGCTTCAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
TTTTCGCCTTCCGGGTGCCGCTGGGGTCTTTCTC
GCACACTCGGAGCCCGGGGAGCCAGAGGAAACAA
GATCATGACACAGTTGATAAAATTGTTGTTCAGA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACC
AAACGAGAAGTAATATCAGTATCGTATGCTTCAC
AAACGAGAAGTAATATCAGTATCGTATGCTTCGA
AAACGAGAAGTAATATCAGTATCGTATGTTTCAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATA
I can always store them in an array and then apply the standard sort function, but that classical approach runs into memory problems.

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: Sorting Gigabytes of Strings Without Storing Them
by moritz (Cardinal) on Dec 22, 2008 at 15:07 UTC
    The standard unix sort command automatically writes temp files when its memory usage grows too large, so maybe that will work for you.

    You could also take a look at Sort::External. From its documentation:

    Problem: You have a list which is too big to sort in-memory.

    Solution: "feed, finish, and fetch" with Sort::External, the closest thing to a drop-in replacement for Perl's sort() function when dealing with unmanageably large lists.
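    A minimal sketch of that feed/finish/fetch cycle, following the module's documented interface (the demo file name and the mem_threshold value are arbitrary stand-ins, not from the thread):

```perl
use strict;
use warnings;
use Sort::External;

# A tiny demo file stands in for the real multi-gigabyte input.
my $file = 'external_demo.txt';
open my $mk, '>', $file or die "$file: $!";
print $mk "$_\n" for qw( TTTT AAAA GGGG CCCC );
close $mk;

# Spill sorted runs to temp files once the in-memory buffer passes
# this many bytes (2**24 here is just an example value).
my $sortex = Sort::External->new( mem_threshold => 2**24 );

open my $in, '<', $file or die "$file: $!";
$sortex->feed($_) while <$in>;    # feed: hand lines to the sorter
close $in;

$sortex->finish;                  # finish: flush and merge the runs

while ( defined( my $line = $sortex->fetch ) ) {    # fetch: sorted order
    print $line;
}
```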

Re: Sorting Gigabytes of Strings Without Storing Them
by DStaal (Chaplain) on Dec 22, 2008 at 15:05 UTC

    Honestly, the best way if you are on unix/linux is the standard 'sort' utility.

    And I have no idea what you are doing, but if you haven't done so already, take a look at Bioperl.
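    For GNU sort in particular, the memory-buffer and temp-directory knobs might look like this (flag names per GNU coreutils sort; the demo file stands in for the real input):

```shell
# Demo input (stand-in for the real multi-gigabyte file).
printf 'TTTT\nAAAA\nGGGG\nCCCC\n' > reads.txt

# LC_ALL=C forces plain byte comparison, which is what you want for
# A/C/G/T data and much faster than locale-aware collation.
# -S caps the in-memory buffer, -T picks where temp runs are spilled,
# -o writes the result file (safe even if it overwrites the input).
LC_ALL=C sort -S 1G -T /tmp -o sorted.txt reads.txt

cat sorted.txt    # AAAA, CCCC, GGGG, TTTT, one per line
```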

Re: Sorting Gigabytes of Strings Without Storing Them
by tilly (Archbishop) on Dec 22, 2008 at 16:26 UTC
    I have to disagree with everyone. Standard "lists are too large" approaches such as Sort::External are coded to handle the case of too many lines, not lines that are too large. I would expect this to be true of the Unix sort utility as well, though it makes sense to try it to verify that.

    Assuming that that doesn't work, what you need to do is use read to scan the file and find the start of every line (each is one byte past a newline). Then you sort this array of offsets with a comparison function that does a string comparison between what is in the file at the two locations. You of course don't want to pull in the full line to do that; instead use seek and read to move to the right spots, pull a block from each, and compare them, repeating as need be. Then you take the sorted array and write out a sorted file by taking each offset, extracting the string in pieces, and writing it to the output.

    This is kind of complex, and I'd code you an explanatory example if I were not currently typing one-handed due to a hand injury. :-(

    Update: As BrowserUk notes, your question is unclear. I read it as lines up to 2 GB. If that is wrong, then this answer is inappropriate.
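    A rough sketch of the approach tilly describes, with a tiny demo file standing in for the real input (the file name, block size, and helper names are illustrative, not from the thread):

```perl
use strict;
use warnings;

# A tiny demo file stands in for the real multi-gigabyte input.
my $file = 'offsets_demo.txt';
open my $mk, '>', $file or die "$file: $!";
print $mk "TTTT\nAAAA\nGGGG\nCCCC\n";
close $mk;

open my $fh, '<', $file or die "$file: $!";

# Pass 1: record the byte offset of the start of every line.
my @offsets;
my $pos = 0;
while ( my $line = <$fh> ) {
    push @offsets, $pos;
    $pos += length $line;
}

my $BLOCK = 4096;    # bytes compared per seek+read round

# Fetch up to $BLOCK bytes of the line starting at $off, skipping the
# first $skip bytes and stopping at the line's newline.
sub block_at {
    my ( $off, $skip ) = @_;
    seek $fh, $off + $skip, 0 or die "seek: $!";
    read $fh, my $buf, $BLOCK;
    $buf =~ s/\n.*//s;    # never compare past the end of the line
    return $buf;
}

# Compare two lines block by block, never holding a whole line.
sub by_content {
    my ( $x, $y ) = @_;
    my $skip = 0;
    while (1) {
        my $xb  = block_at( $x, $skip );
        my $yb  = block_at( $y, $skip );
        my $cmp = $xb cmp $yb;
        return $cmp if $cmp;
        return 0 if length($xb) < $BLOCK;    # both lines have ended
        $skip += $BLOCK;
    }
}

my @sorted = sort { by_content( $a, $b ) } @offsets;

# Emit the sorted file (a real version would copy long lines out in
# blocks here as well, rather than calling readline).
for my $off (@sorted) {
    seek $fh, $off, 0;
    print scalar readline $fh;
}
```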

Re: Sorting Gigabytes of Strings Without Storing Them
by BrowserUk (Pope) on Dec 22, 2008 at 17:24 UTC
    I have a set of strings with up to 2GB lines.

    Is that

    1. Individual lines of 2 GB characters?
    2. Or 2GB of data split into lines of ~34 characters?
    3. Or 2 billion lines each of ~34 characters?

    An appropriate solution could be dependent upon the distinction.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Sorting Gigabytes of Strings Without Storing Them
by grinder (Bishop) on Dec 22, 2008 at 21:07 UTC

    One approach to save memory would be to pack the strings using a custom mapping (A => 0b00, C => 0b01, G => 0b10, T => 0b11). This allows you to store 4 characters in an 8-bit byte, thus consuming 9 bytes instead of 34. You probably want to do this anyway even if you sort with an external utility, since you'll improve bandwidth efficiency the longer you can keep the data in this compressed format.

    Sadly your lengths are just two characters too many to hold everything in a nice 64-bit quantity.
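    A minimal sketch of that 2-bit packing using Perl's vec (the helper names are illustrative; note that this naive layout compresses but does not by itself keep packed bytes in lexical order, which appears to be why salva's code below reverses the strings before packing):

```perl
use strict;
use warnings;

# Pack A/C/G/T into 2 bits each: 4 bases per byte.
my %code = ( A => 0, C => 1, G => 2, T => 3 );
my @base = sort { $code{$a} <=> $code{$b} } keys %code;    # A C G T

sub pack_seq {
    my ($seq) = @_;
    my $packed = '';
    my $i = 0;
    for my $b ( split //, $seq ) {
        vec( $packed, $i++, 2 ) = $code{$b} // die "bad base '$b'";
    }
    return $packed;    # ceil(length/4) bytes
}

sub unpack_seq {
    my ( $packed, $len ) = @_;
    return join '', map { $base[ vec( $packed, $_, 2 ) ] } 0 .. $len - 1;
}

my $seq = 'GATCATGACACAGTTGATAAAATTGTTGTTCAGA';    # 34 bases
my $p   = pack_seq($seq);
printf "%d bases -> %d bytes\n", length $seq, length $p;    # 34 -> 9
die "round trip failed" unless unpack_seq( $p, length $seq ) eq $seq;
```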

    • another intruder with the mooring in the heart of the Perl

Re: Sorting Gigabytes of Strings Without Storing Them
by salva (Abbot) on Dec 23, 2008 at 12:30 UTC
    Pack the data into a big vector and then sort it in place using Sort::Packed. Memory requirements drop to ~1/4 of the data set (two bits per base instead of one byte).
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Sort::Packed qw(sort_packed);

    my $packed = '';
    my ($len);
    my %val = (A => 0, C => 1, G => 2, T => 3);
    my %rev = reverse %val;

    sub compress {
        my @data = split //, scalar(reverse shift);
        my $out = '';
        for (my $i = 0; $i < @data; $i++) {
            my $bits = $val{$data[$i]};
            defined $bits or die "bad data";
            vec($out, $i, 2) = $bits;
        }
        scalar reverse $out;
    }

    sub decompress {
        my $data = reverse shift;
        my $len = shift;
        my $out;
        for (my $i = 0; $i < $len; $i++) {
            $out .= $rev{vec($data, $i, 2)};
        }
        scalar reverse $out;
    }

    while (<DATA>) {
        chomp;
        ($len ||= length) == length or die "bad data";
        $packed .= compress $_;
    }

    my $bytes = int(($len * 2 + 7) / 8);
    my $n = length($packed) / $bytes;
    sort_packed "C$bytes" => $packed;
    for (my $i = 0; $i < $n; $i++) {
        print decompress(substr($packed, $i * $bytes, $bytes), $len), "\n";
    }

    __DATA__
    AAACGAGAAGTAATATCAGTATCGTATGCTTCAA
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    TTTTCGCCTTCCGGGTGCCGCTGGGGTCTTTCTC
    GCACACTCGGAGCCCGGGGAGCCAGAGGAAACAA
    GATCATGACACAGTTGATAAAATTGTTGTTCAGA
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACA
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACC
    AAACGAGAAGTAATATCAGTATCGTATGCTTCAC
    AAACGAGAAGTAATATCAGTATCGTATGCTTCGA
    AAACGAGAAGTAATATCAGTATCGTATGTTTCAA
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATA
    Preallocating the vector will also further reduce memory consumption.

      The problem with that is it takes far longer (3.5 minutes vs. 10 seconds) than using the system sort utility:

      [12:47:29.37] c:\test>junk4 ACGT34.dat >nul
      [12:50:49.45] c:\test>
      [12:53:00.73] c:\test>sort ACGT34.dat /O nul
      [12:53:09.51] c:\test>

      And that is just a million 34-char strings, not "2GB". Admittedly, most of the time is spent reading, packing and unpacking the data, rather than in your fine sort routine which takes less than half a second to do its work.

      Actually, part of the problem is building up that big scalar piecemeal. If you watch the memory usage as the while loop repeatedly expands $packed, you'll see transient spikes: those are where Perl/the CRT has to go to the OS to grab an ever larger chunk of RAM into which to copy the slowly expanding scalar, freeing the old chunk once it has been copied over. That constant allocation, reallocation and copying really hampers the in-memory approach.

      You can knock around 30 seconds off the 3.5 minutes by preallocating the memory. In this version of your code, I use a RAM file to allocate a chunk bigger than required, populate it by writing to the RAM file, and then truncate the string to its final size using chop, which avoids the seesaw effect on the memory allocation:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Sort::Packed qw(sort_packed);

      my $packed = '';
      open RAM, '>', \$packed;
      seek RAM, 10e6, 0;
      print RAM chr(0);
      seek RAM, 0, 0;

      my ($len);
      my %val = (A => 0, C => 1, G => 2, T => 3);
      my %rev = reverse %val;

      sub compress {
          my @data = split //, scalar(reverse shift);
          my $out = '';
          for (my $i = 0; $i < @data; $i++) {
              my $bits = $val{$data[$i]};
              defined $bits or die "bad data";
              vec($out, $i, 2) = $bits;
          }
          scalar reverse $out;
      }

      sub decompress {
          my $data = reverse shift;
          my $len = shift;
          my $out;
          for (my $i = 0; $i < $len; $i++) {
              $out .= $rev{vec($data, $i, 2)};
          }
          scalar reverse $out;
      }

      while (<>) {
          chomp;
          ($len ||= length) == length or die "bad data";
          print RAM compress $_;
      }
      close RAM;
      chop $packed while length($packed) > $. * 9;

      my $bytes = int(($len * 2 + 7) / 8);
      my $n = length($packed) / $bytes;
      sort_packed "C$bytes" => $packed;
      for (my $i = 0; $i < $n; $i++) {
          print decompress(substr($packed, $i * $bytes, $bytes), $len), "\n";
      }

      __END__
      [13:50:49.29] c:\test>junk4 ACGT34.dat >nul
      [13:53:44.53] c:\test>

      The result is that the overall time taken is reduced to just under 3 minutes, which leaves the bulk of the time spent packing and unpacking the data. And I cannot see any easy way of speeding that up.

      Maybe Perl needs a pack template for dealing with genomic data? Trouble is, there are several variations. As well as ACGT, they also use forms which contain 'N' 'X' and a few other characters for different situations.


        Just off the top of my head, I'd probably try replacing the 'pack' stuff by producing whole bytes at a time. Something like:

        use Algorithm::Loops qw< NestedLoops >;

        my @bases = qw< A C G T >;
        my @quad  = map { join '', @$_ } NestedLoops( [ ( \@bases ) x 4 ] );
        my %byte;
        @byte{ @quad } = map pack( "C", $_ ), 0 .. $#quad;

        my $carry = '';
        while ( <> ) {
            chomp;
            substr( $_, 0, 0, $carry );
            my $pack = '';
            s/(....)/ $pack .= $byte{$1}; '' /ge;
            $carry = $_;
            print RAM $pack;
        }
        print RAM $byte{ substr( $carry . 'AAA', 0, 4 ) } if $carry;

        But I haven't looked at the rest of this thread recently nor actually tried my suggestions. I can think of lots of different ways to pull out 4 bases at a time and some ways might have a noticeable impact on speed.

        - tye        

Re: Sorting Gigabytes of Strings Without Storing Them
by eye (Chaplain) on Dec 23, 2008 at 08:43 UTC
    Not to be argumentative, but are you sure you need to sort this list?

    If your goal were to identify duplicates in the list, you could take a hash (MD5, not %hash) of each list element, sort the hashes, and look for duplicates in the hash space. A hash function like MD5 or SHA1 should reliably distinguish 2 GB strings, but if you do find duplicates, you could always verify them from the primary data.

    If these strings are expected to be dissimilar, it may suffice to sort them based on their first 1000 characters and then use a separate procedure to deal with cases where the first 1000 characters are identical.
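    A sketch of that digest-then-verify idea with the core Digest::MD5 module (the demo file and its contents are stand-ins for the real data):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# A tiny demo file stands in for the real input.
my $file = 'md5_demo.txt';
open my $mk, '>', $file or die "$file: $!";
print $mk "AAA\nCCC\nAAA\nGGG\n";
close $mk;

# Pass 1: keep a small digest per line instead of the line itself;
# a repeated digest marks a *candidate* duplicate.
my %seen;          # digest => offset of the first line with that digest
my %candidates;    # digest => [offsets of later lines with that digest]

open my $fh, '<', $file or die "$file: $!";
my $pos = 0;
while ( my $line = <$fh> ) {
    my $d = md5_hex($line);
    if ( exists $seen{$d} ) { push @{ $candidates{$d} }, $pos }
    else                    { $seen{$d} = $pos }
    $pos += length $line;
}

# Pass 2: confirm candidates against the primary data, since a digest
# match is overwhelming but not absolute evidence of equality.
for my $d ( keys %candidates ) {
    seek $fh, $seen{$d}, 0;
    my $first = readline $fh;
    for my $off ( @{ $candidates{$d} } ) {
        seek $fh, $off, 0;
        print "duplicate at offset $off\n" if readline($fh) eq $first;
    }
}
```

    For lines that are themselves gigabytes long, the verification pass would compare in blocks via seek and read rather than readline, as tilly describes above.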

    This seems like a time to step back and think about the larger goal and any a priori knowledge of the data to be sorted.

Node Type: perlquestion [id://732102]
Approved by moritz