Compressing String for Printing

by neversaint (Deacon)
on Dec 25, 2008 at 15:37 UTC

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,

What's the Perl way to compress a string down to something much shorter, say 2~3 characters/digits?
In particular, I want to compress strings of length >= 30 characters, e.g.
ACGATACGGCGACCACCGAGATCTACACTCTTCC
The reason is that there are millions of such strings that I want to print out (after some processing), and I need to save disk space.

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: Compressing String for Printing
by Corion (Pope) on Dec 25, 2008 at 16:22 UTC

    What have you tried so far? Have you looked at the gzip or bzip2 programs? You can compress your file with them and then read from your file by opening a pipe to them:

    my $packer = 'bzip2';
    my $file   = 'data.txt.bz2';
    open my $fh, "$packer -cd $file |"
        or die "Couldn't decompress '$file': $!/$?";
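Since the goal in the question is to print millions of strings while saving disk space, the write side matters as much as the read side. Here is a minimal sketch of both directions. Note that it swaps in a different technique: instead of piping through an external bzip2, it uses the core IO::Compress::Gzip and IO::Uncompress::Gunzip modules (so gzip format, not bzip2). The helper names write_compressed and read_compressed are made up for illustration.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw($GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: write one sequence per line to a gzip file.
sub write_compressed {
    my ($path, $lines) = @_;
    my $out = IO::Compress::Gzip->new($path)
        or die "Couldn't open '$path' for writing: $GzipError";
    $out->print("$_\n") for @$lines;
    $out->close or die "Couldn't close '$path': $GzipError";
    return;
}

# Hypothetical helper: read the sequences back, one per line.
sub read_compressed {
    my ($path) = @_;
    my $in = IO::Uncompress::Gunzip->new($path)
        or die "Couldn't open '$path' for reading: $GunzipError";
    my @lines;
    while (defined(my $line = $in->getline)) {
        chomp $line;
        push @lines, $line;
    }
    $in->close;
    return @lines;
}
```

Something like write_compressed('data.txt.gz', \@sequences) would then replace a plain print loop. The resulting file should also be readable with the gzip command-line tool, so the pipe approach above still works on it with $packer set to 'gzip'.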

    Alternatively, you could encode each of the four characters into two bits, thus storing four characters per byte. I guess this approach won't be more efficient space-wise than the gzip or bzip2 approach, but it retains the ability to do random reading in your file:

    use strict;

    # Two bits per base
    my %charmap = (
        A => '00',
        C => '01',
        G => '10',
        T => '11',
    );

    my $string = 'GATTACA';
    $string =~ s/(.)/$charmap{$1}/ge;
    print "$string\n";

    my $compressed = pack 'b*', $string;
    print "$compressed\n";
    printf "%d bytes\n", length $compressed;

    # now use vec() to get at the single parts of $compressed
    # (unpack returns the bit string zero-padded to a whole byte)
    my $decompressed = unpack 'b*', $compressed;
    print "$decompressed\n";
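The vec() remark in the code above can be made concrete. What follows is only a sketch of one way to do it, with made-up helper names (encode_dna, base_at, decode_dna). It uses vec() for both packing and unpacking, because as far as I can tell vec() orders the bits inside each 2-bit element differently from the '10'/'01' strings fed to pack 'b*', so the two encodings are best not mixed. It also assumes the sequence contains nothing but A, C, G and T.

```perl
use strict;
use warnings;

my @BASES = qw(A C G T);
my %CODE;
@CODE{@BASES} = (0 .. 3);    # A => 0, C => 1, G => 2, T => 3

# Hypothetical helper: pack a sequence at 2 bits per base.
sub encode_dna {
    my ($seq) = @_;
    my $packed = '';
    my $i = 0;
    for my $base (split //, $seq) {
        die "Unexpected character '$base'" unless exists $CODE{$base};
        vec($packed, $i++, 2) = $CODE{$base};
    }
    return $packed;
}

# Hypothetical helper: random access to the base at index $i (0-based).
sub base_at {
    my ($packed, $i) = @_;
    return $BASES[ vec($packed, $i, 2) ];
}

# Hypothetical helper: the original length must be supplied, because the
# last byte may contain padding that would otherwise decode as extra 'A's.
sub decode_dna {
    my ($packed, $length) = @_;
    return join '', map { base_at($packed, $_) } 0 .. $length - 1;
}

my $example = 'ACGATACGGCGACCACCGAGATCTACACTCTTCC';
my $bits    = encode_dna($example);
printf "%d bases -> %d bytes\n", length $example, length $bits;
print decode_dna($bits, length $example), "\n";
```

The 34-base string from the question needs 68 bits, so it packs into 9 bytes. That is a long way from the 2~3 characters asked for, but it is the floor for a lossless fixed-width encoding of four symbols. The length has to be stored somewhere alongside the packed data unless all records have the same length.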

    But have you looked at BioPerl? I'm pretty sure that they have support for that stuff.

      Corion's suggestion to use a compression program is a good one. If the sequences represented by the strings are from coding regions, it is likely that some sub-sequences (codons) occur with much higher frequency than others (e.g., stop codons). In that case, certain compression algorithms can potentially achieve better compression than the 4:1 you'd get with bit packing.
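Whether a general-purpose compressor actually beats 4:1 depends on the data, so it is worth measuring on a real sample. The sketch below is one rough way to compare the sizes, using the core IO::Compress::Gzip module on an in-memory buffer. The helper name size_report is made up, and the sample input (one sequence repeated 1000 times) is deliberately artificial: it is far more repetitive than real reads, so its gzip figure says nothing about what to expect in practice.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Hypothetical helper: compare three ways of storing a list of sequences.
sub size_report {
    my ($seqs) = @_;

    # Plain text, one sequence per line
    my $text = join '', map { "$_\n" } @$seqs;

    # The same text, gzip-compressed in memory
    my $gzipped;
    gzip \$text => \$gzipped
        or die "gzip failed: $GzipError";

    # 2 bits per base: four bases per byte, rounded up per sequence
    my $packed = 0;
    $packed += int((length($_) + 3) / 4) for @$seqs;

    return (
        raw    => length $text,
        packed => $packed,
        gzip   => length $gzipped,
    );
}

# Artificial sample; substitute a slice of the real data here.
my %size = size_report([ ('ACGATACGGCGACCACCGAGATCTACACTCTTCC') x 1000 ]);
printf "%-6s %d bytes\n", $_, $size{$_} for qw(raw packed gzip);
```

The packed figure ignores any per-record length or separator that a real file format would need, so it is a lower bound for the bit-packing approach rather than an exact file size.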
