neversaint has asked for the wisdom of the Perl Monks concerning the following question:
Dear Masters,
This code below takes ATCG strings as input,
pack them into bit and then print them out again.
But somehow after decompressing it doesn't
give the original string. Why is that?
my %charmap = (
A => '00',
C => '01',
G => '10',
T => '11',
);
my %digmap = (
'00' => "A",
'01' => "C",
'10' => "G",
'11' => "T",
);
my $string = $ARGV[0] || 'GATTACCCC';
$string =~ s/(.)/$charmap{$1}/ge;
my $compressed = pack 'b*', $string;
print "COMP: $compressed\n";
printf "%d bytes\n", length
$compressed;
my @data;
push @data, $compressed;
# now use vec() to get at the single
# parts of $compressed
my $decompressed = unpack 'b*', $compressed;
$decompressed =~ s/(..)/$digmap{$1}/ge;
print "$decompressed\n";
---
neversaint and everlastingly indebted.......
Re: Unpack Fail to Decompress String?
by BrowserUk (Patriarch) on Jan 09, 2009 at 16:26 UTC
|
As oshalla pointed out, pack 'b*' pads the string to the nearest byte.
In order to know how much you need to trim off after you've unpack'd it, you need to also remember how many binary bits need trimming off the unpacked binary; 0, 2, 4, 6.
You can do that by prefixing the packed data with an extra byte that contains that information:
Update: My math was completely wrong and only worked for the example through blind luck. I've cleaned that up and wrapped the conversions in subroutines:
my %charmap = (
A => '00',
C => '01',
G => '10',
T => '11',
);
sub packACGT {
( my $in = shift ) =~ s/(.)/$charmap{$1}/ge;
print length $in, ':', length( $in ) % 8;
pack 'Cb*', ( 0, 6, 4, 2 )[ length( $in ) % 8 / 2 ], $in;
}
my %digmap = (
'00' => "A",
'01' => "C",
'10' => "G",
'11' => "T",
);
sub unpackACGT {
my( $len, $out ) = unpack 'Cb*', shift;
print $len;
substr( $out, -$len ) = '' if $len;
$out =~ s/(..)/$digmap{$1}/ge;
return $out;
}
my $string = $ARGV[0] || 'GATTACCCC';
print length( $string ), ':', $string;
my $compressed = packACGT( $string );
print length( $compressed ), ':', "'$compressed'";
my $decompressed = unpackACGT( $compressed );
print length( $decompressed ), ':', $decompressed;
It would be a nice enhancement to pack and unpack to allow the n/A* syntax to deal with this.
So that pack 'c/b*', ... and unpack 'c/b*', ... would work.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Unpack Fail to Decompress String?
by gone2015 (Deacon) on Jan 09, 2009 at 16:07 UTC
|
Your original string contains 9 characters. The form you are compressing to packs 4 characters to a byte, but with no way of knowing how many characters in the last byte are significant.
Update: FWIW you can speed this up considerably by banging 4 characters at a time -- code enclosed, below. However, in my experience a little bit of C will do a lot better -- particularly when reading big chunks of file and getting the C to process many strings at a time.
| [reply] [Watch: Dir/Any] [d/l] |
Re: Unpack Fail to Decompress String?
by setebos (Beadle) on Jan 09, 2009 at 16:29 UTC
|
Why would you bother with hash maps when you can:
$_ = unpack("b*", "ASD");
print(pack("b*", $_));
| [reply] [Watch: Dir/Any] [d/l] |
Re: Unpack Fail to Decompress String?
by jethro (Monsignor) on Jan 09, 2009 at 16:37 UTC
|
Solution might be to store the length modulo 4 in the first 2 bits of the first byte of the compressed string (or in the first 2 bits of the last byte). | [reply] [Watch: Dir/Any] |
|
I'd favour putting a count of the number of significant characters in the last byte in the last 2 bits of the that byte. (Which may require a zero last byte.)
The advantage is that it makes comparisions slightly easier...
strings which compress to the same number of bytes can be compared directly.
strings which compress to different numbers of bytes can be compared directly up to the last byte of the shorter. If that doesn't settle the matter, then you need to use the last bits of the shorter to select a mask...
| [reply] [Watch: Dir/Any] |
|
I'd favour putting a count of the number of significant characters in the last byte in the last 2 bits of the that byte. (Which may require a zero last byte.)
Sorry guys, but I don't see any way of making a trailing indicator allow for string comparison.
- If you do this way, the bigger strings sort before shorter strings:
input Using a significant bit count
in the last 2 bits of the last byte
ACGT 1b 00
ACGTA 1b 01
ACGTAA 1b 02
ACGTAAA 1b 03
ACGTAAAA 1b 00 00
ACGTAAAAA 1b 00 01
- This way is a non starter, things sort all over the place:
input Using the significant bit count
in the first byte
ACGT 00 1b 00
ACGTA 01 1b 00
ACGTAA 02 1b 00
ACGTAAA 03 1b 00
ACGTAAAA 00 1b 00 00
ACGTAAAAA 01 1b 00 00
- With a trailing full length, the last comes first and the first last, with those in the middle keeping their correct relative ordering:
input packed with trailing
network (BE) length
ACGT 1b 00 00 00 04
ACGTA 1b 00 00 00 00 05
ACGTAA 1b 00 00 00 00 06
ACGTAAA 1b 00 00 00 00 07
ACGTAAAA 1b 00 00 00 00 08
ACGTAAAAA 1b 00 00 00 00 00 09
The only method that I can see working is prefixing the output length in network (BE) order to the front:
input packed with leading
network (BE) length
ACGT 00 00 00 04 1b
ACGTA 00 00 00 05 1b 00
ACGTAA 00 00 00 06 1b 00
ACGTAAA 00 00 00 07 1b 00
ACGTAAAA 00 00 00 08 1b 00
ACGTAAAAA 00 00 00 09 1b 00
Update: Even that doesn't work! These two sort in the wrong order.
in: ACGTACGTACGTACGTATT packed( 9): 000000131b1b1b1b3c
in: ACGTACGTACGTACGTAAAA packed(10): 000000141b1b1b1b0000
And the penalty of that for what is intended as a compression mechanism is prohibitive for short sequences. You might lower the overhead by using a short instead of a long, but 64k is not enough for real genome work.
That said, the sorting and comparing of packed ACGT (other than simple equality), is a fairly non-useful thing anyway. Sequence and subsequence representations don't have any intrisic ordering.
The more frequent operation is to search one sequence for the presence of another (sub)sequence, and doing that with packed sequences means you're only checking every fourth position, rather than every position. Performing shifts or rotates on bitstring greater than register size is a prohibitively expensive operation. It far outweights any gains you might get from searching quarter sized strings, as you have to perform 4 searches anyway, and you need the expensive shift operation inbetween each search.
I think the only useful use for packed ACGT (2bit) format is to reduce storage overhead. It's almost certainly quicker to just unpack the sequences for searching. In which case, the prefix byte of the significant bits in the last byte is a good compromise between compression level and implementation ease.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
|
|
|
string length +1 modulo 4 is the same as the number of significant characters in the last byte. I.e. that is an implementation detail I left out on purpose.
If you mean comparision for equality, then putting the count into the first byte will be a lot faster for 75% of different-length strings and not much slower for same-length strings. Only if you want to know the first position at which they differ is the last byte better.
The advantage I see for last byte is that you can more easily append or cut from the end of the compressed string
| [reply] [Watch: Dir/Any] |
|
Re: Unpack Fail to Decompress String?
by diotalevi (Canon) on Jan 10, 2009 at 06:26 UTC
|
You don't need to make a bitstring. Replace input, chunks at a time.
my %chunkmap =
map { $_ => pack "b*", @charmap{split //,$_} }
map { glob }
map { ("{A,C,G,T}" x $_) . ("A"x(4-$_)) }
1..4;
$string =~ s/(.{1,4})/$chunkmap{$1}/g;
| [reply] [Watch: Dir/Any] [d/l] |
|
|