Perl: the Markov chain saw PerlMonks

### Re^2: Data compression by 50% + : is it possible?

by LanX (Archbishop)
on May 12, 2019 at 00:34 UTC (#1233624)

Supposing your input is correct and that it's truly random, then it should be possible to represent each line with ~7.356 bytes or ~59 bits.

You have 9 groups with 0-3 numbers in the range 2..9.

I.e. each group can be represented by one byte with at most 3 bits set.
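A minimal sketch of that byte representation. The helper names `group_to_byte` and `byte_to_group` are made up for illustration, and the bit layout (bit 0 for digit 2 up to bit 7 for digit 9) is just one possible assumption:

```perl
use strict;
use warnings;

# Hypothetical helpers: bit 0 stands for digit 2, bit 7 for digit 9.
sub group_to_byte {
    my @digits = @_;
    my $byte = 0;
    $byte |= 1 << ($_ - 2) for @digits;
    return $byte;
}

sub byte_to_group {
    my ($byte) = @_;
    return grep { $byte & (1 << ($_ - 2)) } 2 .. 9;
}

my $byte    = group_to_byte(2, 5, 9);
my @decoded = byte_to_group($byte);

printf "%08b\n", $byte;          # bit pattern of the group
print join(",", @decoded), "\n"; # digits recovered from the byte
```

This still spends a full byte per group, i.e. 9 bytes per line; the counting argument in this post is what brings it below that.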

There are only 93=56+28+8+1 such combinations possible.

ln(93^9)/ln(256) = 9*ln(93)/ln(256) = 7.35655366 bytes per line
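The arithmetic can be checked with a few lines of Perl. This is only a sketch of the counting argument (subsets of 8 digits with 0 to 3 members, 9 groups per line); `choose` is a small helper written here, not something from the original thread:

```perl
use strict;
use warnings;

# Binomial coefficient C(n, k), computed iteratively.
sub choose {
    my ($n, $k) = @_;
    my $result = 1;
    $result = $result * ($n - $_ + 1) / $_ for 1 .. $k;
    return $result;
}

# Groups hold 0..3 distinct digits out of the 8 digits 2..9.
my $per_group = 0;
$per_group += choose(8, $_) for 0 .. 3;    # 1 + 8 + 28 + 56

# A line has 9 groups, so it has $per_group ** 9 possible values.
my $bits  = 9 * log($per_group) / log(2);
my $bytes = $bits / 8;

printf "%d combinations, %.2f bits, %.3f bytes per line\n",
    $per_group, $bits, $bytes;
```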

At the moment you need ~2.5 characters per group on average, (56*3+28*2+8*1)/93, which results in ~22.5 characters per line.

7.356 bytes is about one third of that. So even with a non-binary representation you should achieve your 50 percent or better.
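For the record, the "one third" can be reproduced like this. It is a sketch that assumes all 93 combinations are equally likely and that one character equals one byte in the current format:

```perl
use strict;
use warnings;

# Number of combinations per group size (0..3 digits out of 8).
my %combos_by_size = (0 => 1, 1 => 8, 2 => 28, 3 => 56);

my ($total, $chars) = (0, 0);
for my $size (keys %combos_by_size) {
    $total += $combos_by_size{$size};
    $chars += $size * $combos_by_size{$size};
}

my $avg_per_group = $chars / $total;           # 232 / 93
my $avg_per_line  = 9 * $avg_per_group;
my $packed_bytes  = 9 * log($total) / log(256);
my $ratio         = $packed_bytes / $avg_per_line;

printf "%.3f chars/group, %.2f chars/line, ratio %.3f\n",
    $avg_per_group, $avg_per_line, $ratio;
```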

This can only be improved if the combinations don't have the same likelihood.

I don't wanna dig deeper because I don't trust your code and smell an xy problem here.

##### Update

I just realised that you are forbidding consecutive numbers in your if condition. I.e. (2,3,9) is never possible.

This will change the math, but the approach is the same.

Roboticus said you need 15 characters on average. 7.4 bytes per line is just an upper bound, so 50% is easily reached.

I don't want to calculate it again! It would need to be done programmatically.
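One way to do it programmatically is to walk over all 256 byte values and keep those that qualify. This is a sketch under my reading of the restriction, namely that a group holds at most 3 digits from 2..9 and no two of them are adjacent; the bit layout (bit 0 for digit 2) is an assumption:

```perl
use strict;
use warnings;

my @by_size = (0) x 4;    # number of valid groups with 0, 1, 2, 3 digits

for my $mask (0 .. 255) {
    next if $mask & ($mask << 1);     # two adjacent digits are set
    my $binary = sprintf "%b", $mask;
    my $bits   = ($binary =~ tr/1//); # count the set bits
    next if $bits > 3;                # at most 3 digits per group
    $by_size[$bits]++;
}

my $combos = 0;
$combos += $_ for @by_size;

my $bits_per_line = 9 * log($combos) / log(2);

print "by size: @by_size, total: $combos\n";
printf "%.2f bits = %.2f bytes per line\n", $bits_per_line, $bits_per_line / 8;
```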

(But I don't trust your code anyway ;)

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery | FootballPerl is like chess, only without the dice

### Re^3: Data compression by 50% + : is it possible?

by LanX (Archbishop)
on May 12, 2019 at 02:26 UTC

Instead of calculating I wrote a little script counting the probabilities of group combinations.

Roboticus was right, you need about 14.06 characters per line on average, plus "\n".

Only 1 + 8 + 21 + 20 = 50 combinations are possible per group, resulting in a conservative compression to 50.79 bits = 6.35 bytes per line, which already means a compression to about 42% of the original size, i.e. a 58% win.

But if you look into the likelihood of those combinations you clearly see that a Huffman encoding would result in an even better ratio.

I think that's then near the theoretical optimum. (Further reading: Huffman_coding_with_unequal_letter_costs)
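To put a number on the Huffman idea, here is a rough sketch. It rebuilds the frequency table with the same rule as the counting script in this post (four digits 0..9, sorted, keep a digit only if it is more than 1 above its predecessor), then derives Huffman code lengths with a naive sort-based merge. It treats each group as one symbol and ignores the cost of storing the code table; both are simplifying assumptions of mine:

```perl
use strict;
use warnings;

# Rebuild the frequency table over all 10**4 digit quadruples.
my %count;
for my $n (0 .. 9999) {
    my @c = sort { $a <=> $b } split //, sprintf "%04d", $n;
    my @allowed = map { $c[$_] } grep { $c[$_] > $c[$_ - 1] + 1 } 1 .. 3;
    $count{ join "", @allowed }++;
}

my $total = 0;
$total += $_ for values %count;

# Each node is [weight, [symbols below it]]. Merging the two lightest
# nodes adds one bit to the code of every symbol below them.
my %bits  = map { $_ => 0 } keys %count;
my @nodes = map { [ $count{$_}, [$_] ] } keys %count;
while (@nodes > 1) {
    @nodes = sort { $a->[0] <=> $b->[0] } @nodes;
    my ($x, $y) = splice @nodes, 0, 2;
    $bits{$_}++ for @{ $x->[1] }, @{ $y->[1] };
    push @nodes, [ $x->[0] + $y->[0], [ @{ $x->[1] }, @{ $y->[1] } ] ];
}

# Average Huffman code length and Shannon entropy, both in bits per group.
my ($huffman, $entropy) = (0, 0);
for my $k (keys %count) {
    my $p = $count{$k} / $total;
    $huffman += $p * $bits{$k};
    $entropy -= $p * log($p) / log(2);
}

printf "Huffman: %.3f bits/group, entropy: %.3f bits/group, equal likelihood: %.3f bits/group\n",
    $huffman, $entropy, log(scalar keys %count) / log(2);
```

Multiply the per-group figures by 9 to compare them with the 50.79 bits per line from the equal-likelihood estimate.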

All this supposing your input was real... ;-)

```
use strict;
use warnings;
use Data::Dump qw/pp/;

# How often does each group string result from all 10**4 digit quadruples?
my %count;

for my $c0 (0..9) {
  for my $c1 (0..9) {
    for my $c2 (0..9) {
      for my $c3 (0..9) {
        my @c = sort { $a <=> $b } ($c0, $c1, $c2, $c3);
        #print "@c\t:\t";
        my @allowed;
        for my $i (1..3) {
          # keep a digit only if it neither repeats nor directly follows its predecessor
          if ( $c[$i] != $c[$i-1] && $c[$i] != $c[$i-1]+1 ) {
            push @allowed, $c[$i];
          }
        }
        #print "@allowed\n";
        $count{join "", @allowed}++;
      }
    }
  }
}

# Number of distinct groups per length and average group length.
my @length;
my $average = 0;
for my $k (keys %count) {
  my $len = length $k;
  $length[$len]++;
  $average += $len * $count{$k} / 10000;
}

warn '@length: ', pp \@length;
warn 'average #characters: group/line', pp [$average, $average*9];

my $combies = 0;
$combies += $length[$_] for 0..3;
#$combies=93;

warn "# possible combinations: ", $combies;

# Bits per line if all combinations were equally likely (9 groups per line).
my $upper_bound = log($combies)/log(2)*9;
warn 'Upper bound bits, bytes', pp [$upper_bound, $upper_bound/8];

#warn "ranking", pp [ sort {$b <=> $a} values %count ];

warn 'probabilities: ', pp \%count;

```

```
@length: [1, 8, 21, 20] at /tmp/compress.pl line 36.
average #characters: group/line[1.5624, 14.0616] at /tmp/compress.pl line 37.
# possible combinations: 50 at /tmp/compress.pl line 44.
Upper bound bits, bytes[50.7947057079725, 6.34933821349656] at /tmp/compress.pl line 48.
probabilities: {
  ""    => 592,
  "2"   => 74,
  "24"  => 60,
  "246" => 24,
  "247" => 24,
  "248" => 24,
  "249" => 24,
  "25"  => 84,
  "257" => 24,
  "258" => 24,
  "259" => 24,
  "26"  => 84,
  "268" => 24,
  "269" => 24,
  "27"  => 84,
  "279" => 24,
  "28"  => 84,
  "29"  => 60,
  "3"   => 208,
  "35"  => 144,
  "357" => 48,
  "358" => 48,
  "359" => 48,
  "36"  => 192,
  "368" => 48,
  "369" => 48,
  "37"  => 192,
  "379" => 48,
  "38"  => 192,
  "39"  => 144,
  "4"   => 366,
  "46"  => 228,
  "468" => 72,
  "469" => 72,
  "47"  => 300,
  "479" => 72,
  "48"  => 300,
  "49"  => 228,
  "5"   => 524,
  "57"  => 312,
  "579" => 96,
  "58"  => 408,
  "59"  => 312,
  "6"   => 682,
  "68"  => 396,
  "69"  => 396,
  "7"   => 840,
  "79"  => 336,
  "8"   => 830,
  "9"   => 508,
} at /tmp/compress.pl line 52.
```

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery | FootballPerl is like chess, only without the dice

##### update

I just realized that Roboticus already had the same basic ideas here: Re: Data compression by 50% + : is it possible?
