Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Re: Data compression by 50% + : is it possible?

by roboticus (Chancellor)
on May 11, 2019 at 23:09 UTC ( #1233619=note: print w/replies, xml ) Need Help??


in reply to Data compression by 50% + : is it possible?

baxy77bax:

That depends on how much redundant information there is in your data. The more redundancy you have, the easier it is to squeeze it.

So the first thing you ought to do is find out how random your data appears to be. I rand your script a few times, and it seems to generate about 15 characters per line.

I did a little hacking on your inner loop and discovered that due to your if statements in the encoding of your array @c, there are only 50 different possibilities for it, which you could encode easily in 1 character (six bits). Since you've got 9 iterations of that loop, each line could be encoded as 9 characters. That's not quite enough to crunch out half the space, but it should get you a good start.

The 50 different possibilities have a fairly non-uniform distribution, where the most common 4 should appear 25% of the time, and the most common 9 appear over 50% of the time, so there's definitely some room to squeeze out even more space.

use strict; use warnings; my %cnt; # Generate all possibilities: for my $a1 (0 .. 9) { for my $a2 (0 .. 9) { for my $a3 (0 .. 9) { for my $a4 (0 .. 9) { my @c = sort ($a1, $a2, $a3, $a4); my $s = toStr(@c); $cnt{$s}++; } } } } my $ttl = 0; print <<EOHDR; num pct ttl ttl pct offsets to print ----- ------- ----- ------- ---------------- EOHDR for my $k (sort { $cnt{$b} <=> $cnt{$a} } keys %cnt) { $ttl += $cnt{$k}; printf "%5u %6.2f %5u %6.2f <%s>\n", $cnt{$k}, 100*$cnt{$k}/10000.0, $ttl, 100*$ttl/10000.0, $k; } sub toStr { my @c = @_; my @rv = (); for (my $i=1; $i < @c; $i++) { if ($c[$i] != $c[$i-1] && $c[$i] != $c[$i-1]+1) { push @rv, $c[$i]; } } return join(":",@rv); }

When I ran it, it shows:

$ perl funky.pl num pct ttl ttl pct offsets to print ----- ------- ----- ------- ---------------- 840 8.40 840 8.40 <7> 830 8.30 1670 16.70 <8> 682 6.82 2352 23.52 <6> 592 5.92 2944 29.44 <> 524 5.24 3468 34.68 <5> 508 5.08 3976 39.76 <9> 408 4.08 4384 43.84 <5:8> 396 3.96 4780 47.80 <6:9> 396 3.96 5176 51.76 <6:8> 366 3.66 5542 55.42 <4> 336 3.36 5878 58.78 <7:9> 312 3.12 6190 61.90 <5:7>

What the table means is that the most common result would be that the array @c would yield chr(33 + 7 + $x), and would do so 8.4% of the time. The fourth row shows that the array @c would print nothing at all 5.92% of the time. The ninth row shows that we would print chr(33 + 6 + $x), chr(33 + 8 + $x) 3.96% of the time.

By using the table and encoding the values in 9 characters, you could also omit the "\n" and read the data as fixed length records, if you don't want to try to find out a method to squeeze out a bit more. I'd suggest reading Huffman_coding and/or Arithmetic_coding for other ideas.

Edit: tweaked a phrase, fixed the links to wikipedia articles.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^2: Data compression by 50% + : is it possible?
by LanX (Archbishop) on May 12, 2019 at 10:10 UTC
    Hello Roboticus,

    I just realized that we had the same ideas (me just later ;).

    One difference is that you want to encode each of the 9 groups individually with 6 bits each => 6*9=54 bits per line while I'm encoding a whole line as a polynom

    Sum($g($i) * 50**$i) with $i=0 .. 8

    resulting in the need of 51 bits per line.

    But I don't understand why you say

    >  That's not quite enough to crunch out half the space

    The old encoding needs per line in average >14 bytes plus newline.

    That's > 15 * 8 = 120 bits

    So your 54 bits need already less than half the space!

    What am I missing?

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      LanX:

      You're not missing anything that I know of. What I was basing my "not quite" phrasing on is the idea of using a single character to encode each group (@c) into a character, so it would use 9 characters (72) bits. Had I thought of just packing the required 51 bit records together, it would be more than sufficient to get 50% compression, as the file would take 635 bytes to encode 100 records (sans newlines).

      (The 51 bits came from: 50 different possibilities for each group of 10 in the inner loop (log2(50) == 5.64.. bits/group) * (9 groups) == 50.79 bits.)

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

        >  using a single character to encode each group (@c) into a character, so it would use 9 characters (72) bits

        Oh I see, but I hope you are aware that your approach can easily be packed into 9*6=54 bits and is easier to code than mine.

        With a per line ratio of 7 bytes = 56 bits you'll already have a 50%+ compression.

        My approach would require modulo 50 calculations on 51 bit integers, not sure how tricky this is on a 32 bit machine.

        So I'd rather "waste" 5 bits/line for a pragmatic solution.

        Btw: I'm reluctant trying to implement Huffman here, because the OP could probably change the parameters in his next post.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1233619]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (5)
As of 2019-06-25 00:04 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Is there a future for codeless software?



    Results (100 votes). Check out past polls.

    Notices?