Don't ask to ask, just ask  
PerlMonks 
Re: Data compression by 50% + : is it possible?by davido (Cardinal) 
on May 12, 2019 at 02:24 UTC ( #1233628=note: print w/replies, xml )  Need Help?? 
If no number can be repeated more than once per line then each line can be represented in 90 bits. Bit 0 represents 1, bit 1 represents 2, and bit 89 (the 90th bit) represents 90. A 1 in bit 19 means that there was a 20 in the data set. You said you don't need to preserve order (it would be cool, right but not a requirement). Reconstructing the data set will provide the values in numeric order. Here's a proof of concept:
Each 80 column line could not take more than 96 bits (we waste 6 just to keep it simple). Update: If the OP's data looks like what is generated (but with the additional constraint of not having the same value more than one time on a given line) this solution works, but may not achieve an adequate level of compression. Essentially, the longer the line, the better the ratio. The longest any line could be without repeating any ordinal value from 1 to 90 would be 90 characters. In that case, we can achieve almost an 7.5:1 compression ratio using a bit vector and throwing order out the window. But the shorter the input line, the lower the ratio would be. If the line averages twelve characters we have no compression, but still lose order. If the line averages ten characters, we have expansion. So for a 2:1 compression ratio the average line length would need to be 24 characters using my technique. Here's a breakdown: length  original bit size  stored bit size  compression ratio  1  8  96  1:12 (expansion) 4  32  96  1:3 (expansion) 6  48  96  1:2 (expansion) 12  96  96  1:1 14  112  96  1.16:1 (14% compression) 24  192  96  2:1 (50% compression) 36  288  96  3:1 48  384  96  4:1 (75% compression) 60  480  96  5:1 72  576  96  6:1 84  672  96  7:1 90(max) 720  96  7.5:1 So breakeven is 12 values. 50% size reduction comes at 24 values. 75% size reduction comes at 48 values per line. This solution will only meet the OP's requirement of a 50% size reduction if the average line length is values. As an artifact of using 96 bits to represent 90 values, we have six bits left over to use to encode newline, and even undef, if we want. Also, if the OP's data set does contain newlines (which we can encode as bit 90, the newline gets a 50% reduction, but still fits within the 96 bits, so serves to improve our compression ratio. If the average line length is 12, plus a newline (which is two characters), we are compressing 14 characters (112 bits) into 96 bits of space, so 1.16:1 compression. Nothing to write home about. Dave
In Section
Seekers of Perl Wisdom

