Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

compression challenge

by cLive ;-) (Parson)
on Apr 25, 2001 at 02:18 UTC ( #75290=perlmeditation: print w/ replies, xml ) Need Help??

I've put this here, coz I'm not sure which section to post in....

Just spent the worst part of an hour on this, only to spot an obvious problem - chances of matching n bytes is 256^n, not 256^(n-1). D'oh ;-)

So this is NOT A SUGGESTED SOLUTION!, just my thought processes before I went wrong...

The challenge, discussed on /.:

    I will attach a prize of $5,000 to anyone who successfully meets this
    challenge.  First, the contestant will tell me HOW LONG of a data file to
    generate.  Second, I will generate the data file, and send it to the
    contestant.  Last, the contestant will send me a decompressor and a
    compressed file, which will together total in size less than the original
    data file, and which will be able to restore the compressed file to the
    original state.

    With this offer, you can tune your algorithm to my data.  You tell me the
    parameters of size in advance.  All I get to do is arrange the bits within
    my file according to the dictates of my whim.  As a processing fee, I will
    require an advance deposit of $100 from any contestant.  This deposit is
    100% refundable if you meet the challenge.

Although it would never get paid (see this :), I couldn't help but wonder. My thoughts...

Compressor:

parse the file starting at byte $i=0

search for $n byte repetitions for $n>1, existing when $i is between 256**($n-2) and 256**($n-1) - 1 inclusive

splice these out of the data and store a placeholder (ie $i) to show where to resplice bytes to.

So, the matches reduce storage space as follows:

               repeat 
position       match    storage

0-255          2 bytes  1 byte
256-65535      3 bytes  2 bytes
65536-16777215 4 bytes  3 bytes
etc

delimit matches of different byte size with a single delimiter, ie:

single byte matches=double byte matches=etc

determine the delimiter based on lowest number of occurences in matches (as these would have to be deleted)

So for a file of length $i and $m matches (not containing delimiter),

bytes saved = $m - (int(log(base 256)$i)

possibly minus another one - can't remember my logs :)

Only problem is that this saves a negative amount on average...

Decompressor:

Take the values stored, and parse. Here's a wordy example to illustrate how I'd decompress... my @matches = split '=', 'single byte matches=double byte matches=etc'; my $i=0; for (@matches) { while(substr($_,0,$i+1,'')) { ... (un?)pack to decimal ... add repetion of length $i+1 at this position in the string } $i++; }

Obviously, this is pseudocode. we'd be dealing with 2 bytes representing a number between 256 and 65535 etc, but I hope you see the idea.


Damn, and I was so sure I was on to something...

Ah well, I really must get back to work, but your thoughts/code welcomed on this...

cLive ;-)

Comment on compression challenge
Re (tilly) 1: compression challenge
by tilly (Archbishop) on Apr 25, 2001 at 07:29 UTC
    Sorry.

    Unless you have a way to hide bits up your sleeve (say as eof markers) this problem is impossible for the vast majority of files of any given size.

Re: compression challenge
by clemburg (Curate) on Apr 25, 2001 at 12:43 UTC

    Yeah, yeah, and your generator is a (pseudo- or even worse: real) random generator. Nope. Read some information theory on why that is not possible without cheating (e.g., tricking the operating system into believing the files have a different size than they have).

    It is really a pity that you said "less than". If you had said "less than or equal" I would have requested a file of size zero and sent two zero size files back. Damn ... ;-) ...

    Christian Lemburg
    Brainbench MVP for Perl
    http://www.brainbench.com

Re: compression challenge
by clemburg (Curate) on Apr 25, 2001 at 13:10 UTC

    Hah, a funny way to cheat dawned upon me. Basically, use a shared storage space to put the info from the compressed file in.

    Both environments (where compression happens and where decompression happens) have to have access to the shared storage space. Then the compressor just puts info into the shared storage, and the decompressor fetches it from there. (Could work for lab assistants just copying challenges from a book?)

    E.g., lets assume we have internet access.

    1. Request a file of length 100 MB (just to make sure we stay in the limit while using a perl with libraries).
    2. Compressor: Copy file to publicly accessible website and put URL in a text file (wow - easy compression that is).
    3. Decompressor: script along the lines of lwp-request from standard distribution that takes its argument from text file from step before.
    4. Ship text file and script together with standard perl distribution to challenger. Unpack and install distribution, fetch file from website. YES!!!

    ;-)

    BTW, CPAN is in a way something like that what you request, with the CPAN shell acting as decompressor, and the "install module" input as the compressed content.

    Christian Lemburg
    Brainbench MVP for Perl
    http://www.brainbench.com

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://75290]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (7)
As of 2014-09-17 10:52 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (72 votes), past polls