http://www.perlmonks.org?node_id=1031909

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
I have presented a contest to my team to create the best compression algorithm they can. I do these contests occasionally to introduce them to new concepts, get their creative juices flowing, and come up with innovative ways of dealing with "big data". While reading about existing compression algorithms is not prohibited, it is discouraged in the spirit of the competition. I was surprised, then, when one of them was on his way to discovering Huffman Coding on his own (he was trying a top-down approach rather than a bottom-up one).

I am attempting to improve on each solution they come up with. With the Huffman Coding problem, I wanted to take any "holes" in the tree and plug them with some high-frequency character tuples. I ran into problems, which you can read about here. In a nutshell, adding to the finished tree changes the frequencies, which requires the tree to be rebuilt, which in turn requires adding to the new tree (infinite recursion). This reminded me of a problem someone else on the team is having.

The person having the problem is scanning the file to identify character tuples that appear frequently, so that he can encode them. Due to the math involved, he is limited to looking at sequences no longer than 4 characters. His problem is how to deal with a constantly changing frequency list:

THEN HE ENDR
Let's say that when comparing the frequencies, 'HE' is the biggest bang for the buck. After substituting it, THEN needs to be knocked out of the list entirely. Once THEN is knocked out, ENDR's count may have to be reduced by whatever THEN's was.
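For concreteness, the naive approach is simply to recount after every substitution. A minimal sketch (scoring candidates by count times characters-saved is just one plausible ranking, not necessarily his):

    use strict;
    use warnings;

    # count every 2..4 character tuple in $text, overlaps included
    sub tuple_counts {
        my ($text) = @_;
        my %count;
        for my $n (2 .. 4) {
            $count{ substr($text, $_, $n) }++ for 0 .. length($text) - $n;
        }
        return \%count;
    }

    my $text = 'THEN HE ENDR';
    while (1) {
        my $count = tuple_counts($text);    # rebuilt from scratch each pass
        my ($best) = sort {                 # rank by occurrences * chars saved
            $count->{$b} * (length($b) - 1) <=> $count->{$a} * (length($a) - 1)
        } keys %$count;
        last if !defined $best || $count->{$best} < 2;
        print "substituting '$best' ($count->{$best} occurrences)\n";
        $text =~ s/\Q$best\E/#/g;           # '#' stands in for a dictionary code
    }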

I showed him a way to deal with the problem on the small scale and also suggested he look into Suffix Trees. My question to you is - assuming that there are 96 different possible characters, is there a way to do this efficiently?

Update: I have also just realized that this problem is even more difficult. Because every choice affects the frequencies of the other entries, it is possible that the most effective first choice isn't number 1 on the list but some combination of the lower choices.

Update 2: While I alluded to the fact that he will not be coding for all tuples - just the most popular ones - I should state it explicitly. There are roughly 100 single-byte symbols in the input. He intends to use 7 bits to represent them and use the remaining 30 or so 7-bit codes to encode longer byte sequences. His problem is identifying which ones are most advantageous.
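The raw arithmetic behind "most advantageous" is at least easy to state, even if choosing the optimal set isn't: a tuple of length k occurring f times saves 7*f*(k-1) bits once it gets its own 7-bit code. A sketch of ranking on that basis (the %freq numbers are invented):

    use strict;
    use warnings;

    # hypothetical counts: tuple => non-overlapping occurrences
    my %freq = ( 'THE ' => 120, 'HE' => 300, 'ING' => 80 );

    # a k-character tuple shrinks from 7*k bits to 7 bits per occurrence
    my %saved = map { $_ => 7 * $freq{$_} * (length($_) - 1) } keys %freq;

    # hand the ~30 spare codes to the biggest savers
    my @ranked = sort { $saved{$b} <=> $saved{$a} } keys %saved;
    splice @ranked, 30 if @ranked > 30;
    print "'$_' saves $saved{$_} bits\n" for @ranked;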

Cheers - L~R

Re: Dynamically Updating Frequency Analysis
by BrowserUk (Patriarch) on May 08, 2013 at 11:11 UTC

    This morning I was discussing memories of Byte, how it was far superior to anything else then or since, and how the 2011 "re-launch" completely missed the essence of what made Byte stand out from the crowd.

    One of the articles that came back to mind was a very detailed discussion of the LZW algorithm. And that struck a chord with the stuff you've been talking about. Just a thought.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      I don't specifically remember Byte, but I do remember its like. I have fond and frustrating memories of painfully entering programs into the computer, saving them, and attempting to run them only to get cryptic errors and have to start over. Sometimes a subsequent issue would carry a correction.

      I like LZW and the adaptive variations that have evolved from it. At some point when this is all over, I will walk the team through Huffman Coding and LZW as well as some others. If you haven't seen my response here, it may be worth a read.

      Cheers - L~R

        If you haven't seen my response here, it may be worth a read.

        Interesting. What your guy seems to be trying to do reminds me of 3 months of wasted work a few years ago.

        The specification for a proprietary transmission protocol (over RS232) called for each block to carry a checksum. But the bright spark that specified it decided that the checksum should be calculated such that it included its own bytes. Thus a packet looked like:

        #                  payload                   checksum
        xx xx xx xx xx xx xx xx xx xx ... xx xx xx  cc cc cc cc

        The idea was that when verifying the checksum upon receipt, the entire packet (including checksum) could be passed to the checksum algorithm and the result compared against the unsigned integer in the last four bytes.

        Problem: The very thing that makes cryptographic digests of plain text messages work -- the insane combinatorial explosion of work required to find the appropriate padding to make the fake message match the digest -- makes this almost impossible to calculate.
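        To make the shape of it concrete, a sketch of the two sides; checksum32() here is just an additive stand-in I've made up for the proprietary algorithm:

        use strict;
        use warnings;

        # stand-in for the proprietary algorithm: a 32-bit additive sum
        sub checksum32 {
            my ($bytes) = @_;
            my $sum = 0;
            $sum = ($sum + $_) & 0xFFFFFFFF for unpack 'C*', $bytes;
            return $sum;
        }

        # receive side, per the spec: feed the WHOLE packet, checksum
        # bytes included, through the algorithm and compare
        sub packet_ok {
            my ($packet) = @_;
            my $claimed = unpack 'N', substr($packet, -4);
            return checksum32($packet) == $claimed;
        }

        # send side is the killer: find $cc such that
        #   checksum32($payload . pack 'N', $cc) == $cc
        # for anything digest-like, that is a brute-force search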

        I suspect your guy's problem suffers similarly.


Re: Dynamically Updating Frequency Analysis
by demerphq (Chancellor) on May 09, 2013 at 11:37 UTC

    Hi Limbic. Good to speak to you again. :-) I don't understand some things you have said.

    With the Huffman Coding problem I wanted to take any "holes" in the tree and plug them with some high frequency character tuples.

    The reason I don't get this is that normally there are no holes in a properly constructed Huffman tree. The tree is "full": every node either has two child nodes or is a leaf node with no children; in other words, there are no nodes with only one child. This is a necessary property of Huffman encoding.
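    For what it's worth, with nodes shaped [ $symbol_or_pair, $count ] (as in the construction code further down), fullness is a three-line check; a quick sketch:

    # a proper Huffman tree is "full": every node is a leaf or has
    # exactly two children
    sub is_full {
        my ($node) = @_;
        return 1 unless ref $node->[0];          # leaf node
        my ($left, $right) = @{ $node->[0] };
        return defined $left && defined $right   # exactly two children
            && is_full($left) && is_full($right);
    }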

    Your updates seem to suggest that you aren't going to be doing huffman encoding at all, but some kind of dictionary mapping to fixed size output tokens. Also you said:

    when comparing the frequencies 'HE' is the biggest bang for the buck. After substituting it, THEN needs to be knocked out of the list entirely.

    It is not clear why this is so. Is it because you aren't allowed overlapping sequences, or for some other reason? Can you explain more?

    Anyway, it sounds like this problem is NP-hard, so if you really want the perfect solution you will have to brute-force it. IMO you don't need to; I should think you can find a "good enough" solution within a couple of iterations, especially as you say a certain set must appear in the table.

    ---
    $world=~s/war/peace/g

      demerphq,
      Nice speaking with you again as well. I completely understand how confusing this must be to everyone else, since I am having a hard time keeping it straight myself. I have multiple people on my team, each coming up with different ideas, and I have my own. They overlap.

      The reference to the Huffman Coding and "plugging holes" has nothing to do with this post - it deals with Challenge: Optimal Animals/Pangolins Strategy. You are absolutely correct about there not being any holes. The holes come in when you don't "properly construct" the tree. Somewhere in one of these threads, I mentioned that he is populating the tree top down instead of bottom up. I have read that Huffman's professor's top-down approach had the same shortcoming - it isn't optimal. I am sure this has confused some people, because I have continued to refer to holes and Huffman Coding together rather than saying Huffman Coding-ish. I originally didn't mention Huffman Coding at all because that wasn't the road I was trying to go down - it just seemed like the solutions were equivalent. They are not.

      The reason this thread doesn't look like it has anything to do with Huffman Coding is that it isn't Huffman Coding. This guy's idea was to map all 96 symbols to 7-bit codes and use the 32 remaining 7-bit sequences to encode some popular multi-character tuples.

      You asked in reference to THEN and HE:

      It is not clear why this is so. Is this because you aren't allowed overlapping sequences? Or some other reason. Can you explain more?

      It is because HE is a complete substring of THEN. Once HE no longer exists anywhere in the file, there will no longer be any THEN to replace. Imagine the following dictionary:

      0000000 = T
      0000001 = H
      0000010 = E
      0000011 = N
      0000100 = HE
      0000101 = THEN
      0000110 = <space>
      Now imagine the file we are compressing contains
      THE HEN THEN
      We note that HE appears 3 times and will give us the largest compression (I have used # to indicate the data has already been compressed).
      T# #N T#N

      There is no longer a THEN to consider.
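      The effect is only a few lines to reproduce:

      use strict;
      use warnings;

      my $text = 'THE HEN THEN';
      my $then_before = () = $text =~ /THEN/g;   # 1
      $text =~ s/HE/#/g;                         # now 'T# #N T#N'
      my $then_after  = () = $text =~ /THEN/g;   # 0: substituting HE destroyed it
      print "THEN count: $then_before -> $then_after ($text)\n";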

      Anyway, it sounds like this problem is NP hard

      It certainly does, which is why I think a heuristic solution would work. I just need to think about BrowserUk's concerns.

      Cheers - L~R

        We note that HE appears 3 times and will give us the largest compression (I have used # to indicate the data has already been compressed).

        But that still doesn't make sense to me. Why would you encode all the "HE"s when "THEN" is also in your dictionary? Of the choices you could make here, that seems the strangest option. For instance, if you were taking that approach, I would expect you to encode from longest to shortest. In that case the THEN would match first, and then HE would match twice, which is actually the optimal match for this string and dictionary. That is, "THE HEN THEN" would turn into "T" "HE" "HE" "N" "THEN", which at 5 symbols is shorter than the 7 that make up "T" "HE" "HE" "N" "T" "HE" "N".
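        A minimal sketch of that longest-entry-first greedy scan (I've added the space to the dictionary so it shows up in the token list alongside the 5 symbols):

        use strict;
        use warnings;

        my @dict = sort { length($b) <=> length($a) }
                   ('THEN', 'HE', 'T', 'H', 'E', 'N', ' ');

        sub greedy_parse {
            my ($text) = @_;
            my @out;
            TOKEN: while (length $text) {
                for my $tok (@dict) {                       # longest entries first
                    if ($tok eq substr($text, 0, length $tok)) {
                        push @out, $tok;
                        substr($text, 0, length($tok), ''); # consume it
                        next TOKEN;
                    }
                }
                die "no dictionary entry matches: $text";
            }
            return @out;
        }

        print join('|', greedy_parse('THE HEN THEN')), "\n";
        # T|HE| |HE|N| |THEN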

        Also consider that encoding a single THEN wins you more than encoding 3 HE's with the type of scheme you have (4:1 versus 2:1), which is not a problem you would have with Huffman encoding, where the bit sequence for HE might be shorter than that of THEN. Perhaps you need to account for the length of the string being replaced when you do your frequency analysis.

        Anyway, there are other issues here. Once you allow overlapping strings you have a problem that is much harder. A simple greedy algorithm will not necessarily find the optimal encoding of the string. For that you have to do exhaustive search. Luckily greedy solutions tend to be "close enough" most of the time.

        If I may be so bold, encouraging people to implement algorithms without any domain research is not a bad starting point for a learning exercise, but I think you should follow up with some proper study of a few important algorithms. In this case I would suggest really studying Huffman encoding and LZ. Both are straightforward to implement in Perl, and Huffman encoding in particular is nearly trivial. Here is the tree construction algorithm:

        sub build_huffman_tree {
            my ($dict_hash)= @_;
            # @nodes contains either:
            #   leaves: [ $non_ref_symbol, $count ]
            #   nodes:  [ [ $left_node, $right_node ], $left_count + $right_count ]
            # note: we assume the counts in the hash are > 0.
            my @nodes= map { [ $_, $dict_hash->{$_} ] } keys %$dict_hash;
            while (@nodes > 1) {
                @nodes= sort { $b->[1] <=> $a->[1] } @nodes;
                my $right= pop @nodes;
                my $left= pop @nodes;
                push @nodes, [ [ $left, $right ], $left->[1] + $right->[1] ];
            }
            return $nodes[0];
        }
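        A quick way to see it run, if it helps, with character counts taken from the "THE HEN THEN" example above:

        use Data::Dumper;

        # character frequencies of 'THE HEN THEN'
        my %freq = ( T => 2, H => 3, E => 3, N => 2, ' ' => 2 );
        my $tree = build_huffman_tree(\%freq);
        print Dumper($tree);   # nested [ [ $left, $right ], $count ] structure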

        Doing the rest is a good exercise for your people and will teach them more than struggling with their own algorithms without understanding these ones first.


Re: Dynamically Updating Frequency Analysis
by Voronich (Hermit) on May 07, 2013 at 14:02 UTC
    Oh, that's the XYZ problem. It has been proven to be NP-complete, but the ABC heuristic approach is likely of value.

    here to help o/

    Alright, alright. CB smartassery aside, this reminds me of the natural language text indexing projects I embarked upon when I was a puppy. It sounds like he's going too far at once. Why not emit the n-grams (for n = 1..4) into a frequency distribution hash, then worry about building the tree out of it?
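    Something like this for the distribution pass, borrowing the "THE HEN THEN" string from elsewhere in the thread:

    use strict;
    use warnings;

    my $text = 'THE HEN THEN';
    my %ngram;
    for my $n (1 .. 4) {                 # n-grams of length 1 through 4
        $ngram{ substr($text, $_, $n) }++ for 0 .. length($text) - $n;
    }
    print "'$_' => $ngram{$_}\n"
        for sort { $ngram{$b} <=> $ngram{$a} } keys %ngram;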

      Voronich,
      I am afraid I don't have the first clue how to apply what you said. To make sure we are speaking the same language: I believe your use of 'n-gram' matches my use of 'tuple'; if you mean something else, please let me know. Assuming that is correct, he is already that far along, and a hand-wavy "then worry about building the tree" doesn't help.

      Perhaps you could explain what you mean using a very short example:

      Where do you want to go? Over there. Then go already.

      Cheers - L~R

        Note: This almost certainly means that I didn't understand the problem. I wrote a reply yesterday and seem to have lost it in the aether.

        Looking at it again with sleep and caffeine, I realize I'm missing something. I was talking in terms of building the initial tree and read right past your mention of filling in gaps, which I didn't really understand.

        I don't understand why there's a 'changing frequency list'. For a single body of text there's only one frequency list, no?