I have presented a contest to my team to create the best compression algorithm they can. I do these contests occassionally to introduce them to new concepts, get their creative juices flowing and come up with innovative ways of dealing with "big data". While reading about existing compression algorithms is not prohibited - it is discouraged in the spirit of the competition. I was surprised then when one of them was on their way to discovering Huffman Coding on their own (he was trying a top down approach rather than a bottom up).
For each solution they come up with, I am attempting to improve on it. With the Huffman Coding problem I wanted to take any "holes" in the tree and plug them with some high frequency character tuples. I ran into problems which you can read about here. In a nutshell, adding to the finished tree changes the frequencies which requires the tree to be rebuilt which in turn requires adding to the new tree (infinite recursion). This reminded me of a problem someone else on the team is having.
The person having the problem is scanning the file to identify character tuples that appear frequently to encode them. Due to the math involved, he is limited to looking at sequences no longer than 4 characters. His problem is how to deal with a constantly changing frequency list:
Let's say that when comparing the frequencies 'HE' is the biggest bang for the buck. After substituting it, THEN needs to be knocked out of the list entirely. Once THEN is knocked out, ENDR may have to be reduced by whatever THEN was.
I showed him a way to deal with the problem on the small scale and also suggested he look into Suffix Trees. My question to you is - assuming that there are 96 different possible characters, is there a way to do this efficiently?
Update: I have also just realized that this problem is even more difficult. Because every choice affects the frequency of other entries - it is possible the most effective first choice isn't to choose number 1 from the list but to choose a combination of the lower choices.
Update 2: While I alluded to the fact that he will not be coding for all tuples - just the most popular, I should state it explicitly. There are roughly 100 single byte symbols in the input. He intends to use 7 bits to represent them and use the remaining 30 or so possible 7 bits to encode longer byte sequences. His problem is identifying which ones are most advantageous
Cheers - L~R