Perl Monk, Perl Meditation PerlMonks

### Comment on

 Need Help??
All,
I have presented a contest to my team to create the best compression algorithm they can. I do these contests occassionally to introduce them to new concepts, get their creative juices flowing and come up with innovative ways of dealing with "big data". While reading about existing compression algorithms is not prohibited - it is discouraged in the spirit of the competition. I was surprised then when one of them was on their way to discovering Huffman Coding on their own (he was trying a top down approach rather than a bottom up).

For each solution they come up with, I am attempting to improve on it. With the Huffman Coding problem I wanted to take any "holes" in the tree and plug them with some high frequency character tuples. I ran into problems which you can read about here. In a nutshell, adding to the finished tree changes the frequencies which requires the tree to be rebuilt which in turn requires adding to the new tree (infinite recursion). This reminded me of a problem someone else on the team is having.

The person having the problem is scanning the file to identify character tuples that appear frequently to encode them. Due to the math involved, he is limited to looking at sequences no longer than 4 characters. His problem is how to deal with a constantly changing frequency list:

```THEN
HE
ENDR
Let's say that when comparing the frequencies 'HE' is the biggest bang for the buck. After substituting it, THEN needs to be knocked out of the list entirely. Once THEN is knocked out, ENDR may have to be reduced by whatever THEN was.

I showed him a way to deal with the problem on the small scale and also suggested he look into Suffix Trees. My question to you is - assuming that there are 96 different possible characters, is there a way to do this efficiently?

Update: I have also just realized that this problem is even more difficult. Because every choice affects the frequency of other entries - it is possible the most effective first choice isn't to choose number 1 from the list but to choose a combination of the lower choices.

Update 2: While I alluded to the fact that he will not be coding for all tuples - just the most popular, I should state it explicitly. There are roughly 100 single byte symbols in the input. He intends to use 7 bits to represent them and use the remaining 30 or so possible 7 bits to encode longer byte sequences. His problem is identifying which ones are most advantageous

Cheers - L~R

In reply to Dynamically Updating Frequency Analysis by Limbic~Region

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

• Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
• Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
• Read Where should I post X? if you're not absolutely sure you're posting in the right place.
• Posts may use any of the Perl Monks Approved HTML tags:
a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
• You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
 For: Use: & & < < > > [ [ ] ]
• Link using PerlMonks shortcuts! What shortcuts can I use for linking?

Create A New User
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (10)
As of 2018-03-20 21:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
When I think of a mole I think of:

Results (259 votes). Check out past polls.

Notices?