Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
All,
I have presented a contest to my team to create the best compression algorithm they can. I do these contests occassionally to introduce them to new concepts, get their creative juices flowing and come up with innovative ways of dealing with "big data". While reading about existing compression algorithms is not prohibited - it is discouraged in the spirit of the competition. I was surprised then when one of them was on their way to discovering Huffman Coding on their own (he was trying a top down approach rather than a bottom up).

For each solution they come up with, I am attempting to improve on it. With the Huffman Coding problem I wanted to take any "holes" in the tree and plug them with some high frequency character tuples. I ran into problems which you can read about here. In a nutshell, adding to the finished tree changes the frequencies which requires the tree to be rebuilt which in turn requires adding to the new tree (infinite recursion). This reminded me of a problem someone else on the team is having.

The person having the problem is scanning the file to identify character tuples that appear frequently to encode them. Due to the math involved, he is limited to looking at sequences no longer than 4 characters. His problem is how to deal with a constantly changing frequency list:

THEN HE ENDR
Let's say that when comparing the frequencies 'HE' is the biggest bang for the buck. After substituting it, THEN needs to be knocked out of the list entirely. Once THEN is knocked out, ENDR may have to be reduced by whatever THEN was.

I showed him a way to deal with the problem on the small scale and also suggested he look into Suffix Trees. My question to you is - assuming that there are 96 different possible characters, is there a way to do this efficiently?

Update: I have also just realized that this problem is even more difficult. Because every choice affects the frequency of other entries - it is possible the most effective first choice isn't to choose number 1 from the list but to choose a combination of the lower choices.

Update 2: While I alluded to the fact that he will not be coding for all tuples - just the most popular, I should state it explicitly. There are roughly 100 single byte symbols in the input. He intends to use 7 bits to represent them and use the remaining 30 or so possible 7 bits to encode longer byte sequences. His problem is identifying which ones are most advantageous

Cheers - L~R


In reply to Dynamically Updating Frequency Analysis by Limbic~Region

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others lurking in the Monastery: (8)
    As of 2014-07-12 05:28 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      When choosing user names for websites, I prefer to use:








      Results (238 votes), past polls