### Re^2: Dynamically Updating Frequency Analysis

by Limbic~Region (Chancellor)
 on May 09, 2013 at 18:01 UTC ( #1032809=note: print w/replies, xml ) Need Help??

demerphq,
Nice speaking with you again as well. I completley understand how confusing this must be to everyone else since I am having a hard time keeping it straight myself. I have multiple people on my team and they are each coming up with different ideas and I have my own. They are overlapping.

The reference to the Huffman Coding and "plugging holes" has nothing to do with this post - it deals with Challenge: Optimal Animals/Pangolins Strategy. You are absolutely correct about there not being any holes. The holes come in when you don't "properly construct" the tree. Somewhere in one of these threads, I mentioned that he is populating the tree top down instead of bottom up. I read somewhere that Huffman's professors had the same problem - it isn't perfect. I am sure this has confused some people because I have continued to refer to holes and Huffman Coding together rather than saying Huffman Coding-ish. I originally didn't mention Huffman Coding at all because that wasn't the road I was trying to go down - it just seemed like they were equivalent solutions - they are not.

The reason this thread doesn't look like it has anything to do with Huffman coding is because it isn't. This guy's idea was to map all 96 symbols to 7 bits and use the 32 remaining 7 bit sequences to encode some popular multi-character tuples.

You asked in reference to THEN and HE:

It is not clear why this is so. Is this because you aren't allowed overlapping sequences? Or some other reason. Can you explain more?

It is because HE is a complete substring of THEN. Once HE no longer exists anywhere in the file, there will no longer be any THEN to replace. Imagine the following dictionary:

```0000000 = T
0000001 = H
0000010 = E
0000011 = N
0000100 = HE
0000101 = THEN
0000110 = <space>
Now imagine the file we are compressing contains
```THE HEN THEN
We note that HE appears 3 times and will give us the largest compression (I have used # to indicate the data has already been compressed).
```T# #N T#N

There is no longer a THEN to consider.

Anyway, it sounds like this problem is NP hard

It certainly does which is why I think a heuristic solution would work. I just need to think about BrowserUk's concerns.

Cheers - L~R

Replies are listed 'Best First'.
Re^3: Dynamically Updating Frequency Analysis
by demerphq (Chancellor) on May 10, 2013 at 06:44 UTC

We note that HE appears 3 times and will give us the largest compression (I have used # to indicate the data has already been compressed).

But that still doesn't make sense to me. Why would you encode all the "HE"'s when "THEN" is also in your dictionary? Of the choices you could make here, that seems the strangest option. For instance if you were taking that approach I would expect you to encode from longest to shortest first. In which case I would expect the THEN to match first, and then HE to match twice. Which is actually the optimal match for this string and dictionary. IE: "THE HEN THEN" would turn into "T" "HE" "HE" "N" "THEN", which at 5 symbols is shorter than the 7 that makes up "T" "HE" "HE" "N" "T" "HE" "N".

Also consider that encoding a single THEN wins you more than encoding 3 HE's with the type of scheme you have (4:1 versus 2:1), which is not a problem you would have with huffman encoding where the bit sequence for HE might be shorter than that of THEN. Perhaps you need to account for the length of the string being replaced when you do your frequency analysis.

Anyway, there are other issues here. Once you allow overlapping strings you have a problem that is much harder. A simple greedy algorithm will not necessarily find the optimal encoding of the string. For that you have to do exhaustive search. Luckily greedy solutions tend to be "close enough" most of the time.

If I may be so bold, encouraging people to try to implement algorithms without any domain research is not a bad idea as the starting point of a learning exercise. But I think you should follow up with some proper study of some important algorithms. In this case I would suggest really studying Huffman encoding and LZ. Both are straight forward to implement in perl, and Huffman encoding in particular is nearly trivial to implement in Perl. Here is the tree construction algorithm:

```sub build_huffman_tree {
my (\$dict_hash)= @_;
# @node constains either:
# leafs: [ \$non_ref_symbol,             \$count ]
# nodes: [ [ \$left_node, \$right_node ], \$left_count + \$right_count ]
+;
# note we assume the counts in the hash are > 0.
my @nodes= map { [ \$_, \$dict_hash->{\$_} ] } keys %\$dict_hash;
while (@nodes > 1) {
@nodes= sort { \$b->[1] <=> \$a->[1] } @nodes;
my \$right= pop @nodes;
my \$left= pop @nodes;
push @nodes, [ [ \$left, \$right ], \$left->[1] + \$right->[1] ];
}
return \$nodes[0];
}

Doing the rest is a good exercise for your people and will teach them more than struggling with their own algorithms without understanding these ones first.

---
\$world=~s/war/peace/g

demerphq,
Unfortunately, life has side tracked me and I don't have time to respond fully but I did want to leave enough details here so that when/if I get a chance, I will have enough context to respond appropriately.

But that still doesn't make sense to me. Why would you encode all the "HE"'s when "THEN" is also in your dictionary?

In my example, it turns out to be a tie. Compressing the 3 instances of 'HE' as step 1 reduces the string from 12 characters to 9. If instead, as step 1 'THEN' was chosen it would still go from 12 characters down to 9. I should have chosen a non-tie example but the idea was that at that point in time, compressing 3 instances of 'HE' was supposed to be better than 1 instance of 'THEN'

I think to understand how the person arrived at this decision (and why it makes sense to him), you need to follow his logic.

1. Determine the number of distinct 1 byte characters that appear in the file - 96
2. Use only as many bits as necessary to encode the full range of characters - 2^6 = 64 < 96 < 128 = 2^7
3. Use the unused 7 bit sequences to encode common multi-byte sequences - he intended to sample the whole file
4. Compress the file by in a recursive fashion
1. Choose the sequence that results in maximal compression = \$bytes_prior_to_compression - (\$num_of_chars_to_encode * \$occurrences) + \$occurrences
2. Encode those bytes
3. Update the frequency data (without re-sampling) and go back to step 1

Also consider that encoding a single THEN wins you more than encoding 3 HE's with the type of scheme you have (4:1 versus 2:1)

Actually, no. Because there is only 1 instance of THEN, encoding it has reduced the file by 3 characters. While it is true that encoding 3 instances of HE also only reduces the file by 3 characters, the example probably should have been 'THE HEN THEN HE HE HE'

Luckily greedy solutions tend to be "close enough" most of the time.

He gave up and went with a heuristic approach that didn't require recalculating the frequency distributions. I didn't see his original approach as viable but I wanted to help him explore it as fully as possible.

If I may be so bold, encouraging people to try to implement algorithms without any domain research is not a bad idea as the starting point of a learning exercise. But I think you should follow up with some proper study of some important algorithms.

That is exactly what this exercise was about. Not everyone agrees with my philosophy of doing things the wrong way to appreciate the right way later.

Cheers - L~R

Create A New User
Node Status?
node history
Node Type: note [id://1032809]
help
Chatterbox?
and a moth chases the moon...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (5)
As of 2018-04-19 21:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
My travels bear the most uncanny semblance to ...

Results (75 votes). Check out past polls.

Notices?