|We don't bite newbies here... much|
We note that HE appears 3 times and will give us the largest compression (I have used # to indicate the data has already been compressed).
But that still doesn't make sense to me. Why would you encode all the "HE"'s when "THEN" is also in your dictionary? Of the choices you could make here, that seems the strangest option. For instance if you were taking that approach I would expect you to encode from longest to shortest first. In which case I would expect the THEN to match first, and then HE to match twice. Which is actually the optimal match for this string and dictionary. IE: "THE HEN THEN" would turn into "T" "HE" "HE" "N" "THEN", which at 5 symbols is shorter than the 7 that makes up "T" "HE" "HE" "N" "T" "HE" "N".
Also consider that encoding a single THEN wins you more than encoding 3 HE's with the type of scheme you have (4:1 versus 2:1), which is not a problem you would have with huffman encoding where the bit sequence for HE might be shorter than that of THEN. Perhaps you need to account for the length of the string being replaced when you do your frequency analysis.
Anyway, there are other issues here. Once you allow overlapping strings you have a problem that is much harder. A simple greedy algorithm will not necessarily find the optimal encoding of the string. For that you have to do exhaustive search. Luckily greedy solutions tend to be "close enough" most of the time.
If I may be so bold, encouraging people to try to implement algorithms without any domain research is not a bad idea as the starting point of a learning exercise. But I think you should follow up with some proper study of some important algorithms. In this case I would suggest really studying Huffman encoding and LZ. Both are straight forward to implement in perl, and Huffman encoding in particular is nearly trivial to implement in Perl. Here is the tree construction algorithm:
Doing the rest is a good exercise for your people and will teach them more than struggling with their own algorithms without understanding these ones first.