Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

We note that HE appears 3 times and will give us the largest compression (I have used # to indicate the data has already been compressed).

But that still doesn't make sense to me. Why would you encode all the "HE"'s when "THEN" is also in your dictionary? Of the choices you could make here, that seems the strangest option. For instance if you were taking that approach I would expect you to encode from longest to shortest first. In which case I would expect the THEN to match first, and then HE to match twice. Which is actually the optimal match for this string and dictionary. IE: "THE HEN THEN" would turn into "T" "HE" "HE" "N" "THEN", which at 5 symbols is shorter than the 7 that makes up "T" "HE" "HE" "N" "T" "HE" "N".

Also consider that encoding a single THEN wins you more than encoding 3 HE's with the type of scheme you have (4:1 versus 2:1), which is not a problem you would have with huffman encoding where the bit sequence for HE might be shorter than that of THEN. Perhaps you need to account for the length of the string being replaced when you do your frequency analysis.

Anyway, there are other issues here. Once you allow overlapping strings you have a problem that is much harder. A simple greedy algorithm will not necessarily find the optimal encoding of the string. For that you have to do exhaustive search. Luckily greedy solutions tend to be "close enough" most of the time.

If I may be so bold, encouraging people to try to implement algorithms without any domain research is not a bad idea as the starting point of a learning exercise. But I think you should follow up with some proper study of some important algorithms. In this case I would suggest really studying Huffman encoding and LZ. Both are straight forward to implement in perl, and Huffman encoding in particular is nearly trivial to implement in Perl. Here is the tree construction algorithm:

sub build_huffman_tree { my ($dict_hash)= @_; # @node constains either: # leafs: [ $non_ref_symbol, $count ] # nodes: [ [ $left_node, $right_node ], $left_count + $right_count ] +; # note we assume the counts in the hash are > 0. my @nodes= map { [ $_, $dict_hash->{$_} ] } keys %$dict_hash; while (@nodes > 1) { @nodes= sort { $b->[1] <=> $a->[1] } @nodes; my $right= pop @nodes; my $left= pop @nodes; push @nodes, [ [ $left, $right ], $left->[1] + $right->[1] ]; } return $nodes[0]; }

Doing the rest is a good exercise for your people and will teach them more than struggling with their own algorithms without understanding these ones first.


In reply to Re^3: Dynamically Updating Frequency Analysis by demerphq
in thread Dynamically Updating Frequency Analysis by Limbic~Region

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or or How to display code and escape characters are good places to start.
Log In?

What's my password?
Create A New User
Domain Nodelet?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (2)
As of 2022-05-26 04:15 GMT
Find Nodes?
    Voting Booth?
    Do you prefer to work remotely?

    Results (93 votes). Check out past polls.