Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
demerphq,
Unfortunately, life has side tracked me and I don't have time to respond fully but I did want to leave enough details here so that when/if I get a chance, I will have enough context to respond appropriately.

But that still doesn't make sense to me. Why would you encode all the "HE"'s when "THEN" is also in your dictionary?

In my example, it turns out to be a tie. Compressing the 3 instances of 'HE' as step 1 reduces the string from 12 characters to 9. If instead, as step 1 'THEN' was chosen it would still go from 12 characters down to 9. I should have chosen a non-tie example but the idea was that at that point in time, compressing 3 instances of 'HE' was supposed to be better than 1 instance of 'THEN'

I think to understand how the person arrived at this decision (and why it makes sense to him), you need to follow his logic.

  1. Determine the number of distinct 1 byte characters that appear in the file - 96
  2. Use only as many bits as necessary to encode the full range of characters - 2^6 = 64 < 96 < 128 = 2^7
  3. Use the unused 7 bit sequences to encode common multi-byte sequences - he intended to sample the whole file
  4. Compress the file by in a recursive fashion
    1. Choose the sequence that results in maximal compression = $bytes_prior_to_compression - ($num_of_chars_to_encode * $occurrences) + $occurrences
    2. Encode those bytes
    3. Update the frequency data (without re-sampling) and go back to step 1

Also consider that encoding a single THEN wins you more than encoding 3 HE's with the type of scheme you have (4:1 versus 2:1)

Actually, no. Because there is only 1 instance of THEN, encoding it has reduced the file by 3 characters. While it is true that encoding 3 instances of HE also only reduces the file by 3 characters, the example probably should have been 'THE HEN THEN HE HE HE'

Luckily greedy solutions tend to be "close enough" most of the time.

He gave up and went with a heuristic approach that didn't require recalculating the frequency distributions. I didn't see his original approach as viable but I wanted to help him explore it as fully as possible.

If I may be so bold, encouraging people to try to implement algorithms without any domain research is not a bad idea as the starting point of a learning exercise. But I think you should follow up with some proper study of some important algorithms.

That is exactly what this exercise was about. Not everyone agrees with my philosophy of doing things the wrong way to appreciate the right way later.

Cheers - L~R


In reply to Re^4: Dynamically Updating Frequency Analysis by Limbic~Region
in thread Dynamically Updating Frequency Analysis by Limbic~Region

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (5)
As of 2022-05-24 13:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Do you prefer to work remotely?



    Results (82 votes). Check out past polls.

    Notices?