
Re^3: Dynamically Updating Frequency Analysis

by BrowserUk (Pope)
on May 08, 2013 at 17:27 UTC ( #1032656=note )

in reply to Re^2: Dynamically Updating Frequency Analysis
in thread Dynamically Updating Frequency Analysis

If you haven't seen my response here, it may be worth a read.

Interesting. What your guy seems to be trying to do reminds me of three months of wasted work a few years ago.

The specification for a proprietary transmission protocol (over RS232) called for each block to carry a checksum. But the bright spark that specified it decided that the checksum should be calculated such that it included its own bytes. Thus a packet looked like this:

# payload                                   checksum
  xx xx xx xx xx xx xx xx xx xx ... xx xx xx   cc cc cc cc

The idea was that when verifying the checksum upon receipt, the entire packet (including checksum) could be passed to the checksum algorithm and the result compared against the unsigned integer in the last four bytes.

Problem: The very thing that makes cryptographic digests of plain text messages work -- the insane combinatorial explosion of work required to find the appropriate padding to make the fake message match the digest -- makes this almost impossible to calculate.
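To see why: here is a minimal sketch of the fixed-point search such a spec effectively demands. I'm using Digest::MD5 and a 1-byte checksum field purely for illustration; the real spec used 4 bytes and its own algorithm.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Brute-force search for a self-including checksum: find a trailing
# byte cc such that the first byte of md5(payload . cc) equals cc.
# With a 1-byte field this is 256 trials; with the 4-byte field of
# the spec it becomes a ~2**32 expected search -- per packet.
sub find_fixed_point {
    my ($payload) = @_;
    for my $cc (0 .. 255) {
        my $first = unpack 'C', md5($payload . chr $cc);
        return $cc if $first == $cc;
    }
    return undef;    # no fixed point exists for this payload
}
```

And note that even the 1-byte version only has about a 63% chance (1 - 1/e) that a fixed point exists at all for a given payload.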

I suspect your guy's problem suffers similarly.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^4: Dynamically Updating Frequency Analysis
by Limbic~Region (Chancellor) on May 08, 2013 at 19:04 UTC
    The closest thing to that problem I have had to deal with is where the first few bytes of a message told how long the message was, but had to include themselves in the length calculation. For illustration purposes, let's say it would just be the number of bytes, followed by a capital X, followed by the payload.

    If the payload was 98 bytes and we add the letter X, we get 99, so we start out by saying '99X..'. But by adding 99 to the message we added two more bytes, which made it '101X..', and because 101 is longer than 99 by 1 byte, the final value was 102X. This was a simple problem to solve, though I never did understand why the length had to include itself.
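That converging calculation can be sketched directly (the sub name and interface are mine, not anything from the actual protocol):

```perl
use strict;
use warnings;

# Iterate the self-including length to a fixed point: the field must
# count the payload, the 'X' separator, and its own digits.
sub self_length {
    my ($payload_len) = @_;
    my $len = $payload_len + 1;    # payload + 'X'; digits not counted yet
    while (1) {
        my $with_digits = $payload_len + 1 + length $len;
        return $len if $with_digits == $len;
        $len = $with_digits;
    }
}
```

For the 98-byte payload above this converges 99 -> 101 -> 102, matching the '102X' in the text. Unlike the checksum case, each step can only nudge the total by the width of the digit field, so it settles almost immediately.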

    Getting back to the OP, here is the approach I have been playing with in my head. There will be 96 characters (Unix newline plus ASCII 32 - ASCII 126). This means a 1/8 reduction simply by going to 7 bit characters and leaves room for 32 multi-character tuples.
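The 7-bit packing itself is straightforward with pack's bit-string template; a sketch (the code assignments -- 0..95 for the literal characters, 96..127 for tuples -- are my illustrative assumption):

```perl
use strict;
use warnings;

# Pack a list of 7-bit codes (0..127) into a byte string, and unpack
# them again.  Codes 0..95 would stand for newline plus ASCII 32-126;
# codes 96..127 are the 32 free slots for multi-character tuples.
sub pack7 {
    my (@codes) = @_;
    return pack 'B*', join '', map { sprintf '%07b', $_ } @codes;
}

sub unpack7 {
    my ($bytes, $count) = @_;   # $count is needed: padding bits look like code 0
    my $bits = unpack 'B*', $bytes;
    return map { oct('0b' . substr($bits, $_ * 7, 7)) } 0 .. $count - 1;
}
```

Eight codes pack into exactly 7 bytes, which is where the 1/8 reduction comes from.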

    Rather than keeping track of each tuple separately, keep track of data structures of size 3 (originally I thought 4 could fit into memory but I am not so sure). Single characters are ignored because they will be encoded using the 7 bit method.

    HER (count 137): HE, ER
        Points To:    ERE (count 100), ERM (count 27), RED (count 18)
        Pointed From: IHE (count 11), HEH (count 2)

    Once the data structure is created, get the frequency of all the 2-character tuples (this requires traversing the data structure). With this data in hand, use a heuristic algorithm to pick an approach. You could limit how long you look based on time, or number of iterations, or whatnot. Pick the tuple (either 2 or 3 characters) that saves the most space. There should be enough data in the tree to measure the impact of every choice and recalculate frequencies. Once 32 have been chosen, remember the choices and the total size reduction, and restart by picking the 2nd-highest choice first.
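A crude single-pass sketch of that picking loop (this flattens the tree into brute-force recounting over the text, and the placeholder byte is my stand-in for an assigned code -- purely illustrative, with overlap handling and the restart strategies left out):

```perl
use strict;
use warnings;

# One greedy pass: repeatedly count all 2- and 3-character tuples,
# take the one whose replacement saves the most characters, and
# substitute a placeholder byte for it (standing in for one of the
# 32 spare codes).  Recounting after each pick captures the
# cascading-frequency effect discussed above.
sub pick_tuples {
    my ($text, $slots) = @_;
    my @picked;
    while (@picked < $slots) {
        my %count;
        for my $len (2, 3) {
            $count{ substr $text, $_, $len }++ for 0 .. length($text) - $len;
        }
        my ($best) = sort {
            $count{$b} * (length($b) - 1) <=> $count{$a} * (length($a) - 1)
                || $a cmp $b
        } keys %count;
        last if !defined $best or $count{$best} < 2;
        push @picked, $best;
        $text =~ s/\Q$best\E/\x01/g;   # stand-in for the assigned code
    }
    return @picked;
}
```

An (n)-character tuple replaced by a single code saves (n-1) characters per occurrence, which is the gain the sort maximizes.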

    Does this make sense? It does in my head but I have been struggling with articulating ideas lately.

    Cheers - L~R

      Does this make sense?

      Yes, (sort of :), but I still think that you are on a hiding to nothing with it. The problem is that each time you change things, everything changes and must be recalculated.

      But worse, you may well find yourself in an infinite loop. That is: you make a change that pushes one thing out in favour of another; but when you recalculate, the numbers flip the other way and favour bringing back the one you just kicked out, in place of the one you just added.

      That would be relatively easy to detect if the periodicity is 2 or 3; but with 32 slots to fill, the periodicity could theoretically be 31!

      You mention a time limit; but what would you set it to? Too short and you'll leave most of your 32 slots unfilled a lot of the time; too long and you're burning CPU for no benefit.

      The beauty of LZW is that it processes the input as a stream; never going back to try and apply the latest changes to the earlier part of the data. Whilst that has the disadvantage that the early part of the stream may not be as compressed as it might be; it has the major advantage of being a time-efficient, deterministic process.
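For reference, that streaming behaviour in a nutshell -- a minimal LZW encoder (codes only; the bit-packing of the output is omitted):

```perl
use strict;
use warnings;

# Minimal LZW encoder, to illustrate the streaming property: the
# dictionary grows as a side effect of a single forward pass, and
# earlier input and output are never revisited.
sub lzw_compress {
    my ($data) = @_;
    my %dict = map { chr($_) => $_ } 0 .. 255;
    my $next = 256;
    my $w = '';
    my @out;
    for my $c (split //, $data) {
        if (exists $dict{ $w . $c }) {
            $w .= $c;                       # extend the current match
        }
        else {
            push @out, $dict{$w};           # emit code for longest match
            $dict{ $w . $c } = $next++;     # learn the new string
            $w = $c;
        }
    }
    push @out, $dict{$w} if length $w;
    return @out;
}
```

On 'ABABAB' this emits (65, 66, 256, 256): the second half of the input already benefits from the entry learned in the first half, with no back-tracking.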

      Going back to a recent discussion regarding P v NP; it is 'good enough' when seeking perfection might never complete.

        I agree that the changes are cascading and that remaining frequencies will need to be recalculated but I disagree that I will find myself in an infinite loop. Perhaps we are not thinking of it the same way or perhaps I am not seeing what you are seeing.

        Once the data structure has been built, I am going to use it a number of times to determine the best compression I can, using some boundaries. The first time will set the high-water mark. Pick the item that gives the maximum compression and remove it from the list - which in turn removes some other items completely while reducing others. Pick the next highest on the list and repeat. Stop when 32 items have been chosen. Now, start over (the tree is completely re-initialized to its original form). This time, use an alternate picking strategy - perhaps you alternate between the top of the list on odd-numbered turns and the second highest on even-numbered turns. If after selecting 32 items you have reached a higher compression than the previous best, it becomes your new high-water mark.

        As far as how long (assuming a time based approach rather than a fixed number of iterations) - I see that as being a command line option. Just as gzip has a -1 through -9 to trade time for compression, this may be a way to let the user choose.

        I really like the adaptive version of LZW. If you fill up your dictionary and the portion of the file you are compressing currently isn't benefiting much from it - dump the dictionary and start over.
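That reset-on-full variant can be bolted onto a minimal LZW encoder like so (the CLEAR and MAX_CODE values are illustrative choices, not any particular format's):

```perl
use strict;
use warnings;

# LZW with the dictionary reset described above: when the table fills,
# emit a CLEAR code and start a fresh dictionary, so a later part of
# the file that looks nothing like the earlier part isn't stuck with
# stale entries.
use constant { CLEAR => 256, MAX_CODE => 4096 };

sub lzw_compress_reset {
    my ($data) = @_;
    my (%dict, $next);
    my $reset = sub { %dict = map { chr($_) => $_ } 0 .. 255; $next = 257 };
    $reset->();
    my $w = '';
    my @out;
    for my $c (split //, $data) {
        if (exists $dict{ $w . $c }) { $w .= $c; next }
        push @out, $dict{$w};
        if ($next >= MAX_CODE) {       # table full: dump it, start over
            push @out, CLEAR;
            $reset->();
        }
        else {
            $dict{ $w . $c } = $next++;
        }
        $w = $c;
    }
    push @out, $dict{$w} if length $w;
    return @out;
}
```

The decoder sees the CLEAR code and throws away its copy of the dictionary at the same point, so the two sides stay in step.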

        Cheers - L~R
