Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Re: Compressing a set of integers

by PhiRatE (Monk)
on Jan 26, 2003 at 02:04 UTC ( [id://229921]=note: print w/replies, xml ) Need Help??


in reply to Compressing a set of integers

since someone else has already posted a reasonable solution, I'll add a couple of notes:

When discussing numbers, it is important to note the boundary. You said integer, but are negatives possible? is the range outside the 16bit signed limit? 32bit signed limit? is there any limit at all?

When looking for performance solutions, you should also provide information on the platform. Do you need portability? are you on a win32 or unix platform? are you capable of understanding C code? (an Inline::C function would make short work of your problem)

And if possible, providing context is valuable. There have been numerous times when people have posted a very limited description, asking a very targetted question, and it was found that the entire operation could be avoided by use of a different overall strategy. Context helps experienced monks identify those situations.

Replies are listed 'Best First'.
More details on compressing a set of integers
by toma (Vicar) on Jan 26, 2003 at 05:10 UTC
    The numbers represent an index into another array. The numbers are positive. I could offset the numbers by a fixed amount if it would help somehow.

    I have about 50,000 numbers today, but I don't want the code to have a 16 bit limit since I expect the dataset to grow.

    I want to be able to store the data with Storable, and I think that it will not support a custom Inline::C representation. If there is another high-performance way to store the data that can take advantage of Inline::C, then that would be a candidate.

    My platform is somewhat crufty. It is HP-UX 10.20 for the production web server and linux as the data generator. For generality and future-proofing, I prefer portable code, but this is not a hard requirement.

    So far, my main ideas are:

    • Translate the numbers to base64 to make them more compact in string form.
    • Reorder the data so that more common numbers are smaller and are therefore shorter in string form.
    • Create and optimize a state machine to make a real compressor tuned for this special set of characters.
    I have been reluctant to use our database for this task because the lists of numbers can get quite long. I have a VARCHAR limit of 255 characters, and I don't want to get into BLOBs. I could use another database such as SQLite which uses strings as the datatype, but I have almost no experience with this type of solution. I do plan on implementing a solution (for the larger problem) in SQLite for comparison purposes.

    The larger problem is that I am investigating speed versus memory tradeoffs in on-line queries involving a many-to-many relationship. I have an *extremely* fast solution using Table::ParentChild, but it cannot be stored using the Storable module. This is because Table::ParentChild uses an XS data structure, which represents each parent-child relationship in four 32 bit integers. I can sacrifice a little speed and use less memory. To this end I have coded another very fast solution using hashes and arrays, and most of the memory it uses are in these strings of numbers. I think that the strings are still wasteful, since they represent only about 3 bits of information per digit and one bit or so of information per delimiter character.

    I hope to write more about the larger problem in the future, but for now I find that the core of my solution relies on these strings of numbers. The compression problem sounded familiar, so I hoped that someone would have some ideas.

    I have run across this problem before, where I wanted compression, but I have a line-oriented file and I want to be able to decompress just a single line at a time, when this line is located anywhere in the file. It seems like an ordinary problem in compression, but after searching for a few hours I didn't find anything, so I asked first in chat and then here.

    Update:
    I have found the classification of the larger problem that I am working on. It is a graph problem. My graphs are sparse and large, so I am using an adjacency list. The sets that I am compressing are rows in the adjacency list. This is described in chapter 8 of Mastering Algorithms with Perl, where the Graph modules are described.

    It should work perfectly the first time! - toma

      Its worth pointing out that attempting to compress a string containing a list of comma delimited ascii numbers using Base64, will make the string grow in size, not shrink.

      Eg. A string containing 1000 5 digit numbers separated by commas, is 6000 bytes. Encode that base64 and you get a string nearly 8K in length.

      Pack the same string using 16-bits and you get 2000 bytes.

      If you really need to make it smaller still, then you might consider using Compress::Zlib::memGzip which compresses the same ascii list into 1800 bytes.

      There are caveats associated with this 10% reduction in size. A) your compression will vary with the data set. B) Compress::Zlib::memGunzip is 10 to 15 times slower than unpack.

      If your main concern is that using pack & unpack with format 'S*', will limit you to 16-bit numbers, then move to 32-bit using 'L*'. This will pack the same 1000 numbers into 4000 bytes which makes it look not such a good option until you think about what happens when you start using 6-digit numbers.

      The string containing 100,000 .. 101,000, comma delimited ascii is 7000 bytes. Base64:9467, zipped:1817, packed 32-bit 4000. Again, Zlib is looking good, but now the figures for a random mixed of 1 to 6-digit numbers

      ascii:6458, base64:8726, zipped:3169, pack 32-bit:4000.

      Suddenly the trade for decompression speed looks less valuable, and if you move to a random mix of 8-digit numbers, Zlib starts averaging 4500 bytes for 1000 numbers, pack remains at 4000 bytes.

      Finally, with Zlib, there is a heavy penalty for small lists. Lists of 1 x 5-digits average around 25 bytes; 4 x 5-digits numbers around 45 bytes and 4 x 8-digit numbers around 75 bytes. The corresponding pack'd lists come out at 2/8 for 'S*' and 4/16/16 for 'L*'.


      Examine what is said, not who speaks.

      The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

        Sorry, I should have mentioned that I didn't mean to use the Base64 module, I was considering writing the numbers in base 64, so that when I run out of 0-9 I go through a-z then A-Z, etc, so I use more of the bits of the characters, and use fewer digits to represent the same number.

        The penalty for compression on small strings is the dictionary that gets set up for the string. If I use a fixed dictionary, this shouldn't be a problem. So I need a customized version of the compression code.

        With compression, I shouldn't need to use base 64, and I'm guessing that I will get about a 60% reduction in size.

        Another optimization I thought of was to change the numbers to represent the differences between the numbers instead of the numbers themselves. So instead of

        1,10,45,50,120
        I would have
        1,9,35,5,70

        It should work perfectly the first time! - toma

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://229921]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (9)
As of 2024-03-19 08:43 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found