Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Compressing a set of integers

by toma (Vicar)
on Jan 25, 2003 at 04:30 UTC ( #229792=perlquestion: print w/replies, xml ) Need Help??

toma has asked for the wisdom of the Perl Monks concerning the following question:

I have a large array that contains sets of integers.
array[0]="1,100,105,1003"; array[1]="42,314,2718,30302"; etc...
I would like to save RAM with a more efficient representation of these sets. Here are a few facts about the numbers:
  • The numbers do not repeat
  • The numbers represent an index into another array
  • I can reorder the data if I really need to
  • Some numbers are much more common than others
  • I don't care much about how long it takes to compress the data
  • I would like to decompress the data relatively quickly
  • In some cases there are thousands of entries per row, other cases only one.
The best answer might be to just use a database. I was thinking about a RAM-based solution in an attempt to get blazing speed in a mod_perl application. It would be nice if I could look up and decode about 100 of these rows in 0.05 seconds or so.

I would also like to be able to load these numbers into RAM relatively quickly, so I would like a data structure that can be stored and restored with Storable or some other quick method.

I am looking for ideas, pointers to algorithms, modules, or code. It seems like a fairly fundamental area of computer science, but I don't know what this area is called.

Thanks!

It should work perfectly the first time! - toma

Replies are listed 'Best First'.
Re: Compressing a set of integers
by BrowserUk (Patriarch) on Jan 25, 2003 at 05:33 UTC

    Just storing your number in binary rather than ascii with delimiters would cut your memory requirements to about 1/3 on average. Assuming that your example is representative and your numbers will fit in 16-bits. You mentioned the figures of 50,000 sets and assuming 4 per set as shown.

    Instead of approx 1 MB, your down to 390k, ignoring the cost of the arrays which should be constant That's a start.

    Projecting the timing of the code below gives around 3 seconds to pack the 50k x 4 x 16 bits and 0.002 secs to unpack 100 lines, which comfortably fits your spec, and my machine is hardly the quickest around.

    #! perl -slw use strict; use Benchmark::Timer; my $t= Benchmark::Timer->new(); my (@ary1, @ary2); #! 1000 lines of 4 random numbers (0-65536) in ascii with commas. @ary1 = map{ join ',', map{ int rand(2**16) } 1..4 } 1 .. 1000; print $ary1[0]; $t->start('packing'); #! The same numbers stored as binary strings @ary2 = map{ pack 'S4', split',', $_ } @ary1; $t->stop('packing'); print length $ary2[0]; print '1000 in ascii: ', scalar @ary1, ' : ', length "@ary1"; print '1000 in binary: ', scalar @ary2, ' : ', length "@ary2"; $t->start('unpacking'); for (@ary2) { my @line_of_nums = unpack 'S4', $_; # print "@line_of_nums"; } $t->stop('unpacking'); $t->report(); __END__ C:\test>229792 52318,4206,9238,48758 8 1000 in ascii: 1000 : 22310 1000 in binary: 1000 : 8000 1 trial of packing (50ms total) 1 trial of unpacking (20ms total)

    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

      This is good, but since he said the number of entries per row may vary, he should pack "S*" instead of pack "S4"
Re: Compressing a set of integers
by bart (Canon) on Jan 27, 2003 at 01:34 UTC
    You say you don't know if your numbers will fit into 16 bit, and using 32 bit per number simply doubles storage space.

    May I point you to the fact that there is an option in pack/unpack to store numbers with built in compression? They're called "BER-compressed integers" (see perlfunc:pack — use the "w" template), and they use a few bytes as possible. They're slightly less efficient in storage on the maximum number, as only 7 bits per byte are significant. But on the plus side: they don't waste any bytes they don't need, so for not so extreme values, the waste is no more than with the fixed length representation, and often less.

    Experimentally I've determined the following ranges for the byte count:

    rangebyte count
    0 - 1271
    128 - 163832
    16384 - 20971513
    2097152 - 2684354554
    268435456 - 4294967295 (?)5

    The range for 5 bytes runs at least up to 4294967295, i.e. 2**32-1, but above that, something goes wrong, at least on Indigoperl 5.6.1/Win32. Suddenly, I get a value for 2**32 that doesn't make any sense: it takes 10 bytes, for a start, while 2**32-1 only takes 5. In theory, there's no upper limit for which you could use this kind of representation, so I have no idea on what's going on.

    That problem aside, here's an extra way you can save on space: sort the numbers in ascending order, and store the difference between successing numbers — the first value being the smallest number itself, the difference from zero. The more numbers you have, the closer they'll be together, and that way, you might shave of a few bytes.

Re: Compressing a set of integers
by PhiRatE (Monk) on Jan 26, 2003 at 02:04 UTC
    since someone else has already posted a reasonable solution, I'll add a couple of notes:

    When discussing numbers, it is important to note the boundary. You said integer, but are negatives possible? is the range outside the 16bit signed limit? 32bit signed limit? is there any limit at all?

    When looking for performance solutions, you should also provide information on the platform. Do you need portability? are you on a win32 or unix platform? are you capable of understanding C code? (an Inline::C function would make short work of your problem)

    And if possible, providing context is valuable. There have been numerous times when people have posted a very limited description, asking a very targetted question, and it was found that the entire operation could be avoided by use of a different overall strategy. Context helps experienced monks identify those situations.

      The numbers represent an index into another array. The numbers are positive. I could offset the numbers by a fixed amount if it would help somehow.

      I have about 50,000 numbers today, but I don't want the code to have a 16 bit limit since I expect the dataset to grow.

      I want to be able to store the data with Storable, and I think that it will not support a custom Inline::C representation. If there is another high-performance way to store the data that can take advantage of Inline::C, then that would be a candidate.

      My platform is somewhat crufty. It is HP-UX 10.20 for the production web server and linux as the data generator. For generality and future-proofing, I prefer portable code, but this is not a hard requirement.

      So far, my main ideas are:

      • Translate the numbers to base64 to make them more compact in string form.
      • Reorder the data so that more common numbers are smaller and are therefore shorter in string form.
      • Create and optimize a state machine to make a real compressor tuned for this special set of characters.
      I have been reluctant to use our database for this task because the lists of numbers can get quite long. I have a VARCHAR limit of 255 characters, and I don't want to get into BLOBs. I could use another database such as SQLite which uses strings as the datatype, but I have almost no experience with this type of solution. I do plan on implementing a solution (for the larger problem) in SQLite for comparison purposes.

      The larger problem is that I am investigating speed versus memory tradeoffs in on-line queries involving a many-to-many relationship. I have an *extremely* fast solution using Table::ParentChild, but it cannot be stored using the Storable module. This is because Table::ParentChild uses an XS data structure, which represents each parent-child relationship in four 32 bit integers. I can sacrifice a little speed and use less memory. To this end I have coded another very fast solution using hashes and arrays, and most of the memory it uses are in these strings of numbers. I think that the strings are still wasteful, since they represent only about 3 bits of information per digit and one bit or so of information per delimiter character.

      I hope to write more about the larger problem in the future, but for now I find that the core of my solution relies on these strings of numbers. The compression problem sounded familiar, so I hoped that someone would have some ideas.

      I have run across this problem before, where I wanted compression, but I have a line-oriented file and I want to be able to decompress just a single line at a time, when this line is located anywhere in the file. It seems like an ordinary problem in compression, but after searching for a few hours I didn't find anything, so I asked first in chat and then here.

      Update:
      I have found the classification of the larger problem that I am working on. It is a graph problem. My graphs are sparse and large, so I am using an adjacency list. The sets that I am compressing are rows in the adjacency list. This is described in chapter 8 of Mastering Algorithms with Perl, where the Graph modules are described.

      It should work perfectly the first time! - toma

        Its worth pointing out that attempting to compress a string containing a list of comma delimited ascii numbers using Base64, will make the string grow in size, not shrink.

        Eg. A string containing 1000 5 digit numbers separated by commas, is 6000 bytes. Encode that base64 and you get a string nearly 8K in length.

        Pack the same string using 16-bits and you get 2000 bytes.

        If you really need to make it smaller still, then you might consider using Compress::Zlib::memGzip which compresses the same ascii list into 1800 bytes.

        There are caveats associated with this 10% reduction in size. A) your compression will vary with the data set. B) Compress::Zlib::memGunzip is 10 to 15 times slower than unpack.

        If your main concern is that using pack & unpack with format 'S*', will limit you to 16-bit numbers, then move to 32-bit using 'L*'. This will pack the same 1000 numbers into 4000 bytes which makes it look not such a good option until you think about what happens when you start using 6-digit numbers.

        The string containing 100,000 .. 101,000, comma delimited ascii is 7000 bytes. Base64:9467, zipped:1817, packed 32-bit 4000. Again, Zlib is looking good, but now the figures for a random mixed of 1 to 6-digit numbers

        ascii:6458, base64:8726, zipped:3169, pack 32-bit:4000.

        Suddenly the trade for decompression speed looks less valuable, and if you move to a random mix of 8-digit numbers, Zlib starts averaging 4500 bytes for 1000 numbers, pack remains at 4000 bytes.

        Finally, with Zlib, there is a heavy penalty for small lists. Lists of 1 x 5-digits average around 25 bytes; 4 x 5-digits numbers around 45 bytes and 4 x 8-digit numbers around 75 bytes. The corresponding pack'd lists come out at 2/8 for 'S*' and 4/16/16 for 'L*'.


        Examine what is said, not who speaks.

        The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: Compressing a set of integers
by I0 (Priest) on Jan 26, 2003 at 16:47 UTC
    foreach( @array ){ my @a = sort{$a<=>$b}split/,/; my $d=0; for( @a ){ ($d,$_)=($_,$_-$d); } $_ = pack'w*',@a; }
Compressing a set of integers - Solution
by toma (Vicar) on Jan 27, 2003 at 05:14 UTC
    Many thanks for all the responses, especially I0 for the brilliant snippet of code and bart for the clear explanation of the benefits of pack with the "w" template. All of the responses were valuable; they really improved my understanding. I have implemented a solution along these lines and now I only need about 26% as much memory as I did with the ascii dataset. I am delighted with this solution!

    With this solution, I expect to save about 1.5 million numbers in about 2 megabytes of storage. Wow!

    Here is the proof of concept code that I ended up with:

    use Storable; use File::Flat; my @to_save; my $nrows=10000; my $rowmax=300; for (0..$nrows) { my @r; $r[0]=int(rand(20)); for (1..int(rand($rowmax))) { push @r, $r[$_-1] + 1 + int(rand(200)); } push @to_save, join ',',@r; } my $in=""; foreach( @to_save ) { $in .= $_."\n"; my @a = sort {$a<=>$b} split/,/; my $d=0; for( @a ){ ($d,$_)=($_,$_-$d); } $_ = pack'w*',@a; } open(TEXT, ">array.txt") or die "Can't open output file"; print TEXT $in; close TEXT; store (\@to_save, 'array.sto'); my $to_retrieve= retrieve ('array.sto'); my $out=""; foreach( @$to_retrieve ) { my @a = unpack'w*',$_; my $c = 0; foreach my $n (@a) { $c += $n; $out .= $c.","; } chop $out; $out .= "\n"; } if ($in eq $out) { print "OK\n" } my $sz_in= File::Flat->fileSize('array.txt') or die "Can't size array.txt"; my $sz_out= File::Flat->fileSize('array.sto') or die "Can't size array.sto"; print "Final size = ",int(100*$sz_out/$sz_in+0.5),"% of input\n";
    It should work perfectly the first time! - toma

      I'd seen, but never really taken any notice of the 'w' format for pack & unpack. Totally made for the job.

      Perhaps the only downside of using 'w' instead of 'S' or 'L' is that you need to unpack the whole string to get to an individual value. With 'S' or 'L' you could have indexed into the string with substr to read or write an individual value. How much of a penalty this is will depend on your application and whether you will always need to access all the values in any given string each time.

      Probably worth it if the savings on space are enough.


      Examine what is said, not who speaks.

      The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

        It wouldn't be necessary to use substr to extract a value in a buffer full of packed shorts or longs. The '@' code can be used to move to a specific offset in the buffer before extracting.

        --- print map { my ($m)=1<<hex($_)&11?' ':''; $m.=substr('AHJPacehklnorstu',hex($_),1) } split //,'2fde0abe76c36c914586c';

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://229792]
Approved by PodMaster
Front-paged by ibanix
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (4)
As of 2023-02-04 19:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    I prefer not to run the latest version of Perl because:







    Results (31 votes). Check out past polls.

    Notices?