toma has asked for the wisdom of the Perl Monks concerning the following question:
I have a large array that contains sets of integers.
array[0]="1,100,105,1003";
array[1]="42,314,2718,30302";
etc...
I would like to save RAM with a more efficient representation
of these sets. Here are a few facts about the numbers:
- The numbers do not repeat
- The numbers represent an index into another array
- I can reorder the data if I really need to
- Some numbers are much more common than others
- I don't care much about how long it takes to compress the data
- I would like to decompress the data relatively quickly
- In some cases there are thousands of entries per row;
in other cases, only one.
The best answer might be to just use a database. I was
thinking about a RAM-based solution in an attempt to get
blazing speed in a mod_perl application. It would be
nice if I could look up and decode about 100 of these rows
in 0.05 seconds or so.
I would also like to be able to load these numbers into
RAM relatively quickly, so I would like a data structure
that can be stored and restored with Storable or some other quick
method.
I am looking for ideas, pointers to algorithms, modules,
or code.
It seems like a fairly fundamental area of computer
science, but I don't know what this area is called.
Thanks!
It should work perfectly the first time! - toma
Re: Compressing a set of integers
by BrowserUk (Patriarch) on Jan 25, 2003 at 05:33 UTC
Just storing your numbers in binary rather than ascii with delimiters would cut your memory requirements to about a third on average, assuming that your example is representative and your numbers will fit in 16 bits. You mentioned a figure of 50,000 sets; assuming 4 numbers per set as shown, instead of approx 1 MB you're down to about 390k, ignoring the cost of the arrays themselves, which should be constant. That's a start.
Projecting the timing of the code below gives around 3 seconds to pack the 50k x 4 x 16 bits and 0.002 secs to unpack 100 lines, which comfortably fits your spec, and my machine is hardly the quickest around.
#! perl -slw
use strict;
use Benchmark::Timer;
my $t= Benchmark::Timer->new();
my (@ary1, @ary2);
#! 1000 lines of 4 random numbers (0-65535) in ascii with commas.
@ary1 = map{ join ',', map{ int rand(2**16) } 1..4 } 1 .. 1000;
print $ary1[0];
$t->start('packing');
#! The same numbers stored as binary strings
@ary2 = map{ pack 'S4', split',', $_ } @ary1;
$t->stop('packing');
print length $ary2[0];
print '1000 in ascii: ', scalar @ary1, ' : ', length "@ary1";
print '1000 in binary: ', scalar @ary2, ' : ', length "@ary2";
$t->start('unpacking');
for (@ary2) {
my @line_of_nums = unpack 'S4', $_;
# print "@line_of_nums";
}
$t->stop('unpacking');
$t->report();
__END__
C:\test>229792
52318,4206,9238,48758
8
1000 in ascii: 1000 : 22310
1000 in binary: 1000 : 8000
1 trial of packing (50ms total)
1 trial of unpacking (20ms total)
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.
This is good, but since he said the number of entries per row may vary, he should use pack 'S*' instead of pack 'S4'.
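A minimal sketch of the round trip with 'S*' (the row value is just an example in the same shape as the question's data):

```perl
# Variable-length rows round-trip with 'S*'; a fixed 'S4' would
# only handle rows of exactly four numbers.
my $row  = "42,314,2718,30302";
my $bin  = pack 'S*', split /,/, $row;   # 2 bytes per number, any count
my @nums = unpack 'S*', $bin;
print scalar @nums, " numbers in ", length $bin, " bytes\n";
```

The same template works on a one-number row and a thousand-number row alike, as long as every value fits in 16 bits.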
Re: Compressing a set of integers
by bart (Canon) on Jan 27, 2003 at 01:34 UTC
You say you don't know if your numbers will fit into 16 bits, and using 32 bits per number simply doubles the storage space.
May I point you to the fact that there is an option in pack/unpack to store numbers with built-in compression? They're called "BER-compressed integers" (see the "w" template under pack in perlfunc), and they use as few bytes as possible. They're slightly less efficient in storage at the maximum number, as only 7 bits per byte are significant. But on the plus side: they don't waste any bytes they don't need, so for not-so-extreme values the waste is no more than with the fixed-length representation, and often less.
Experimentally I've determined the following ranges for the byte count:
range                      | byte count
---------------------------+-----------
0 - 127                    | 1
128 - 16383                | 2
16384 - 2097151            | 3
2097152 - 268435455        | 4
268435456 - 4294967295 (?) | 5
The range for 5 bytes runs at least up to 4294967295, i.e. 2**32-1, but above that, something goes wrong, at least on Indigoperl 5.6.1/Win32. Suddenly, I get a value for 2**32 that doesn't make any sense: it takes 10 bytes, for a start, while 2**32-1 only takes 5. In theory, there's no upper limit for which you could use this kind of representation, so I have no idea on what's going on.
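Those ranges are easy to verify empirically; a small sketch:

```perl
# Print the byte count pack 'w' uses at each range boundary
# from the table above.
for my $n (0, 127, 128, 16383, 16384, 2097151, 2097152) {
    printf "%9d -> %d byte(s)\n", $n, length pack 'w', $n;
}
```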
That problem aside, here's an extra way you can save on space: sort the numbers in ascending order, and store the differences between successive numbers, the first value being the smallest number itself (its difference from zero). The more numbers you have, the closer together they'll be, and that way you might shave off a few bytes.
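A rough sketch of the delta idea, using the set from the original question; the gaps pack into fewer bytes than the absolute values would:

```perl
# Sort the set, store the first value plus gaps, BER-pack the result.
my @set    = (1, 100, 105, 1003);
my $prev   = 0;
my @deltas = map { my $d = $_ - $prev; $prev = $_; $d }
             sort { $a <=> $b } @set;    # gaps: 1, 99, 5, 898
my $packed = pack 'w*', @deltas;
print length($packed), " bytes vs 14 ascii characters\n";
```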
Re: Compressing a set of integers
by PhiRatE (Monk) on Jan 26, 2003 at 02:04 UTC
Since someone else has already posted a reasonable solution, I'll add a couple of notes:
When discussing numbers, it is important to note the boundaries. You said integer, but are negatives possible? Is the range outside the 16-bit signed limit? The 32-bit signed limit? Is there any limit at all?
When looking for performance solutions, you should also provide information on the platform. Do you need portability? Are you on a Win32 or Unix platform? Are you capable of understanding C code? (An Inline::C function would make short work of your problem.)
And if possible, providing context is valuable. There have been numerous times when people have posted a very limited description, asking a very targeted question, and it turned out that the entire operation could be avoided by a different overall strategy. Context helps experienced monks identify those situations.
The numbers represent an index into another array.
The numbers are positive. I could offset the numbers
by a fixed amount if it would help somehow.
I have about 50,000 numbers today, but I don't want
the code to have a 16 bit limit since I expect the
dataset to grow.
I want to be able to store the data with Storable,
and I think that it will not support a custom
Inline::C representation. If there is another
high-performance way to store the data that can
take advantage of Inline::C, then that would be
a candidate.
My platform is somewhat crufty. It is HP-UX 10.20
for the production web server and linux as the
data generator. For generality and future-proofing,
I prefer portable code, but this is not a hard
requirement.
So far, my main ideas are:
- Translate the numbers to base64 to make them
more compact in string form.
- Reorder the data so that more common numbers
are smaller and are therefore shorter in string form.
- Create and optimize a state machine to make a
real compressor tuned for this special set of characters.
I have been reluctant to use our database for this
task because the lists of numbers can get quite
long. I have a VARCHAR limit of 255 characters, and
I don't want to get into BLOBs. I could use another
database such as SQLite which uses strings as the
datatype, but I have almost no experience with this
type of solution. I do plan on implementing a
solution (for the larger problem) in SQLite for
comparison purposes.
The larger problem is that I am investigating
speed versus memory tradeoffs in on-line queries
involving a many-to-many relationship.
I have an *extremely*
fast solution using Table::ParentChild,
but it cannot be stored using the Storable module.
This is because Table::ParentChild uses an XS data
structure, which represents each parent-child relationship
in four 32 bit integers. I can sacrifice a little
speed and use less memory. To this end
I have coded another very fast solution using hashes
and arrays, and most of the memory it uses is in
these strings of numbers. I think that the strings
are still wasteful, since they represent only about 3 bits
of information per digit and one bit or so of information
per delimiter character.
I hope to write more about the larger problem in the
future, but for now I find that the core of my
solution relies on these strings of numbers.
The compression problem sounded familiar, so I hoped that
someone would have some ideas.
I have run across this problem before: I wanted
compression, but I had a line-oriented file and
wanted to be able to decompress just a single
line at a time, with that line located anywhere
in the file.
It seems like an ordinary problem in compression,
but after searching for a few hours I didn't find anything,
so I asked first in chat and then here.
Update:
I have found the classification of the larger problem
that I am working on. It is a graph problem. My graphs
are sparse and large, so I am using an adjacency list.
The sets that I am compressing are rows in the adjacency
list. This is described in chapter 8 of
Mastering Algorithms with Perl,
where the Graph modules are described.
It should work perfectly the first time! - toma
It's worth pointing out that attempting to compress a string containing a list of comma-delimited ascii numbers using Base64 will make the string grow in size, not shrink.
E.g. a string containing 1000 5-digit numbers separated by commas is 6000 bytes. Encode that in Base64 and you get a string nearly 8K in length.
Pack the same string using 16 bits per number and you get 2000 bytes.
If you really need to make it smaller still, then you might consider Compress::Zlib::memGzip, which compresses the same ascii list into 1800 bytes.
There are caveats associated with this 10% reduction in size: a) your compression ratio will vary with the data set; b) Compress::Zlib::memGunzip is 10 to 15 times slower than unpack.
If your main concern is that using pack & unpack with format 'S*' will limit you to 16-bit numbers, then move to 32 bits using 'L*'. This will pack the same 1000 numbers into 4000 bytes, which makes it look like not such a good option until you think about what happens when you start using 6-digit numbers.
The string containing 100,000 .. 101,000, comma-delimited in ascii, is 7000 bytes. Base64: 9467, zipped: 1817, packed 32-bit: 4000. Again, Zlib is looking good, but now look at the figures for a random mix of 1 to 6-digit numbers:
ascii: 6458, base64: 8726, zipped: 3169, packed 32-bit: 4000.
Suddenly the trade for decompression speed looks less valuable, and if you move to a random mix of 8-digit numbers, Zlib starts averaging 4500 bytes for 1000 numbers while pack remains at 4000 bytes.
Finally, with Zlib, there is a heavy penalty for small lists. Lists of 1 x 5-digit number average around 25 bytes, 4 x 5-digit numbers around 45 bytes, and 4 x 8-digit numbers around 75 bytes. The corresponding pack'd lists come out at 2/8 bytes for 'S*' and 4/16/16 for 'L*'.
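The comparison is straightforward to reproduce; a sketch along these lines, noting that the exact Zlib figure depends on the data and the zlib version:

```perl
use Compress::Zlib qw(memGzip memGunzip);

# Size up three representations of 1001 consecutive 6-digit numbers.
my $ascii  = join ',', 100_000 .. 101_000;
my $zipped = memGzip($ascii);
my $packed = pack 'L*', split /,/, $ascii;
printf "ascii: %d, zipped: %d, packed 32-bit: %d\n",
       length $ascii, length $zipped, length $packed;
```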
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.
Re: Compressing a set of integers
by I0 (Priest) on Jan 26, 2003 at 16:47 UTC
foreach ( @array ) {
    my @a = sort { $a <=> $b } split /,/;          # sort the set numerically
    my $d = 0;
    for ( @a ) { ( $d, $_ ) = ( $_, $_ - $d ); }   # delta-encode in place
    $_ = pack 'w*', @a;                            # BER-compress the deltas
}
Compressing a set of integers - Solution
by toma (Vicar) on Jan 27, 2003 at 05:14 UTC
Many thanks for all the responses, especially
I0 for the brilliant snippet of code and
bart for the clear explanation
of the benefits of pack with the "w"
template. All of the responses were valuable;
they really improved my understanding.
I have implemented a solution
along these lines and now I only need about
26% as much memory as I did with the ascii
dataset. I am delighted with this solution!
With this solution, I expect to save about 1.5 million
numbers in about 2 megabytes of storage. Wow!
Here is the proof of concept code that I
ended up with:
use Storable;
use File::Flat;

my @to_save;
my $nrows  = 10000;
my $rowmax = 300;

# Generate test data: rows of ascending random numbers
for (0..$nrows) {
    my @r;
    $r[0] = int(rand(20));
    for (1..int(rand($rowmax))) {
        push @r, $r[$_-1] + 1 + int(rand(200));
    }
    push @to_save, join ',', @r;
}

# Delta-encode each row and pack it with BER compression
my $in = "";
foreach ( @to_save ) {
    $in .= $_ . "\n";
    my @a = sort { $a <=> $b } split /,/;
    my $d = 0;
    for ( @a ) { ( $d, $_ ) = ( $_, $_ - $d ); }
    $_ = pack 'w*', @a;
}

open(TEXT, ">array.txt") or die "Can't open output file";
print TEXT $in;
close TEXT;

store(\@to_save, 'array.sto');
my $to_retrieve = retrieve('array.sto');

# Decode: unpack, then undo the delta encoding with a running sum
my $out = "";
foreach ( @$to_retrieve ) {
    my @a = unpack 'w*', $_;
    my $c = 0;
    foreach my $n (@a) {
        $c += $n;
        $out .= $c . ",";
    }
    chop $out;
    $out .= "\n";
}

if ($in eq $out) {
    print "OK\n";
}

my $sz_in = File::Flat->fileSize('array.txt')
    or die "Can't size array.txt";
my $sz_out = File::Flat->fileSize('array.sto')
    or die "Can't size array.sto";
print "Final size = ", int(100*$sz_out/$sz_in+0.5), "% of input\n";
It should work perfectly the first time! - toma
I'd seen, but never really taken any notice of, the 'w' format for pack & unpack. Totally made for the job.
Perhaps the only downside of using 'w' instead of 'S' or 'L' is that you need to unpack the whole string to get at an individual value. With 'S' or 'L' you could index into the string with substr to read or write an individual value. How much of a penalty this is will depend on your application and on whether you always need to access all of the values in a given string.
Probably worth it if the savings on space are enough.
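For the fixed-width case, that substr indexing might look like this (4-byte width hard-coded for 'L'):

```perl
# Read and write one element of a pack 'L*' string in place,
# without unpacking the whole thing.
my $packed = pack 'L*', 10, 20, 30, 40;
my $i      = 2;                                    # want the third element
my $value  = unpack 'L', substr($packed, $i * 4, 4);
substr($packed, $i * 4, 4) = pack 'L', 300;        # overwrite it in place
```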
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.
---
print map { my ($m)=1<<hex($_)&11?' ':'';
$m.=substr('AHJPacehklnorstu',hex($_),1) }
split //,'2fde0abe76c36c914586c';