Specialized data compression

by wanna_code_perl (Pilgrim)
on Sep 11, 2012 at 00:56 UTC
wanna_code_perl has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have an accelerometer with onboard firmware that generates CSV files with records at 80Hz but only when there is activity. Empirical analysis suggests the mean rate is around 6.0Hz. The CSV has records like this:

    17077.395763,-739,1059,-16734

If you want to see a sample data file, you can download a <5 hour sample here: accel.csv (3.0MB CSV).

The columns are seconds (since recording began), followed by the x, y, z accelerometer values as signed 16-bit integers scaled at 16384 counts per g.

What I've Tried

Now, the problem is these files are a rather inefficient use of space. To the extent that disk is (not) cheap in my application, I need to reduce their size.

As a first cut, I post-process the files with pack('fsss',...), which gives about 3:1 reduction. Note that the last two digits of the timestamp are spurious, so it can be converted to a single-precision float.

Further compressing the output with xz or bzip2 brings that up to about 5.5:1 (compression alone without pack() was about 4:1).

Finally, I started conditionally storing the time as an 8-bit delta in 1/10000-second units when the delta from the previous record is sufficiently small (at 80Hz, it is); otherwise I store a magic 0 followed by the absolute time as an unsigned 32-bit count of 1/10000 seconds. Hence, each record is either Csss (98.5% of records) or CLsss (1.5% of records). This brought the ratio up to 6.5:1 after compression with xz, at the cost of a little more complexity.
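
A minimal sketch of that conditional record format (the helper name, the exact threshold test, and native byte order are assumptions, not necessarily the exact code used):

use strict;
use warnings;

my $prev_t = 0;   # previous timestamp, in integer 1/10000 s units

# Returns one packed record: Csss in the common case, CLsss otherwise.
sub pack_record {
    my ($t, $x, $y, $z) = @_;   # $t already scaled to 1/10000 s units
    my $delta = $t - $prev_t;
    $prev_t = $t;
    if ($delta > 0 && $delta <= 255) {
        # 8-bit delta plus three signed 16-bit samples
        return pack 'Csss', $delta, $x, $y, $z;
    }
    # magic 0, then the absolute unsigned 32-bit timestamp
    return pack 'CLsss', 0, $t, $x, $y, $z;
}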

The Question

Can I improve significantly on 6.5:1 before I move on to lossy methods, such as reducing the frequency and resolution?

Re: Specialized data compression
by Anonymous Monk on Sep 11, 2012 at 01:26 UTC
    Have you tried 7zip?

      Yes, but it performed slightly worse than xz, most likely because it uses the same compression algorithm.

      -rw-r--r-- 1 469K 10 Sep 19:33 accel_packed.7z
      -rw-r--r-- 1 559K 10 Sep 19:33 accel_packed.bz2
      -rw-r--r-- 1 559K 10 Sep 19:33 accel_packed.gz
      -rw-r--r-- 1 468K 10 Sep 19:33 accel_packed.xz
      -rw-r--r-- 1 560K 10 Sep 19:33 accel_packed.zip
      # Uncompressed file, packed with Csss/CLsss:
      -rw-r--r-- 1 701K 10 Sep 18:34 accel_packed.dat

      These are all about +/- 10%. Unless there's a significantly different and amazing general compression algorithm I haven't considered, I would imagine these results won't improve much, which is why I'm asking here to see if there's a way I can further optimize the data itself.

Re: Specialized data compression
by GrandFather (Cardinal) on Sep 11, 2012 at 08:50 UTC

    Could 8-bit deltas for the x, y, z values gain you anything? If the acceleration tends to be along a straight line, then converting to a vector form may gain you something, especially if you then delta encode.

    A quite different technique that works well for some types of data is to treat each bit position (or group of adjacent bits - nibbles, for example) in the data as a separate stream and run-length encode each "stream". That gives excellent compression for the most significant bits and almost none for the least significant bits, but overall it may give substantially better results than general-purpose Huffman-type encoding.
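
    A rough, encode-only sketch of that bit-plane idea (not GrandFather's own code; run lengths are naively capped at one byte and the per-plane framing needed for decoding is omitted):

use strict;
use warnings;

# Treat each bit position of the 16-bit samples as its own stream and
# run-length encode it; the high planes collapse to a handful of runs.
sub bitplane_rle {
    my @samples = @_;                        # signed 16-bit integers
    my $encoded = '';
    for my $bit (reverse 0 .. 15) {          # most significant plane first
        my @plane = map { (($_ & 0xFFFF) >> $bit) & 1 } @samples;
        my ($run_val, $run_len, @runs) = ($plane[0], 0);
        for my $b (@plane) {
            if ($b == $run_val && $run_len < 255) { $run_len++ }
            else { push @runs, $run_len; ($run_val, $run_len) = ($b, 1) }
        }
        push @runs, $run_len;
        # first byte is the plane's starting bit value, then the run lengths
        $encoded .= pack 'C*', $plane[0], @runs;
    }
    return $encoded;
}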

    True laziness is hard work

      Thanks for the suggestions! It took me a few days to think this through and try some things.

      Indeed, per your suggestion I have tried deltas for x, y, z, although they don't all fit in 8 bits (some deltas will of course overflow 16 bits as well), so I use a variable-width int (8, 16, or 32 bits), which was a pretty big win, even after compression.
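
      One way such a variable-width scheme can be laid out (the escape values -128 and -32768 are an illustrative convention, not necessarily the one used here):

use strict;
use warnings;

# Encode one delta as 1, 3, or 7 bytes. A leading byte of -128 means
# "read a 16-bit value next"; a 16-bit value of -32768 in turn means
# "read a 32-bit value next". The decoder reverses the same chain.
sub pack_vdelta {
    my ($delta) = @_;
    if ($delta >= -127 && $delta <= 127) {
        return pack 'c', $delta;                   # 1 byte, the common case
    }
    elsif ($delta >= -32767 && $delta <= 32767) {
        return pack 'cs', -128, $delta;            # escape byte + 16 bits
    }
    return pack 'csl', -128, -32768, $delta;       # two escapes + 32 bits
}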

      The bitwise transposition you refer to is interesting, but was complex and didn't change things much once xz got ahold of it.

Re: Specialized data compression
by salva (Abbot) on Sep 11, 2012 at 09:28 UTC
    The x, y, z acceleration values in your sample file are highly correlated in time, so using delta values for them should also improve their compressibility.

    Then, you can divide the x, y, z deltas by the time deltas to get the derivative of the acceleration vector and obtain even more normalized values. The problem is that at this point you are using floats instead of integers, but unless you need absolute precision this should be a minor issue. Just select a resolution that fits your problem and discretize the values accordingly.
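
    A minimal sketch of that discretization step (the resolution constant is an arbitrary value for illustration, and as the reply below notes, this step is lossy):

use strict;
use warnings;

use constant RESOLUTION => 100;   # quantization step, in counts per second

# Divide an axis delta by the time delta to get the derivative (jerk),
# then round it to the chosen resolution so it stays an integer.
sub quantized_jerk {
    my ($delta_counts, $dt_seconds) = @_;
    my $jerk = $delta_counts / $dt_seconds;
    return int($jerk / RESOLUTION + ($jerk >= 0 ? 0.5 : -0.5));
}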

    Another possibility is to express the x, y, z deltas (or derivatives) in the coordinate system defined by the Frenet-Serret frame. That would probably produce numbers that are nearer to zero and therefore more compressible.

      These suggestions were helpful. Thank you! Using deltas for x,y,z did help quite a bit.

      The derivative approach (dividing by the time deltas) was interesting to think about, but it would basically break the lossless compression I have going. At the moment, at least, I need to be able to reconstruct the original input verbatim, and the only way to do that is with fixed-point arithmetic.

      The one concession I was able to safely (and verifiably) make is quantizing the time deltas that are close to the sensor's nominal sample interval. The clock has a large (+/- 15%!) but predictable jitter pattern that I was able to isolate and then basically run-length encode.

      With the latest optimizations, I'm sitting at about 9:1 compression, which is acceptable. I was shooting for 10:1, but that last 33 KiB won't come easy.

Re: Specialized data compression
by Kenosis (Priest) on Sep 11, 2012 at 19:36 UTC

    Would it help to convert your decimal values to a higher base, e.g., base 64? Consider the following:

use Modern::Perl;
use Math::BaseCnv;

my $original = <<END;
17077.161362,-595,2619,-15202
17077.174405,601,2115,-14859
17077.187416,-816,1894,-15539
17077.200402,210,1790,-15284
17077.213445,1560,1721,-15457
17077.226432,1408,1483,-15834
17077.239474,947,1161,-17531
17077.252517,689,1396,-17515
17077.265504,-861,1206,-18238
17077.278547,-493,1369,-17951
17077.291589,850,1294,-17360
17077.304576,1110,1215,-17271
17077.317619,326,1055,-13948
17077.330662,1980,1083,-16239
17077.343648,1295,1057,-16154
17077.356691,-1302,1207,-16277
17077.369734,-1986,1216,-16537
17077.382721,-134,1108,-17409
17077.395763,-739,1059,-16734
17077.408774,-2262,1006,-16690
17077.421761,-807,1174,-17046
17077.434803,-591,1081,-17615
17077.447846,-2080,-5757,-21463
17077.460833,-209,-17608,-19409
END

my @original = split ' ', $original;

say for @{ b10to64( \@original ) };

sub b10to64 {
    my $original = $_[0];
    chomp @$original;

    for ( my $i = 0 ; $i < @$original ; $i++ ) {
        my @entries = split ',', $$original[$i];
        my ( $whole, $decimal ) = split '\.', $entries[0];
        $entries[0] = cnv( $whole, 10, 64 ) . '.' . cnv( $decimal, 10, 64 );
        $entries[$_] = cnv( $entries[$_], 10, 64 ) for 1 .. 3;
        $$original[$i] = ( join ',', @entries );
    }

    $original;
}

    Output:

    4Ar.dPI,-9J,ex,-3jY
    4Ar.gb5,9P,X3,-3eB
    4Ar.jmO,-Cm,Tc,-3op
    4Ar.mxI,3I,R.,-3kq
    4Ar.q75,OO,Qv,-3nX
    4Ar.tI0,M0,NB,-3tQ
    4Ar.wTo,Ep,I9,-4Hx
    4Ar.zfb,An,Lq,-4Hh
    4Ar.10qW,-DT,Is,-4S.
    4Ar.140J,-7j,LP,-4OV
    4Ar.17C5,DI,KE,-4FG
    4Ar.1AN0,HM,I_,-4Dt
    4Ar.1DYp,56,GV,-3Py
    4Ar.1Gkc,Uy,Gx,-3zl
    4Ar.1JvW,KF,GX,-3yQ
    4Ar.1N5J,-KM,It,-3.L
    4Ar.1QH6,-V2,J0,-42P
    4Ar.1TS1,-26,HK,-4G1
    4Ar.1Wdp,-BZ,GZ,-45U
    4Ar.1Zp6,-ZM,Fk,-44o
    4Ar.1c.1,-Cd,IM,-4AM
    4Ar.1g9p,-9F,Gv,-4JF
    4Ar.1jLc,-WW,-1Pz,-5FN
    4Ar.1mWX,-3H,-4J8,-4lH

    Conversion artifact:

    4Ar.1c.1,-Cd,IM,-4AM
       ^
       |
       + - split on first '.' when converting back to base 10

    This yielded a 30% reduction in the size of your data in accel.csv, which may help with your final data size after packing and compression.

    Update: Evidently, this would not help...

Re: Specialized data compression
by danaj (Monk) on Sep 17, 2012 at 17:45 UTC

    The Data::BitStream and Data::BitStream::XS modules may be of some help (shameless plug since I'm the author). They're made for supporting the sort of compression you're looking at.

    Using Adaptive Rice coding yields about 6.8:1, and Exponential-Golomb with the best parameters about 6.9:1. xz/bzip2/gzip aren't able to help much with either result, bringing the total to only about 7:1. So, sadly, not "significant" versus 6.5:1.

    One advantage of this is that, like your packing, it writes the values compressed, so no second stage of running xz needed.

    There are lots of ways you could tweak the variable length output. Adaptive Rice works pretty well without a lot of thought about the parameters (the initial values used don't matter much as they'll adjust quickly). You could get complicated with Comma / Taboo, StartStop / StartStepStop, etc. codes if you wanted. It's also easy to do lossy coding by shifting the deltas before encoding and back again after decoding (taking care to keep symmetry in compressor/decompressor).

    Quick example using your CSV:

#!/usr/bin/perl
use warnings;
use strict;
use Data::BitStream::XS;
use Text::CSV;
use autodie;

my $use_arice = 1;
my $csv = Text::CSV->new({ sep_char => ',' });

{
  my $file = 'accel.csv';
  open(my $data, '<', $file);
  my @prev = (0, 0, 0, 0);
  my @rice = (4, 4, 4, 4);
  my @egol = (8, 6, 6, 5);
  my $stream = Data::BitStream::XS->new( file => 'accel.gamma', mode => 'w' );
  while (my $line = <$data>) {
    next if $line =~ /^;/;
    $line =~ s/[\r\n]//g;
    die "Bad CSV: $line" unless $csv->parse($line);
    my @fields = $csv->fields();
    $fields[0] = int($fields[0] * 10000 + 0.5);   # Drop last two digits
    foreach my $n (0..3) {
      my $p = $prev[$n];
      my $v = $fields[$n];
      my $delta = $v - $p;
      my $udelta = ($n == 0)      ? $delta
                 : ($delta >= 0)  ? 2*$delta
                 :                  -2*$delta-1;
      $use_arice ? $stream->put_arice($rice[$n], $udelta)
                 : $stream->put_expgolomb($egol[$n], $udelta);
    }
    @prev = @fields;
  }
  close($data);
  $stream->write_close;
}

# Verify we can read the stream back and write a CSV file
{
  my @prev = (0, 0, 0, 0);
  my @rice = (4, 4, 4, 4);
  my @egol = (8, 6, 6, 5);
  my $stream = Data::BitStream::XS->new( file => 'accel.gamma', mode => 'ro' );
  my $file = 'new.csv';
  open(my $data, '>', $file);
  while (my $v = $use_arice ? $stream->get_arice($rice[0])
                            : $stream->get_expgolomb($egol[0]) ) {
    my $time = $prev[0] + $v;
    $prev[0] = $time;
    printf $data "%.4f", ($time/10000);
    foreach my $n (1..3) {
      my $udelta = $use_arice ? $stream->get_arice($rice[$n])
                              : $stream->get_expgolomb($egol[$n]);
      my $delta = (($udelta & 1) == 0) ? $udelta >> 1 : -(($udelta+1) >> 1);
      my $v = $prev[$n] + $delta;
      print $data ',', $v;
      $prev[$n] = $v;
    }
    print $data "\n";
  }
  close($data);
}

Re: Specialized data compression
by QM (Vicar) on Oct 04, 2012 at 09:34 UTC
    Thanks for posting such an interesting question!

    Another idea I've seen work is to transpose your CSV, so that the first row holds the timestamps, the second row the X values, etc., and then compress that. As the X's may be self-similar, and the Y's self-similar, but the X's and Y's are not, this may give the compression algorithm a break.

    The timestamps should also be delta encoded, except for the first reading after a resume.
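
    A quick sketch of that transposition (file names are placeholders; the delta step for the timestamps is left out):

use strict;
use warnings;

# Read the CSV and write it back with one long row per column
# (times, then x, y, z), ready to be fed to a general-purpose compressor.
my @cols;
open my $in, '<', 'accel.csv' or die "accel.csv: $!";
while (my $line = <$in>) {
    chomp $line;
    my @fields = split /,/, $line;
    push @{ $cols[$_] }, $fields[$_] for 0 .. 3;
}
close $in;

open my $out, '>', 'accel_transposed.csv' or die "accel_transposed.csv: $!";
print {$out} join(',', @$_), "\n" for @cols;
close $out;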

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of
