Specialized data compression
by wanna_code_perl (Pilgrim)
on Sep 11, 2012 at 00:56 UTC
wanna_code_perl has asked for the
wisdom of the Perl Monks concerning the following question:
I have an accelerometer with onboard firmware that generates CSV files with records at 80Hz, but only when there is activity. Empirical analysis suggests the mean rate is around 6.0Hz. The CSV has records like this:
17077.395763,-739,1059,-16734
If you want to see a sample data file, you can download a <5 hour sample here: accel.csv (3.0MB CSV).
The columns are seconds (since recording began), followed by the x, y, z accelerometer values as signed 16-bit integers in units of 1/16384 g (i.e. 16384*g).
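To make the record layout concrete, here is a minimal sketch that splits the sample record and converts the raw values to g. It assumes "16384*g" means 16384 counts per g, which the z value of the sample record (close to -1 g) seems to support; the variable names are only illustrative.

```perl
use strict;
use warnings;

# Split the sample record into time and raw axis values.
my ($seconds, @raw) = split /,/, "17077.395763,-739,1059,-16734";

# Assumption: 16384 raw counts correspond to 1 g.
my @g = map { $_ / 16384 } @raw;

printf "t=%.4f s  x=%.4f g  y=%.4f g  z=%.4f g\n", $seconds, @g;
```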
What I've Tried
Now, the problem is these files are a rather inefficient use of space. To the extent that disk is (not) cheap in my application, I need to reduce their size.
As a first cut, I post-process the files with pack('fsss',...), which gives about 3:1 reduction. Note that the last two digits of the timestamp are spurious, so it can be converted to a single-precision float.
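For reference, a minimal sketch of that pack step. The helper name pack_record is hypothetical; the template is the 'fsss' mentioned above, which in Perl means a native single-precision float followed by three native signed 16-bit integers, 10 bytes in total against roughly 30 bytes of CSV text.

```perl
use strict;
use warnings;

# Pack one CSV record into 10 bytes: a native single-precision float
# for the time plus three native signed 16-bit integers.
sub pack_record {
    my ($line) = @_;
    chomp $line;
    my ($t, $x, $y, $z) = split /,/, $line;
    return pack 'fsss', $t, $x, $y, $z;
}

my $csv    = "17077.395763,-739,1059,-16734";
my $packed = pack_record($csv);

# Round trip to see what survives the conversion.
my ($t, $x, $y, $z) = unpack 'fsss', $packed;

printf "%d bytes of CSV -> %d bytes packed\n", length $csv, length $packed;
printf "t=%.4f x=%d y=%d z=%d\n", $t, $x, $y, $z;
```

One caveat worth checking on real data: a 32-bit float has a 24-bit significand, so for times between 16384 and 32768 seconds it only resolves steps of about 0.002 seconds. Whether that is enough depends on how many digits of the timestamp are really meaningful.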
Further compressing the output with xz or bzip2 brings that up to about 5.5:1 (compression alone without pack() was about 4:1).
Finally, I started conditionally storing the time as an 8-bit delta in units of 1/10000 second whenever the delta from the previous record is small enough to fit (at 80Hz, it is). Otherwise I store a (magic) 0 followed by the time as an unsigned 32-bit count of 1/10000 seconds. Hence, the record is either Csss (98.5% of records) or CLsss (1.5% of records). This brought the ratio up to 6.5:1 after compression with xz, at the cost of a little more complexity.
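The delta scheme can be sketched roughly as follows. This is not the exact code in use, just an illustration under a few assumptions: times are rounded to integer ticks of 1/10000 second, a delta of 1..255 ticks uses the short form, and everything else (including the first record and a delta of 0, which would collide with the magic byte) uses the long form. The function names and the second and third sample records are made up.

```perl
use strict;
use warnings;

# Encode [seconds, x, y, z] records. Times become integer ticks of 1/10000 s.
# Short form 'Csss':  tick delta 1..255 from the previous record (7 bytes).
# Long form  'CLsss': a magic 0 byte, then the absolute tick count (11 bytes).
sub encode_records {
    my @records = @_;
    my ($out, $prev) = ('', undef);
    for my $r (@records) {
        my ($t, @xyz) = @$r;
        my $ticks = int($t * 10_000 + 0.5);    # assumed rounding rule
        my $delta = defined $prev ? $ticks - $prev : 0;
        $out .= ($delta >= 1 && $delta <= 255)
            ? pack('Csss',  $delta, @xyz)
            : pack('CLsss', 0, $ticks, @xyz);
        $prev = $ticks;
    }
    return $out;
}

# Decode back to [ticks, x, y, z] records.
sub decode_records {
    my ($buf) = @_;
    my ($pos, $ticks, @records) = (0, 0);
    while ($pos < length $buf) {
        my $delta = unpack 'C', substr($buf, $pos, 1);
        my @xyz;
        if ($delta == 0) {    # long form: skip the magic byte, read absolute time
            ($ticks, @xyz) = unpack 'x L sss', substr($buf, $pos, 11);
            $pos += 11;
        }
        else {                # short form: add the delta to the running time
            $ticks += $delta;
            @xyz = unpack 'x sss', substr($buf, $pos, 7);
            $pos += 7;
        }
        push @records, [$ticks, @xyz];
    }
    return @records;
}

# Illustrative input: only the first record's axis values come from the sample above.
my @in = (
    [17077.3957, -739, 1059, -16734],
    [17077.4082, -741, 1060, -16730],    # 125 ticks later (80Hz): short form
    [17080.0000, -700, 1000, -16700],    # long gap: long form
);
my $bin = encode_records(@in);
my @out = decode_records($bin);
printf "%d records in %d bytes\n", scalar @out, length $bin;
```

The output of encode_records would then be piped through xz or bzip2 as before.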
Can I improve significantly on 6.5:1 before I move on to lossy methods, such as reducing the frequency and resolution?