http://www.perlmonks.org?node_id=1037779

creeble has asked for the wisdom of the Perl Monks concerning the following question:

I have a Class::Struct (using array form, fwiw) containing about a dozen string members. I build a database of these by sorting them into an array.

The total length of the strings used to create this database is about 2MB; about 150 bytes of actual data times about 15,000 records, say.

As an array of Class::Struct objects, this takes up over 15MB in ram. I am unfortunately running on an embedded processor with limited ram.

How could I store these strings in a more compact way, and still use a convenient accessor function (preferably, the same one, namely a class function) to retrieve the data? The db is read-only once it has been sorted.

Or, to perhaps put it another way, what would it take to write something like Class::Struct in XS with a non-growable string as the only data type?

Or am I already missing something from CPAN? What would be a more compact form to store

Replies are listed 'Best First'.
Re: Compact data classes
by BrowserUk (Patriarch) on Jun 08, 2013 at 06:07 UTC

    This uses 4.4MB of ram to construct 15,000 objects each of 10 x 15-char fields.

    The array holding those 15k objects (containing a total of 2.15MB of string data) occupies 3.67MB of ram.

    The accessors are lvalue subs and their performance should be comparable with any other perl OO objects.

    No validation is done.

    use Class::Struct::Compact; use Devel::Size qw[ total_size ]; use Data::Dump qw[ pp ]; my @chars = ( 'a'..'z' ); sub dummy { my $n = shift; join '', @chars[ map int( rand @chars ), 1 .. $n ]; } Class::Struct::Compact->new( 'Test', F1 => 15, F2 => 15, F3 => 15, F4 => 15, F5 => 15, F6 => 15, F7 => 15, F8 => 15, F9 => 15, F10 => 15, ); printf "Class constructed: check mem: "; <>; our $N //= 15_000; my @db; push @db, Test->new( dummy( 150 ) ) for 1 .. $N; printf "Instances created: check mem: "; <>; @db = sort{ $a->F1 cmp $b->F1 } @db; #pp \@db; printf "Instances sorted: check mem: "; <>; print "total size: ", total_size \@db; for( 0, $#db ) { print "Record $_"; print "\t", $db[ $_ ]->F1; print "\t", $db[ $_ ]->F2; print "\t", $db[ $_ ]->F3; print "\t", $db[ $_ ]->F4; print "\t", $db[ $_ ]->F5; print "\t", $db[ $_ ]->F6; print "\t", $db[ $_ ]->F7; print "\t", $db[ $_ ]->F8; print "\t", $db[ $_ ]->F9; print "\t", $db[ $_ ]->F10; } __END__ C:\test>perl \perl64\site\lib\Class\Struct\Compact.pm Class constructed: check mem: 4.6MB Instances created: check mem: 9.0MB Instances sorted: check mem: 9.1MB total size: 3851232 Record 0 aaartlhgpkapaol jrbkelwlfklkjgn rhdrrltzezyuenc zxccfpxpbzcxoqy ysfqlfkrnhmaqhf vclpccofujbyars gwrdngknxjyxxni foiuaojwzrqouzc msbepsdptomdtbe qazhgrkywspzsts Record 14999 zzypdjmcgmgxnso yzygmkgabelvqlj xihybqagfiydipo fpgsaybyhrfawuc zyxekczeaxfomrs lwyannakxmgists nzehvwysfvpkeuf gggdblbshwhgnto vtdgbjvgwevpurx wtdgjumncxgfaih

    The module & test code:


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      omg, you may be my best friend. Let me check that and get back.
      Wait, where is this magical Class::Struct::Compact? It's not in CPAN afaict?

        As hdb points out, it's in the spoiler at the end of my post.

        It's not on cpan because I wrote it -- actually adapted it from some existing code -- in reponse to your OP.

        I'd want to use a few times myself and see what else it needs before putting it out for general use. For starters it needs some error checking and Carp for when things go wrong.

        It'd also be nice to use pack templates to allow for numeric fields; but then it I'd have to drop the lvalue-ness of the accessors.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      (I just love it when I reply thinking I'm logged in, ugh.)

      This is very clever! The only drawback is the fixed record length, but the savings is so significant compared to Class::Struct's ~5x overhead I should be able to find a workable size.

      Let me PM you about error checking, etc. Thanks so very much.

      Okay, one more question for BrowserUK here which may be of benefit to other similarly-clueless sub-monks like myself.

      Why does your solution use a compiled (eval) class? Is this faster or does it save memory vs. a solution like roboticus proposes?

        Because it allows me to 'hardcode' the numbers in the substrs.

        If you uncomment the print before the eval, you'll see something like:

        package Test; use constant { F1_N => 0, F2_N => 15, F3_N => 30, F4_N => 45, F5_N => +60, F6_N => 75, F7_N => 90, F8_N => 105, F9_N => 120, F10_N => 135, } +; use constant { F1_L => 15, F2_L => 15, F3_L => 15, F4_L => 15, F5_L => + 15, F6_L => 15, F7_L => 15, F8_L => 15, F9_L => 15, F10_L => 15, }; sub new { my $class = shift; my $self = shift // ''; return bless \$self, $class } # line 1 "sub_F4" sub Test::F4 :lvalue { my $self = shift; substr( $$self, F4_N(), F4_L() ); }

        F4_N() and F4_L() are constant subs which get optimised away during compilation, leaving hardcoded numbers which are faster than variables.

        The memory saving comes from packing the fields into single strings; the performance comes from asking the sub to do as little as possible.

        That said, by explaining that, I've spotted another couple of optimisations; and a potential bug. I'll get back to you with a revised version 2 days from now.

        (A good reason for not uploading to cpan straight away :)


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Compact data classes
by roboticus (Chancellor) on Jun 08, 2013 at 00:20 UTC

    creeble:

    I can think of a few possibilities, but I don't know if any would be good because i don't know how you intend to access the values, how fast you need to look them up, etc.

    For example, you could concatenate all your strings into one long one using the scheme of having a byte containing the length of the field and the text. (You did mention that each object totals about 150 bytes of data, if I understand correctl.) Then your objects would store only the offset of the first field. Each access function would simply need to skip the appropriate number of fields.

    $ cat t.pl #!/usr/bin/perl use strict; use warnings; my $joe = kablooie::new('Joe', 'Blow'); my $jane = kablooie::new('Jane', 'Smith'); my $name = $joe->fname(); print "joe's first name: $name\n"; $name = $jane->lname(); print "jane's last name: $name\n"; package kablooie; my $kablooie_storage; BEGIN { $kablooie_storage = "Bob's your uncle"; } sub new { my ($fname,$lname) = @_; my $offset = length($kablooie_storage); $kablooie_storage .= chr(length($fname)) . $fname; $kablooie_storage .= chr(length($lname)) . $lname; return bless \$offset, 'kablooie'; } sub fname { # FName is the first field my $self = shift; my $offset = $$self; my $len = ord(substr($kablooie_storage,$offset,1)); return substr($kablooie_storage,$offset+1,$len); } sub lname { # LName is the second field my $self = shift; my $offset = $$self; my $len = ord(substr($kablooie_storage,$offset,1)); $offset+=$len+1; $len = ord(substr($kablooie_storage,$offset,1)); return substr($kablooie_storage,$offset+1,$len); }

    When run, I get:

    $ perl t.pl joe's first name: Joe jane's last name: Smith

    You could add some simple compression scheme, too. (RLE might be useful if you have lots of repeated characters, and if you don't mind restricting your character set, you could compress the characters into 6 bits each.)

    That said, I have no idea how much space you could save using this technique, and whether it would be worth it or not.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      I tried a variation of that using pack and unpack, but it was unworkably slow. I do need quick access to all the fields. It did save a fair amount of memory, but less than I would have hoped.

      What I'm starting to envision is a somewhat-specialized XS module that builds something similar (I could afford pointers to the individual field strings) in C, and then allows access to it via XS. The db is read-only once it's sorted, so updates aren't an issue. You could malloc large blocks and just write null-terminated strings to them, updating the pointers for the fields.

      But I know almost nothing about XS and whether the conversion from a string in C to what perl thinks is a string would be painfully slow. I would guess not, since it must do it internally all the time?

Re: Compact data classes
by v-zor (Initiate) on Jun 10, 2013 at 01:12 UTC
    HTH : http://search.cpan.org/search?query=Gzip&mode=all
Re: Compact data classes
by sundialsvc4 (Abbot) on Jun 08, 2013 at 14:43 UTC

    Well, what about SQLite?   It’s public-domain(!), used in cell phones all over the planet, and known to be quite efficient.   Maybe you could make that your primary backing-store.   (Perhaps also this would eliminate the need for sorting?)   Because SQLite is used in many resource-constrained situations, it has a lot of options for making efficient use of memory.   (Here, I’m referring to the software itself, not just its Perl implementation ...)

    (Quite seriously, “SQLite is another Swiss Army® knife of computer programming” ...)

    A homebrew in-memory NRU caching scheme is fairly easy to construct in Perl ... if it turns out that you actually need one when using SQLite.   (Find out, first.)   A hashref can provide random access to strings, while a separate array of hashrefs (to the same Perl objects) provides for round-robin recycling:   when the array has reached its arbitrarily-set limit, shift or pop an element off, delete the key from the hashref (you must do both!), then let the now-unreferenced data disappear into the gloom while you create another key, add a reference to it to the hash, and unshift or push another reference to the same thing onto the array.   (The data-records now have a reference-count of 2, one from the hash, the other from the list, as they will for the duration.)   It’s NRU = Not Recently Used, not LRU = Least Recently Used, but it’s all resident-memory, so, so what.

      How much memory is used by constructing an SQLite DB with 15,000 records containing 10 x 15-char fields?

      You recommend this module so often; surely you know this off the top of your head?

      Even if not, it will be the work of minutes to try it out won't it?

      Come on. Prove me wrong. Show me that this isn't just another case of a keyword triggering one of your 6 remaining synapses to fire, and cause you to trot out the associated, autonomic "advice".


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Please limit your discussions on this forum totechnical objections, not vilifying of fellow Monks.   You have been courteously asked this before.   Thank you.

      And just for the record, while I appreciate that SQLite is free and easy, I'm not looking for a "backing store" per se -- I need this 'db' to be all in ram, and my experience with SQLite using an in-ram database hasn't shown it to be particularly memory-efficient.

      That said, I'd be happy to be proven wrong; I haven't used SQLite from Perl in many years.