PerlMonks  

Serializing a large object

by daverave (Scribe)
on Sep 25, 2010 at 14:32 UTC (#861961=perlquestion)
daverave has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I'm using the following package, which was adapted from http://stackoverflow.com/questions/3790166/, to store a large list of ranges and allow time-efficient "how-many-ranges-cover-this-given-range" queries.

    use strict;
    use warnings;

    package RangeMap;

    # Credit: Aristotle Pagaltzis!

    sub new($$$) {
        my $class      = shift;
        my $max_length = shift;
        my $ranges_a   = shift;

        my @lookup;
        for ( @{$ranges_a} ) {
            my ( $start, $end ) = @$_;
            my @idx = $end >= $start
                ? $start .. $end
                : ( $start .. $max_length, 1 .. $end );
            for my $i (@idx) { $lookup[$i] .= pack 'L', $end }
        }

        bless \@lookup, $class;
    }

    sub num_ranges_containing($$$) {
        my $self = shift;
        my ( $start, $end ) = @_;

        return 0 unless ( defined $self->[$start] );
        return 0 + grep { $end <= $_ } unpack 'L*', $self->[$start];
    }

    1;

After creating an object I store it using nstore for future use. This usually results in a binary file of size 500MB-2GB.
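
For reference, a minimal sketch of the store-and-reload round trip with Storable's nstore and retrieve. The object contents and the temporary file are made up for illustration; a real script would build the object with RangeMap->new and use a fixed path.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

# Stand-in for a RangeMap object: a blessed array of packed 'L' strings.
# These values are invented purely for illustration.
my @lookup;
$lookup[1] = pack 'L*', 5, 6;
$lookup[2] = pack 'L*', 5, 6, 7;
my $map = bless \@lookup, 'RangeMap';

# Hypothetical temporary file, removed when the script exits.
my ( $fh, $file ) = tempfile( UNLINK => 1 );
close $fh;

nstore( $map, $file );         # write in network byte order
my $copy = retrieve($file);    # read it back; the blessing is preserved

printf "class: %s, ends at position 2: %s\n",
    ref($copy), join( ',', unpack 'L*', $copy->[2] );
```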

So, I wonder if I could somehow store it more compactly to save on disk space, without slowing load time too much.

I know it's a give and take - and I'm willing to pay in a longer load (retrieve) time, since my common usage would be a retrieval followed by a few million queries.

Some more details which might be relevant: the number of ranges is usually around 30k, and max_length varies significantly, from 20k to 15M; most commonly it's around 3-4M.
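
For a rough sense of where the file size comes from: the constructor appends one packed 'L' value (4 bytes) to every position a range covers, so the packed payload is about 4 bytes times the summed range lengths, before any per-element overhead from Perl or Storable. A minimal sketch of that arithmetic, where payload_bytes is a hypothetical helper and the ranges are made up:

```perl
use strict;
use warnings;

# Hypothetical helper: bytes of packed 'L' data that RangeMap->new would
# accumulate, ignoring per-scalar and Storable overhead.
sub payload_bytes {
    my ( $max_length, $ranges ) = @_;
    my $positions = 0;
    for my $r (@$ranges) {
        my ( $start, $end ) = @$r;

        # Same length rule as the constructor: a wrapped range runs from
        # $start to $max_length and then from 1 to $end.
        $positions += $end >= $start
            ? $end - $start + 1
            : ( $max_length - $start + 1 ) + $end;
    }
    return 4 * $positions;    # pack 'L' yields exactly 4 bytes per value
}

# Made-up example: two ranges on a circle of length 10.
my $bytes = payload_bytes( 10, [ [ 2, 4 ], [ 9, 4 ] ] );
print "$bytes bytes of packed data\n";
```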

Thanks,
Dave

Re: Serializing a large object
by BrowserUk (Pope) on Sep 25, 2010 at 17:19 UTC

    To store the data in a more compact form, it's necessary to understand the data so you can look for more compact representations of the information it represents. And for now, that isn't clear (to me).

    I set up a small set of input ranges, both the normal ($start <= $end) type and the inverted ($end < $start) type, and then tried to make sense of the numbers returned by num_ranges_containing(). I cannot.

    I input these ranges:

    my @ranges = (
        [ 0, 5 ], [ 1, 6 ], [ 2, 7 ], [ 3, 8 ], [ 4, 9 ], [ 5, 10 ],
        [ 5, 0 ], [ 6, 1 ], [ 7, 2 ], [ 8, 3 ], [ 9, 4 ], [ 10, 5 ],
    );

    And then asked for the counts containing the ranges: [0,3], [1,4], ...., [7,10], and as the returns didn't add up, I did a simple plot:

    c:\test>861961
    ------        0.. 5
     ------       1.. 6
      ------      2.. 7
       ------     3.. 8
        ------    4.. 9
         ------   5..10
         -----    5.. 0
     -    ----    6.. 1
     --    ---    7.. 2
     ---    --    8.. 3
     ----    -    9.. 4
     -----       10.. 5
    ----          0.. 3
    range: 0 .. 3 is contained by 1 ranges
     ----         1.. 4
    range: 1 .. 4 is contained by 4 ranges
      ----        2.. 5
    range: 2 .. 5 is contained by 4 ranges
       ----       3.. 6
    range: 3 .. 6 is contained by 3 ranges
        ----      4.. 7
    range: 4 .. 7 is contained by 3 ranges
         ----     5.. 8
    range: 5 .. 8 is contained by 3 ranges
          ----    6.. 9
    range: 6 .. 9 is contained by 2 ranges
           ----   7..10
    range: 7 .. 10 is contained by 1 ranges

    Looking at just a couple:

    • [0,3] returns 1;

      But to my eyes, it appears to be contained by at least: the first, and last two input ranges?

    • [1,4] returns 4;

      But only appears to be contained by the first two, and last input ranges?

    How am I misinterpreting the data?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      The ranges are given in biological coordinates, meaning the first coordinate is 1 (0 is illegal) and max_length is a legal coordinate. So, if max_length=10 then our coordinates are in 1..10 (both inclusive). Also note that a range like [2,4] expands to 2,3,4 since both start and end are inclusive.
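
      To make the inclusive, 1-based convention concrete, here is a minimal sketch that expands a range the same way the constructor in the question does. expand_range is a hypothetical helper name, not part of the package, and max_length=10 is just the value used in the example.

```perl
use strict;
use warnings;

# Expand an inclusive, 1-based range using the same rule as RangeMap->new.
# expand_range is a hypothetical name introduced for this illustration.
sub expand_range {
    my ( $max_length, $start, $end ) = @_;
    return $end >= $start
        ? ( $start .. $end )
        : ( $start .. $max_length, 1 .. $end );
}

# An ordinary range, then one that wraps past max_length back to 1.
my @plain   = expand_range( 10, 2, 4 );
my @wrapped = expand_range( 10, 8, 3 );

print join( ',', @plain ),   "\n";
print join( ',', @wrapped ), "\n";
```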

      This convention always causes some trouble, and most of the time I convert the coordinates at the beginning and at the end so I can work with 0-based coordinates. In this case I didn't, since it's quite simple, so I'm working with biological coordinates.

      Anyway, if we now take your example and arbitrarily replace all 0's with 1's we get:

      my @ranges = (
          [ 1, 5 ], [ 1, 6 ], [ 2, 7 ], [ 3, 8 ], [ 4, 9 ], [ 5, 10 ],
          [ 5, 1 ], [ 6, 1 ], [ 7, 2 ], [ 8, 3 ], [ 9, 4 ], [ 10, 5 ],
      );
      my $rm = RangeMap->new( 10, \@ranges );

      Now, [1,3] returns 5; since only the first two and last three ranges contain it.

      [1,4] returns 4; since only the first two and last two ranges contain it.

      I hope it makes sense now.
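
      The two counts quoted here can be checked with a self-contained script. This is a sketch that inlines a condensed copy of the package from the question (prototypes dropped, logic otherwise unchanged) and runs the two queries against the 1-based ranges with max_length=10.

```perl
use strict;
use warnings;

package RangeMap;    # condensed copy of the package from the question

sub new {
    my ( $class, $max_length, $ranges_a ) = @_;
    my @lookup;
    for my $range (@$ranges_a) {
        my ( $start, $end ) = @$range;
        my @idx = $end >= $start
            ? $start .. $end
            : ( $start .. $max_length, 1 .. $end );

        # Append this range's end to every position it covers.
        for my $i (@idx) { $lookup[$i] .= pack 'L', $end }
    }
    return bless \@lookup, $class;
}

sub num_ranges_containing {
    my ( $self, $start, $end ) = @_;
    return 0 unless defined $self->[$start];

    # Count stored ends at $start that reach at least as far as $end.
    return 0 + grep { $end <= $_ } unpack 'L*', $self->[$start];
}

package main;

my @ranges = (
    [ 1, 5 ], [ 1, 6 ], [ 2, 7 ], [ 3, 8 ], [ 4, 9 ], [ 5, 10 ],
    [ 5, 1 ], [ 6, 1 ], [ 7, 2 ], [ 8, 3 ], [ 9, 4 ], [ 10, 5 ],
);
my $rm = RangeMap->new( 10, \@ranges );

my $count_1_3 = $rm->num_ranges_containing( 1, 3 );
my $count_1_4 = $rm->num_ranges_containing( 1, 4 );
print "[1,3] => $count_1_3\n";
print "[1,4] => $count_1_4\n";
```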

        So, an inverted range like [9, 4] includes: 1,2,3,4 & 9,10?

Re: Serializing a large object
by BrowserUk (Pope) on Oct 03, 2010 at 06:48 UTC

    Are you still looking for a solution to this problem?

    I think I might have one, but it would take some work to test it. I'd need you to supply me with a test dataset and results.

      I am currently using nstore with a file handle opened using a gzip IO layer. Other solutions would be most welcome. What exactly would you like me to supply, and where should I put it?
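
      For readers who want to try compressed storage without a gzip PerlIO layer, here is a sketch of a different route to the same idea: freeze the object with Storable's nfreeze and compress the result with the core IO::Compress::Gzip module. It is not the nstore-plus-layer setup described in this reply, and the object contents and temporary file are made up for illustration.

```perl
use strict;
use warnings;
use Storable qw(nfreeze thaw);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
use File::Temp qw(tempfile);

# Made-up stand-in for a RangeMap object: 1000 identical packed slots.
my @lookup;
$lookup[$_] = pack 'L*', 5, 6, 7 for 1 .. 1000;
my $map = bless \@lookup, 'RangeMap';

# Hypothetical temporary file, removed when the script exits.
my ( $fh, $file ) = tempfile( UNLINK => 1 );
close $fh;

# Store: serialize in memory, then gzip the frozen image to disk.
my $frozen = nfreeze($map);
gzip( \$frozen => $file ) or die "gzip failed: $GzipError";

# Load: gunzip back into memory, then thaw into an object.
my $packed;
gunzip( $file => \$packed ) or die "gunzip failed: $GunzipError";
my $copy = thaw($packed);

printf "raw %d bytes, gzipped %d bytes\n", length($frozen), -s $file;
```

      Note that this variant holds the whole frozen image in memory on both store and load, which may matter at the file sizes mentioned in the question.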

        A small set (say 1000 or so) of "typical" input ranges, and a hundred or so test ranges along with the result counts when compared against the supplied input set. Timings of how long it took to run would also be useful.

        As for how to exchange them, email seems possible. One set of 1000 pairs, plus one set of 100 pairs + counts, plus a time won't take much space. You could probably even post them here. /msg me for an email address if you want to go that route.


