Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Re^5: Serializing a large object

by BrowserUk (Pope)
on Sep 25, 2010 at 20:14 UTC ( #862001=note: print w/ replies, xml ) Need Help??


in reply to Re^4: Serializing a large object
in thread Serializing a large object

Then I still don't understand the returns from num_ranges_containing().

Given the input you describe, [7,10] matches [5,10], [5,1], [6,1], & [7,2], but it returns 1 (not 4)?

c:\test>861961 12345 1.. 5 123456 1.. 6 234567 2.. 7 345678 3.. 8 456789 4.. 9 567890 5..10 1 567890 5.. 1 1 67890 6.. 1 12 7890 7.. 2 123 890 8.. 3 1234 90 9.. 4 12345 0 10.. 5 1234 1.. 4 RM1: range: 1 .. 4 is contained by 4 + ranges 2345 2.. 5 RM1: range: 2 .. 5 is contained by 4 + ranges 3456 3.. 6 RM1: range: 3 .. 6 is contained by 3 + ranges 4567 4.. 7 RM1: range: 4 .. 7 is contained by 3 + ranges 5678 5.. 8 RM1: range: 5 .. 8 is contained by 3 + ranges 6789 6.. 9 RM1: range: 6 .. 9 is contained by 2 + ranges 7890 7..10 RM1: range: 7 .. 10 is contained by 1 + ranges

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Comment on Re^5: Serializing a large object
Select or Download Code
Re^6: Serializing a large object
by daverave (Scribe) on Sep 25, 2010 at 21:47 UTC
    You're right, that's a bug. I think replacing the grep with a one distinguishing between two cases solves it. Here is the updated package:

    use strict; use warnings; package FastRanges; sub new($$$) { my $class = shift; my $max_length = shift; my $ranges_a = shift; my @lookup; for ( @{$ranges_a} ) { my ( $start, $end ) = @$_; my @idx = $end >= $start ? $start .. $end : ( $start .. $max_length, 1 .. $end ); for my $i (@idx) { $lookup[$i] .= pack 'L', $end } } bless \@lookup, $class; } sub num_ranges_containing($$$) { my $self = shift; my ( $start, $end ) = @_; # query range coordinates return 0 unless ( defined $self->[$start] ) ; # no ranges overlap the start position of the query if ( $end >= $start ) { # query range is simple # any inverted range in {LOOKUP}[$start] must contain it, # and so does any simple range which ends at or after $end return 0 + grep { $_ < $start or $end <= $_ } unpack 'L*', $self->[$start]; } else { # query range is inverted # only inverted ranges in {LOOKUP}[$start] which also end # at of after $end contain it. simple ranges can't contain # the query range return 0 + grep { $_ < $start and $end <= $_ } unpack 'L*', $self->[$start]; } } 1;
    Perhaps this could be done a bit more efficiently, I'm not sure.

    Thanks for noting the bug, BrowserUK! I'm sure you saved me a lot of future frustration.

    And if you indeed think it now works and also pretty much optimized, I'd be happy to get some ideas for the original question.

      I've spent a couple of days thinking about this and playing with alternative solutions, but nothing comes close to being a fast as what you're doing now.

      The problem of course, is that what you're doing now trades truly huge amounts of space for its speed. And so we come back to the problem of how to avoid having to either:

      • Save and load the huge lookup table quickly.
      • Or, generate it from the more compact representation more quickly.

      And given the huge limits you are working with--30,000 X ranges from 0 .. 15,000,000--I don't see a solution. The presence of "inverted" ranges, in concert with the 15e6 overall range, defy all my attempts at a more compact representation. And I don't believe you'll beat Storable easily either.

      In short. I have nothing to offer your original question beyond one possibility.

      There is a data structure, QuadTree that might come close to achieving the performance of your current solution, but use far less memory. The problem is that writing such a solution in terms of Perl's existing data structures to achieve a pure perl solution isn't effective.

      You would need to obtain a C-library and write a wrapper for it. And there is no guarantee that the solution would perform as well.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Thank you BrowserUk. I greatly appreciate your kind attention.

        I guess I will stick with the original, space-consuming implementation. Does it makes sense to compress the store files? If so, can you think of any particular module/method that is most appropriate for compressing this kind of data (I usually use IO::Compress::Gzip but I know there are other options too.

        Thanks again for your help,
        Dave.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://862001]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (12)
As of 2014-09-16 16:07 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (33 votes), past polls