
Re^2: Optimizing a double loop

by BrowserUk (Pope)
on Jun 04, 2010 at 01:08 UTC ( #843037=note )

in reply to Re: Optimizing a double loop
in thread Optimizing a double loop

GrandFather's right that IO is the dominant cost here.

So why not avoid the IO completely?

Rather than writing the numbers to disk and then reading them back to accumulate the sums and sums of squares, do the accumulation in the original process.

The cost of accumulating these in the original process will be far less than that of writing the values to disk.
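A minimal sketch of that idea in Perl, using toy data in place of the real generated rows (the per-position mean and standard deviation then fall straight out of the two running totals):

```perl
use strict;
use warnings;

# Toy data standing in for the generated rows; in the real pipeline each
# row would be one "counts array" produced in memory, never written to disk.
my @rows = ( [ 1, 2 ], [ 3, 4 ] );

my ( @sum, @sos );    # per-position sums and sums of squares
my $n = 0;

for my $row (@rows) {
    for my $i ( 0 .. $#$row ) {
        $sum[$i] += $row->[$i];
        $sos[$i] += $row->[$i] ** 2;
    }
    ++$n;
}

# Mean and (population) standard deviation per position.
my @mean = map { $_ / $n } @sum;
my @sd   = map { sqrt( $sos[$_] / $n - $mean[$_] ** 2 ) } 0 .. $#sum;

printf "pos %d: mean %.2f sd %.2f\n", $_, $mean[$_], $sd[$_] for 0 .. $#mean;
```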

Replies are listed 'Best First'.
Re^3: Optimizing a double loop
by roibrodo (Sexton) on Jun 04, 2010 at 12:52 UTC

    Let me describe the complete scenario and try to make things more clear.

    I start by generating 1000 sets of ranges. Generating the sets is done using some other data that needs parsing etc., but it is done only once. From now on, the sets are fixed and I do not change them.

    I chose to store each set of ranges as a hash with key = range start point, value = array of lengths of all ranges that start at that point. For example, the set of ranges ((100,200),(50,70),(50,100)) will be stored as: {100=>[100], 50=>[20,50]}.
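    That encoding takes only a few lines to build; a sketch, treating end points as exclusive so that length = end - start (consistent with (100,200) having length 100):

```perl
use strict;
use warnings;

# Build the start-point => [lengths] representation from [start, end] pairs.
my @ranges = ( [ 100, 200 ], [ 50, 70 ], [ 50, 100 ] );

my %set;
for my $r (@ranges) {
    my ( $start, $end ) = @$r;
    push @{ $set{$start} }, $end - $start;
}

# %set is now ( 100 => [ 100 ], 50 => [ 20, 50 ] )
```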

    Each of these sets is stored to disk. I can then generate a "counts array" by retrieving the required set, converting it into an inversion list (as moritz so kindly suggested) and iterating the whole range once. This works great. I can also do the same for each of the 1000 sets and aggregate the results for each position, then derive the mean and standard deviation for each position in the whole range.
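    A sketch of that counts-array step, assuming the hash encoding above: each range contributes +1 at its start and -1 just past its end, and one prefix-sum pass over the whole range yields the per-position counts:

```perl
use strict;
use warnings;

my %set       = ( 100 => [ 100 ], 50 => [ 20, 50 ] );
my $range_len = 250;    # length of the whole range (made up for illustration)

# Delta array: +1 where a range opens, -1 just past where it closes.
my @delta = (0) x ( $range_len + 1 );
while ( my ( $start, $lens ) = each %set ) {
    for my $len (@$lens) {
        ++$delta[$start];
        --$delta[ $start + $len ];
    }
}

# A single prefix-sum pass turns deltas into per-position coverage counts.
my @counts;
my $running = 0;
for my $i ( 0 .. $range_len - 1 ) {
    $running += $delta[$i];
    push @counts, $running;
}
```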

    Now, there is another kind of query I need to answer. Given some condition on ranges, e.g. "start point < x && end point > y", I would like to generate the "counts array" after removing all ranges that satisfy the condition. So I load the original sets, remove all ranges that apply, and continue through the pipeline. Obviously, I do not store the filtered sets (the filtering is only done for the purpose of the query).

    One last type of query is a bit different: instead of generating a "counts array", I would like to tell how many ranges satisfy some condition (e.g. the one above). This is also why I think I have to keep the original sets and not just the inversion lists, which, by the way, I do not store but generate on the fly when "counts array" queries are run.
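    Both query types can work off the same hash in one pass; a sketch, again treating end points as exclusive, with made-up condition parameters $x and $y:

```perl
use strict;
use warnings;

my %set = ( 100 => [ 100 ], 50 => [ 20, 50 ] );
my ( $x, $y ) = ( 60, 90 );    # example parameters for "start < x && end > y"

my $matched = 0;    # answers the "how many ranges match" query
my %filtered;       # feeds the filtered counts-array pipeline
while ( my ( $start, $lens ) = each %set ) {
    for my $len (@$lens) {
        if ( $start < $x && $start + $len > $y ) {
            ++$matched;    # count it, and leave it out of %filtered
        }
        else {
            push @{ $filtered{$start} }, $len;
        }
    }
}
```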

    To conclude, I think I do need to save the original sets on disk, since generating them from scratch takes some time and requires loading other data, as I mentioned in the beginning. I think it makes sense to generate the sets once, then load and filter/answer queries.

    But perhaps I can store the sets more efficiently? Maybe storing them all in one file is better?

    I should note that some sets contain up to 80000 keys (start points) and take 1.5MB when stored (using nstore), so putting together 1000 sets like this might not be such a good idea.

    As always, your ideas are welcomed. I will be more than happy to supply more clarifications.

    Thank you all.

      But perhaps I can store the sets more efficiently? Maybe storing them all in one file is better?

      Given a 4GB file containing 1000 sets of 1 million integers stored in packed binary form, the following Inline::C code accumulated the sums and sums of squares in just under 40 seconds. With some more effort that could be parallelised, but it probably isn't necessary:

```perl
#! perl -slw
use 5.010;
use strict;

use Inline C => Config => BUILD_NOISY => 1;
use Inline C => <<'END_C', NAME => '_842899', CLEAN_AFTER_BUILD => 0;

/* Accumulate per-position sums and sums of squares over n 32-bit ints. */
void sumEm( SV *acc, SV *SoS, SV *in, int n ) {
    int *iAcc = (int*)SvPVX( acc );
    int *iIn  = (int*)SvPVX( in );
    int *iSoS = (int*)SvPVX( SoS );
    int i;
    for( i = 0; i < n; ++i ) {
        iAcc[ i ] += iIn[ i ];
        iSoS[ i ] += iIn[ i ] * iIn[ i ];
    }
}
END_C

use Time::HiRes qw[ time ];

my $start = time;

## 1e6 32-bit accumulators each, zero-initialised.
my $acc = chr(0) x 4e6;
my $sos = $acc;

open I, '<:raw', 'cells' or die $!;
while( sysread( I, my $row, 4e6 ) ) {    # one packed set of 1e6 ints per read
    sumEm( $acc, $sos, $row, 1e6 );
}
close I;

printf "Took: %.6f seconds\n", time() - $start;
<STDIN>;

my @sums = unpack 'V*', $acc;
my @SoSs = unpack 'V*', $sos;
print "$sums[ $_ ] : $SoSs[ $_ ]" for 0 .. $#sums;
__END__

C:\test>
Took: 39.318000 seconds
```

      So, depending on how long your current method takes, it might be worth considering.

      For your two "other types of query", on the surface at least, it sounds like they could be solved quite easily using SQL. And calculating means and standard deviations is bread-and-butter SQL.

      Given the more realistic volumes of data you are now describing, an RDBMS, or even SQLite, seems like it might be a good fit for your tasks.
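      For example, with one row per range, the counting query collapses to a single SELECT; a sketch using DBI with DBD::SQLite (assumed installed), where the table and column names are made up for illustration:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
                        { RaiseError => 1 } );

$dbh->do('CREATE TABLE ranges ( set_id INTEGER, start INTEGER, end INTEGER )');

my $ins = $dbh->prepare('INSERT INTO ranges VALUES ( ?, ?, ? )');
$ins->execute( 1, @$_ ) for [ 100, 200 ], [ 50, 70 ], [ 50, 100 ];

# "How many ranges satisfy start < x && end > y?" with x = 60, y = 90.
my ($n) = $dbh->selectrow_array(
    'SELECT COUNT(*) FROM ranges WHERE start < ? AND end > ?',
    undef, 60, 90 );
```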

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
