http://www.perlmonks.org?node_id=1011800


in reply to Evolving a faster filter?

As an aside: thousands of objects do not normally come into being instantaneously or en masse, but rather over time.

As your filters are static for any given run of the program, why not pass them through the filters at instantiation time and set a flag? (And perhaps re-run it any time one of the filtered attributes changes.)

Then when you need the filtered subset, you only need a single boolean test for each object.
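A minimal sketch of that flag-at-instantiation idea (the filter subs and the attribute names `x` and `y` are invented for illustration; they are not from the original post):

```perl
use strict;
use warnings;

# Hypothetical static filters: each sub returns true if the object passes.
my @filters = (
    sub { $_[0]->{x} > 10 },
    sub { $_[0]->{y} ne 'excluded' },
);

sub new_object {
    my %attrs = @_;
    my $self  = { %attrs };

    # Run the (static) filters once, at instantiation, and cache the verdict.
    $self->{passes} = 1;
    for my $f ( @filters ) {
        unless ( $f->( $self ) ) {
            $self->{passes} = 0;
            last;
        }
    }
    return $self;
}

# At time-of-need, selection is a single boolean test per object.
my @objects  = ( new_object( x => 42, y => 'ok' ),
                 new_object( x =>  3, y => 'ok' ) );
my @selected = grep { $_->{passes} } @objects;
```

If a filtered attribute later changes, the same loop can be re-run for just that one object to refresh its flag.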


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Evolving a faster filter?
by Ovid (Cardinal) on Jan 06, 2013 at 13:26 UTC

    That sounds like a nice idea, but though the filters are roughly static, their behavior changes quite a bit with the data. For one run, I might want to exclude all objects with property X, with a value of Y below a particular threshold, and which weren't previously selected in the past hour. For the next run, X might be irrelevant (though the filter still gets run, because we don't know that until it has run), Y might have a different threshold, and we might not care about when the object was selected. So multiple filter runs over the same set of objects have, literally, millions of possible different outcomes.

      For one run, I might want to exclude all objects with property X, with a value of Y below a particular threshold, and which weren't previously selected in the past hour. ... So multiple filter runs over the same set of objects have, literally, millions of possible different outcomes.

      Hm. Sounds like one of those 'community driven' "We also recommend..." things that the world+dog have added to their sites recently.

      But it still makes me wonder whether you couldn't distribute the load somehow.

      That is, is it really necessary to perform the entire decision process every time, by running all the filters at the exact instant of need?

      Or could you re-run each of the filters (say) once per hour against the then-current dataset, and only amalgamate the results and make your final selection at the point of need?

      You might (for example), run each filter individually and store its result in the form of a bitstring where each position in the bitstring represents a single object in the set. Then, at the time-of-need, you combine (bitwise-AND) the latest individual bitstrings from all the filters to produce the final selection.
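      A small illustrative sketch of that scheme, using vec() to build one bitstring per filter (the objects and filter subs here are made up for the example; they are not from the original post):

```perl
use strict;
use warnings;

# Invented sample data: 8 objects, 2 filters.
my @objects = map { { id => $_, value => $_ * 3 } } 0 .. 7;
my @filters = (
    sub { $_[0]->{value} >  5 },   # hypothetical filter A
    sub { $_[0]->{value} < 20 },   # hypothetical filter B
);

# One bitstring per filter; bit $i represents object $i.
my @bits;
for my $f ( 0 .. $#filters ) {
    $bits[ $f ] = "\0" x int( ( @objects + 7 ) / 8 );
    for my $i ( 0 .. $#objects ) {
        vec( $bits[ $f ], $i, 1 ) = $filters[ $f ]->( $objects[ $i ] ) ? 1 : 0;
    }
}

# At time-of-need: bitwise-AND the latest bitstrings for the final selection.
my $mask = $bits[ 0 ];
$mask &= $bits[ $_ ] for 1 .. $#filters;

my @selected = grep { vec( $mask, $_, 1 ) } 0 .. $#objects;
```

      Each filter's bitstring can be rebuilt on its own schedule; only the cheap AND happens at selection time.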

      With 100,000 objects, a single filter's result is a ~12.5KB bitstring (one bit per object). Times (say) 100 filters, and it takes about 2.5MB to store both the current and the in-progress filter sets.

      Combining those 100 bitstrings of 100,000 bits each is very fast:

      use Math::Random::MT qw[ rand ];

      $filters[ $_ ] = pack 'Q*', map int( 2**64 * rand() ), 0 .. 1562 for 0 .. 99;

      say time();
      $mask = chr( 0xff ) x 12504;
      $mask &= $filters[ $_ ] for 0 .. 99;
      say time();

      # Output:
      # 1357485694.21419
      # 1357485694.21907

      Less than 5 milliseconds!

      That assumes your application could live with the filters being run once per time period (say, once per hour or half-hour, or whatever works for your workloads), rather than in full at every time-of-need.

      (NOTE: this is not the method used in my other post, which pushes the full 100,000 objects through 100 filters in 0.76 seconds; but without a feel for how long your current runs take, there is no way to assess how realistic either approach would be.)


      So there's no way to tell how many objects a filter will take out until the filter has been run?