Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

Let me describe the complete scenario and try to make things more clear.

I start by generating 1000 sets of ranges. Generating the sets is done using some other data that needs parsing etc., but it is done only once. From now on, the sets are fixed and I do not change them.

I chose to store each set of ranges as a hash with key= range start point, value = array of lengths of all ranges that start at that start point. For example, the set of ranges ((100,200),(50,70),(50,100)) will be stored as: {100=>[100], 50=>[20,50]}

Each of these sets is stored to disk. I can then generate a "counts array" by retrieving the required set, converting it into an inversion list (as moritz so kindly suggested) and iterating the whole range once. This works great. I can also do the same for each of the 1000 sets and aggregate the results for each position, then derive the mean and standard deviation for each position in the whole range.

Now, there is another kind of queries I need to answer. Given some condition on ranges, e.g. "start point < x && end point > y", I would like to generate the "counts array" after removing all ranges that answer the condition. So I load the original sets, remove all ranges that apply, and continue through the pipeline. Obviously, I do not store the filtered sets (the filtering was only done for the purpose of the query).

One last type of query is a bit different - instead of generating a "counts array", I would like to tell how many ranges answer some condition (e.g. as the e.g. before). This is also why I think I have to keep the original set and not just the inversion list, which I, by the way, do not store but generate on the fly when "counts array" queries are run.

To conclude, I think I do need to save the original sets on disk, since generating the sets from scratch takes some time and requires reading loading other data as I mentioned in the beginning. I think it makes sense to generate the sets once, then load + filter/answer queries.

But perhaps I can store the sets more efficiently? Maybe storing them all in one file is better?

I should note that some sets contain up to 80000 keys (start points) and take 1.5MB when stored (using nstore), so putting together 1000 sets like this might not be such a good idea.

As always, your ideas are welcomed. I will be more than happy to supply more clarifications.

Thank you all.

In reply to Re^3: Optimizing a double loop by roibrodo
in thread Optimizing a double loop by roibrodo

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or or How to display code and escape characters are good places to start.
Log In?

What's my password?
Create A New User
Domain Nodelet?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (6)
As of 2021-10-24 19:17 GMT
Find Nodes?
    Voting Booth?
    My first memorable Perl project was:

    Results (89 votes). Check out past polls.