This really is interesting. What you are trying to do is get a tenfold speed increase by changing the way you handle these data. Unless your present programs are extremely inefficient, you clearly will not get there by making a few small changes.

As my predecessors have already pointed out (++ to all of them), a complete rethink is necessary.

A database approach seems promising, but it will only really come into its own when you can amortize the cost of loading the data into the database over many queries. Otherwise the loading itself becomes the bottleneck and you are worse off than before.
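To make the amortization point concrete, here is a minimal sketch (in Python, purely for illustration; the same shape works with DBI and DBD::SQLite). The table name `shell`, the columns `sec_id` and `value`, and the sample rows are all my assumptions, since we have not seen the real record layout.

```python
import sqlite3

def build_db(path, rows):
    """One-off load: this is the cost you want to pay once, not on every run."""
    con = sqlite3.connect(path)
    # Hypothetical layout: one value per security identifier.
    con.execute(
        "CREATE TABLE IF NOT EXISTS shell (sec_id TEXT PRIMARY KEY, value REAL)"
    )
    con.executemany("INSERT OR REPLACE INTO shell VALUES (?, ?)", rows)
    con.commit()
    return con

def lookup(con, sec_id):
    """Cheap indexed lookup; returns None when the identifier is unknown."""
    row = con.execute(
        "SELECT value FROM shell WHERE sec_id = ?", (sec_id,)
    ).fetchone()
    return row[0] if row else None

# ":memory:" keeps the sketch self-contained; in practice the path would be
# a file on disk so that later runs can skip build_db() entirely.
con = build_db(":memory:", [("AAA", 1.5), ("BBB", 2.25)])
print(lookup(con, "AAA"))
```

The payoff only appears if `build_db` runs far less often than `lookup`. If every run has to reload everything, this buys you nothing.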

Perhaps nobody has come up with a winning solution yet because we are all groping around in the dark. Other than that you have huge data files and smaller holdings files, we do not know much about the actual tasks you are being asked to do. For example: why do smaller holdings files use a binary search and larger holdings files an iterative approach? How small is "smaller" and how large is "larger" as far as holdings files go? What process is making the "shells"? Can this process be changed to load a database, or to provide a separate index file that maps the security identifiers directly to the data in the shell? Can the security identifiers be hashed, so you have an even faster search mechanism than binary search? B-trees perhaps?

A lot of these problems have already been solved by databases, of course, but perhaps you do not need the overhead of a real database engine (if you are only searching, you do not need the code to update, delete, index, ...) and can borrow just the techniques you need for the minimum you have to do.
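As a rough sketch of the "separate index" idea: scan the shell once, remember the byte offset of each record, and afterwards jump straight to a record with a seek instead of searching. I am assuming one record per line with the security identifier as the first comma-separated field, which may well not match your real format. In Perl the same thing is a hash plus tell/seek.

```python
import io

def build_index(fh):
    """Map each security identifier to the byte offset of its record."""
    index = {}
    while True:
        offset = fh.tell()
        line = fh.readline()
        if not line:
            break
        # Assumed layout: identifier is everything before the first comma.
        sec_id = line.split(b",", 1)[0]
        index[sec_id] = offset
    return index

def fetch(fh, index, sec_id):
    """Hash lookup plus one seek; returns None for an unknown identifier."""
    offset = index.get(sec_id)
    if offset is None:
        return None
    fh.seek(offset)
    return fh.readline().rstrip(b"\n")

# An in-memory stand-in for a shell file opened in binary mode.
shell = io.BytesIO(b"AAA,1.5\nBBB,2.25\nCCC,3.0\n")
index = build_index(shell)
print(fetch(shell, index, b"BBB"))
```

If the process that writes the shells could emit this index alongside them, the lookup side would never need to scan the shell at all.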

Another question: how much do the shells differ from one run to the next? The same goes for the holdings file. Can you somehow deal only with the differences and calculate a delta between the previous value and the new value? (I was inspired by video codecs that store only the differences between frames in order to save on storage and processor resources.) You will have to do some "resyncing" from time to time to check that you are still on track, but perhaps that can be done in a quiet moment when you can spend a few hours running a full update?
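The delta idea could look something like the sketch below. It assumes each run can be reduced to a mapping from security identifier to value (my invention, with made-up sample data), and it says nothing about how you would detect changes cheaply in the real files.

```python
def diff(prev, curr):
    """Return what changed between two runs: new or updated entries, and removals."""
    changed = {k: v for k, v in curr.items() if k not in prev or prev[k] != v}
    removed = [k for k in prev if k not in curr]
    return changed, removed

def apply_delta(state, changed, removed):
    """Bring the cached state up to date using only the differences."""
    state.update(changed)
    for k in removed:
        state.pop(k, None)
    return state

prev = {"AAA": 1.5, "BBB": 2.25, "CCC": 3.0}
curr = {"AAA": 1.5, "BBB": 2.5, "DDD": 4.0}

changed, removed = diff(prev, curr)
state = apply_delta(dict(prev), changed, removed)

# The occasional "resync": compare the patched state against a full rebuild.
print(state == curr)
```

Only the entries in `changed` and `removed` need any further processing, so the work per run scales with how much actually moved rather than with the size of the files.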

So many questions, so few answers.

If you can be a bit more concrete, I'm sure our collective mind will give you some valuable pointers.

CountZero

"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

In reply to Re: Speeding up data lookups by CountZero
in thread Speeding up data lookups by suaveant
