Speeding up data lookups
by suaveant (Parson) on Sep 19, 2005 at 13:28 UTC
suaveant has asked for the wisdom of the Perl Monks concerning the following question:
I work for a financial company, and due to some changes in the industry we have to take some processing that used to run over two hours and make it run in 10-15 minutes instead. Fun! I have been assigned to streamline it, and I thought I would bounce some ideas off the monks.
Background: the current system uses what we call shells (just glorified ordered files: fixed-length records with honking amounts of data in them) and holdings files, which keep track of what a user wants us to process for them. The holdings file is iterated through, and the shell is searched with a binary search for smaller holdings and by straight iteration for larger ones.
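For reference, the lookup side of that setup can be sketched as a binary search over the fixed-length records, seeking directly to each midpoint. The record length, key length, and key position here are illustrative assumptions, not the actual shell layout:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of a binary search over a fixed-length-record "shell" file.
# Assumed layout (not from the post): each record is $REC_LEN bytes
# and starts with a $KEY_LEN-byte identifier; records sorted by key.
my $REC_LEN = 64;
my $KEY_LEN = 8;

sub shell_lookup {
    my ($fh, $key) = @_;
    my $size = -s $fh;                       # file size in bytes
    my ($lo, $hi) = (0, int($size / $REC_LEN) - 1);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        seek $fh, $mid * $REC_LEN, 0 or return;
        read $fh, my $rec, $REC_LEN;
        my $cmp = substr($rec, 0, $KEY_LEN) cmp $key;
        if    ($cmp < 0) { $lo = $mid + 1 }
        elsif ($cmp > 0) { $hi = $mid - 1 }
        else             { return $rec }     # found: hand back the record
    }
    return;                                  # not found
}
```

Each probe costs a seek plus a small read, so a lookup in a 700M shell of 64-byte records is about 23 probes.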
My first thought was to move it into a database. I am pretty good with MySQL, but I am having trouble actually making this go faster, even using HANDLER calls, so I may abandon that approach.
My next thought is to daemonize the process. Right now each of over 1000 reports is started as its own process and handles all its own reading of the shell. If I daemonize it, some caching could be done and fewer perl procs would need to be started. I see two possibilities for speeding this up: 1) key caching, 2) read the whole damn shell into memory.
Now, as to reading the whole shell into memory: we have different shells to work with, the largest being 700M, and this stuff runs on big Sun boxes with 8 procs and 16GB of memory, both of which can be bumped up some. The only thing that would make this really work is if I am right in remembering that when you fork a process, the parent's memory is shared by all the children until that memory is written to. Is that right? If so, I could read the whole shell, fork off X children, and have the info read in parallel by multiple children without completely blowing out the system memory.
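The fork pattern I have in mind would look roughly like this. The worker count and the stand-in data are made up for illustration; on Solaris and Linux the parent's pages are copy-on-write after fork, so children that only read the structure keep sharing the same physical memory:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: load the shell once in the parent, then fork workers that
# only read it. Pages stay shared copy-on-write until written.
# @shell and $N_WORKERS are illustrative, not from the post.
my @shell = map { sprintf "REC%05d", $_ } 1 .. 100_000;

my $N_WORKERS = 4;
my @pids;
for (1 .. $N_WORKERS) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # child: read-only access to @shell keeps its pages shared
        my $count = grep { /^REC000/ } @shell;
        exit 0;
    }
    push @pids, $pid;
}
waitpid $_, 0 for @pids;
```

One caveat worth knowing: perl's reference counting writes to the header of every scalar it touches, so even "read-only" access dirties some pages over time. Sharing is still a big win, but it is not perfect.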
Another thing: if I do read the whole thing into memory, should I still use a binary search? I am thinking that if the identifier list I am working from is also in order, there is a lot of opportunity to speed up a binary search with some custom code. For instance, once I have seen identifier 8000, I know that no further identifiers will be below 8000, so I can search only from there forward. I could also compare the two keys and guess how far forward to set my midpoint, to shorten the search.
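That "never look backward" idea can be sketched like so: each search leaves its lower bound where it finished, and the next (larger) identifier resumes from there. This is array-based for simplicity, and the sample keys are illustrative; it assumes the wanted list has no duplicates:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Binary search with an advancing lower bound: both the shell keys and
# the wanted identifiers are sorted, so $lo never has to move backward.
sub ordered_lookups {
    my ($shell, $wanted) = @_;    # both sorted arrayrefs of keys
    my ($lo, @found) = (0);
    for my $key (@$wanted) {
        my $hi = $#$shell;
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            if    ($shell->[$mid] lt $key) { $lo = $mid + 1 }
            elsif ($shell->[$mid] gt $key) { $hi = $mid - 1 }
            else  { push @found, $key; $lo = $mid + 1; last }
        }
        # on a miss, $lo is the insertion point: everything below it
        # is smaller than $key, so the next key resumes from there
    }
    return \@found;
}
```

Since the window only ever shrinks, a full pass over the holdings does at most O(n + m log n) comparisons, and runs of nearby identifiers get very cheap.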
But is there a better in-memory method, or would simple key caching against the file be better in the long run? Whatever I use to search can't take too long to preload, since report processing must begin immediately after the shell is updated...
Any thoughts would be appreciated.