
Speeding up data lookups

by suaveant (Parson)
on Sep 19, 2005 at 17:28 UTC ( #493212=perlquestion )

suaveant has asked for the wisdom of the Perl Monks concerning the following question:

I work for a financial company, and due to some changes in the industry, processing that used to run over two hours now has to finish in 10-15 minutes instead. Fun! I have been assigned to streamline it, and I thought I would bounce some ideas off the monks.

Background: The current system uses what we call shells (just a glorified ordered file, fixed length, with honking amounts of data in it) and holdings files, which keep track of what a user wants us to process for them. The holdings file is iterated through, and the shell is searched with a binary search for smaller holdings and by iteration for larger holdings.

My first thought was maybe to database it, I am pretty good with MySQL but I am having trouble actually getting this to go faster, even using HANDLER calls, so I may abandon this approach.

My next thought is to daemonize the process. Right now each of over 1000 reports is started as its own process and handles all its own reading of the shell... I figure I could daemonize the process so that some caching could be done, and less perl procs would need to be started. I figure there are two possibilities for speeding this up... 1) key caching, 2) read the whole damn shell into memory.

Now, as to reading the whole shell into memory... we have different shells to work with, the largest being 700MB... this stuff is running on big Sun boxes with 8 procs and 16GB of memory, both of which can be bumped up some. The only thing that would make this really work is if I am right in remembering that when you fork a process, memory from the parent is shared by all the children until that memory is written to. Is that right? If so, I could read the whole shell, fork off X children, and have parallel reading of the info in multiple children without completely blowing out the system memory.
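That copy-on-write arrangement might be sketched like this. The shell data, the holdings list, and `process_holding` are hypothetical stand-ins; the point is only that the big structure is loaded once, before the forks, and the children treat it as read-only so its pages stay shared.

```perl
use strict;
use warnings;

# Stub for the real per-holdings work; a real version would read the
# holdings file and look its identifiers up in the shared shell.
sub process_holding {
    my ($shell, $holdings_file) = @_;
}

# Load the shell once in the parent, then fork N workers that share
# it via copy-on-write.  Returns the number of workers reaped.
sub run_workers {
    my ($shell, $holdings, $nworkers) = @_;
    # Round-robin the holdings files across the workers.
    my @bucket;
    for my $i (0 .. $#$holdings) {
        push @{ $bucket[$i % $nworkers] }, $holdings->[$i];
    }
    my @pids;
    for my $b (@bucket) {
        next unless $b;
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                  # child: $shell shared via COW
            process_holding($shell, $_) for @$b;
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;              # reap all workers
    return scalar @pids;
}
```

The key caveat is the read-only discipline: any write into the shared structure (even an incidental one, like Perl upgrading a scalar) copies that page into the child.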

Another thing... if I do read the whole thing into memory, should I still use a binary search? I am thinking that if the identifier list I am working from is also in order, there is a lot of opportunity for speeding up a binary search with some custom code. For instance, if I see identifier 8000, I know that no further identifiers will be below 8000, so I can search only from there forward. I could also probably compare the two keys and guess how far forward to set my midpoint, to try to shorten my search.
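The "no further identifiers below 8000" idea amounts to carrying a moving floor between searches. A minimal sketch, assuming numeric identifiers and an in-memory sorted array standing in for the shell:

```perl
use strict;
use warnings;

# Search a sorted key list against a sorted shell, raising the lower
# bound after each lookup so later searches scan a smaller window.
# Returns a hashref of key => index for the keys that were found.
sub search_sorted_keys {
    my ($shell, $keys) = @_;       # both sorted ascending
    my %found;
    my $lo = 0;                    # floor carried across lookups
    for my $key (@$keys) {
        my ($l, $h) = ($lo, $#$shell);
        my $hit;
        while ($l <= $h) {
            my $mid = int(($l + $h) / 2);
            my $cmp = $shell->[$mid] <=> $key;
            if    ($cmp < 0) { $l = $mid + 1 }
            elsif ($cmp > 0) { $h = $mid - 1 }
            else             { $hit = $mid; last }
        }
        if (defined $hit) {
            $found{$key} = $hit;
            $lo = $hit + 1;        # later keys must lie further on
        }
        else {
            $lo = $l;              # even a miss narrows the window
        }
    }
    return \%found;
}
```

The midpoint-guessing refinement would be interpolation search layered on top of this; whether it beats plain compares is exactly the kind of thing to benchmark rather than assume.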

But is there a better in-memory method, or would simple key caching and using a file be better in the long run? Whatever I use to search can't take too long to preload, since report processing must begin immediately after the shell is updated...

Any thoughts would be appreciated.

                - Ant
                - Some of my best work - (1 2 3)

Replies are listed 'Best First'.
Re: Speeding up data lookups
by dragonchild (Archbishop) on Sep 19, 2005 at 18:02 UTC
    A database is definitely sounding like the appropriate way to go. However, you cannot just shove data into it and expect it to magically go 10x faster. I know you know this, but data stored in a relational form often requires a complete rethinking of the code that uses it. I ran into this recently, so I thought it might be useful for you to be reminded of that. An RDBMS is not a drop-in replacement for anything other than another RDBMS.

    Basically, if you need a 10x speedup, you need to approach the problem from scratch, looking at first principles. It sounds like you need to rethink how you organize your data so that it is more efficient for your normal lookups. This means you need to evaluate what kind of questions you're asking. Often, we think we know the kinds of questions we're asking because it's been working for years, but we're really asking the questions in a slightly different way, which can mean all the difference in the world. This may mean rewriting the code asking the questions, and possibly even how you gather the questions to ask from the user. (Maybe a "holdings file" isn't the appropriate way to handle things ...)

    Basically, the only items you should be keeping when doing this effort are

    1. The raw data
    2. The finished product

    Very often, I find one needs to completely transform the raw data into something else in order to facilitate lookups. Once, I needed to perform three transformations in order to get speed where it needed to be.

    Something else to keep in mind - that 10-15min isn't for one query - it's for N queries. That translates into M queries/second where you're currently at K queries/second. I'd focus on getting K to M, instead of the whole thing at once. Trying to deal with thousands of concurrent items is ... well ... daunting.

    As for MySQL-isms, have you considered turning the query_cache on?
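For reference, turning it on is a server-side setting; the sizes below are made-up starting points, not recommendations:

```ini
# my.cnf fragment (assumed values; tune to the working set)
[mysqld]
query_cache_type = 1       # cache SELECT result sets
query_cache_size = 256M    # shared across all connections
```

Note the cache is invalidated wholesale on any write to a queried table, so it only helps here if the shell tables sit untouched while the reports run.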

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Speeding up data lookups
by QM (Parson) on Sep 19, 2005 at 18:22 UTC
    Benchmark and Profile your code, or you're tilting at windmills.

    I prefer Devel::SmallProf.

    Quantum Mechanics: The dreams stuff is made of

      Of course! :)

      I have used Devel::DProf before... I will look at SmallProf...

                      - Ant
                      - Some of my best work - (1 2 3)

        I like SmallProf because it does line by line.

        I needed more out of the post-processing script, so I rolled my own. I'm still debating whether to post it or not.

        Update: It's posted here

        Quantum Mechanics: The dreams stuff is made of

      Well, frankly, I'd say that this is the last advice to give. Much better is a period of reflection and analysis, followed by some research. Then you put together a system using sound algorithms. Finally, when you need to squeeze some extra speed out of the system, you do profiling.

      The point is that if you are using bubble sort to reorder a million-record file in random order, you can profile it all you want without ever making it go as fast as a mergesort would have.

      IOW: think first, then profile.


        IOW: think first, then profile.
        Yes, exactly. But that assumes we took the time to think in the first place.

        I would go further and suggest a conscious design phase in the beginning. If it's not a one-shot, and speed is important, then start with requirements, design to them, possibly experimenting when choosing between implementations.

        But given an existing system that runs too slowly, profile it now. If it's a bubble sort on a million records, that will be highlighted in million candle spotlights.

        The reason I suggest Devel::SmallProf is because it is more granular than the other oft-mentioned profilers. If the program is one giant subroutine, Devel::DProf will indicate that most of the time is spent in that routine. Devel::SmallProf will indicate which line of code is the worst, the next worst, etc. (for certain values of "line of code").

        So I agree with you: "Think before Do". But sometimes "Do" has already happened, and it's generally more efficient to let the instrumentation tell you where to start looking. It may be wrong, but I'd bet it's right often enough to warrant first action.

        Quantum Mechanics: The dreams stuff is made of

Re: Speeding up data lookups
by CountZero (Bishop) on Sep 19, 2005 at 20:22 UTC
    This really is interesting. What you are trying to do is get a tenfold speed increase by changing the way you handle these data. It is clear that unless your present programs are extremely inefficient, you will not solve this assignment by making some small changes.

    As has already been pointed out before (++ to all my predecessors) a complete rethinking is necessary.

    A database approach seems promising, but it will really only come into its own when you can amortize the cost of loading the data into the database over many queries; otherwise the loading of the data becomes the bottleneck and you are worse off than before.

    The fact that until now nobody came up with a winning solution is perhaps because we are all groping around in the dark. Other than that you have huge data-files and smaller holdings-files, we don't know much about the actual tasks you are being asked to do. E.g. why are smaller holdings files using binary search and larger holdings files an iterative approach? How small is smaller and how large is larger, as far as holdings files go? What process is making the "shells"? Can this process be changed to load a database, or to provide a separate index file with the security identifiers directly mapping to the data in the shell? Can the security identifiers be hashed so you have an even faster search mechanism than binary search? B-trees perhaps? ... A lot of these problems have already been solved by databases, of course, but perhaps you do not need the overhead of a real database engine (if you are only searching, you do not need the code to update, delete, index, ...) and can extract only the necessary machinery to do the minimum you need.

    Another question: how much do the shells differ from one another on each successive run? Idem for the holdings files. Can you somehow only deal with the differences and calculate some delta between the previous value and the new value? (I was inspired by some video codecs that only store the differences between frames in order to save on storage and processor resources.) You will have to do some "resyncing" from time to time to check that you are still on track, but perhaps that can be done in a quiet moment when you can spend a few hours to run a full update?

    So many questions, so few answers.

    If you can be a bit more concrete, I'm sure our collective mind will give you some valuable pointers.


    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      Sorry... again, as I said, I wrote that part poorly... the processing happens over a two-hour window, but isn't actually taking two hours... what used to come in in bits and pieces is now all going to be coming in at once... a conclusive test has not been made yet, but I would guess the speedup needed is more like 2-3 times.

      The binary versus iterative is based on a comparison of the holdings size versus the shell size... it uses a calculation based on log and record numbers to work out the costs of each and compare.

      The shells will not change once the reports start, the reports wait for the data delivery and update, then run. Unfortunately every single report is customized to the user... if they were all the same things would be much, much easier.

                      - Ant
                      - Some of my best work - (1 2 3)

        Are you allowed to show a little glimpse of what a shell and a holding file look like?


        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Speeding up data lookups
by perrin (Chancellor) on Sep 19, 2005 at 19:02 UTC
    Running it in mod_perl or PPerl with no forking should speed things up a lot. Even just running it all from one script would be better than all that forking. Reading all your data into RAM can be very fast, but puts limits on how much you can scale as more data gets added. Putting it into a format like a dbm file where you can efficiently access individual records works well for some things, and doesn't need to read the whole thing into memory. It is somewhat faster than MySQL.

    Your understanding about copy-on-write shared memory is correct, but forking is not always useful. It's most effective in situations where you have a lot of I/O waiting.

      I don't think I was clear enough on that point. The idea of forking was to run maybe as many as 8 children that ran down a list of holdings files processing them, the idea being to take advantage of the many processors of the system... I wasn't planning on making a thousand children or anything like that. In this case, with some data processing happening in perl, this should improve things.

                      - Ant
                      - Some of my best work - (1 2 3)

Re: Speeding up data lookups
by pboin (Deacon) on Sep 19, 2005 at 19:01 UTC

    By monkeying around w/ your own searches, you're probably reinventing some wheels, and doing it rather ineffectively.

    Any high access / high volume data should be indexed as a rule of thumb. Even moving your data into something like SQLite would give you the benefit of indexing.

    If it were me, I'd definitely pursue some sort of robust indexing mechanism first, and probably think about threading and smart serialization next.

    Good luck -- that actually sounds like a fun piece of work.

      SQLite is a fantastic piece of work, but I found that once your dataset starts getting large, it becomes quite slow at dumping the data in. Once your sqlite file is over a Gig or so, inserts start slowing to a crawl compared to when the DB is empty (and that is using transactions and dumping in 1000 or 10000 rows at a time). Queries were still surprisingly fast, but inserts really sucked.

      It is of course possible that I was missing something, but I think using DBM files would be faster if you need to regenerate the databases often (i.e. new data files come in regularly).

      It does sound like a fun project though...

        For my type of work, I load SQLite DBs once, and then do various studies against that data. I load the entire shebang in one transaction, sometimes several million records in one go.

        I've never really studied the difference between the first 100,000 rows and, say, the tenth batch of 100,000, but that would be quite interesting to see. So, here's a benchmark:

        Inserting 4096 byte strings for 15s per benchmark, ten sets, all in one mongo transaction. Final file size is 2.3G.

        3122.67/s (n=50681) 3166.34/s (n=48825) 3165.59/s (n=50966) 3206.15/s (n=51619) 3157.13/s (n=50230) 3179.75/s (n=50399) 3188.18/s (n=50692) 3211.26/s (n=51316) 3098.32/s (n=49914) 3046.30/s (n=49807)

        I don't really see any slowdown, but maybe I should change the test a bit. At any rate, I'm sticking -- IMNSHO, SQLite is one of the most under-rated pieces of code out there. (When used as indicated.)
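For anyone wanting to reproduce the shape of that test, a minimal batched-insert sketch via DBI follows. It uses an in-memory database and a tiny row count to stay self-contained; swap in a file path and a real row source to measure the multi-gigabyte case. Table and column names are made up.

```perl
use strict;
use warnings;
use DBI;

# Insert a batch of rows inside a single transaction, as the
# benchmark above does, and return the resulting row count.
sub bulk_insert {
    my ($rows) = @_;
    my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                           { RaiseError => 1, AutoCommit => 1 });
    $dbh->do("CREATE TABLE shell (id INTEGER PRIMARY KEY, data TEXT)");
    $dbh->begin_work;                   # one transaction for the batch
    my $sth = $dbh->prepare("INSERT INTO shell (data) VALUES (?)");
    $sth->execute($_) for @$rows;
    $dbh->commit;
    my ($n) = $dbh->selectrow_array("SELECT COUNT(*) FROM shell");
    return $n;
}
```

Without the explicit transaction, SQLite commits (and syncs) per statement, which is usually where "inserts slow to a crawl" reports come from.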

Re: Speeding up data lookups
by sgifford (Prior) on Sep 19, 2005 at 19:07 UTC
    Using mmap will give you the same memory-sharing effect as reading everything in one process then forking, as well as being easier to manage, somewhat more efficient to read, and making it easier for your OS to manage the memory (data is only paged in as it's needed, and the system's buffer cache can drop pages if it needs the memory and get them back from disk later).

    Another possibility is to steal an idea from dbz, which indexes a fixed format text file by using the index values as the keys, and the file offset of the start of the record as values. This is fairly straightforward to do with Berkeley DB. A few well-chosen indexes can make a huge difference, without requiring a full rewrite into a relational database.

    So if you used mmap and a dbz-style index together, you'd get the file offset from the DB, then use substr to inspect the mmap'd data at that location.
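The index-plus-substr combination might look like the sketch below. A plain hash stands in for the Berkeley DB index (a DB_File tie would make it persistent), and the string `$shell` stands in for the mmap'd file. The 16-byte record layout (8-byte key, 8-byte payload) is a made-up example, not the real shell format:

```perl
use strict;
use warnings;

my $RECLEN = 16;   # assumed fixed record length
my $KEYLEN = 8;    # assumed key field width

# One pass over the fixed-length data to build key => byte-offset.
sub build_index {
    my ($shell) = @_;              # whole file as one string
    my %index;
    for (my $off = 0; $off < length $shell; $off += $RECLEN) {
        my $key = substr $shell, $off, $KEYLEN;
        $key =~ s/\s+$//;          # strip field padding
        $index{$key} = $off;
    }
    return \%index;
}

# Jump straight to a record; substr never copies the whole string.
sub lookup {
    my ($shell, $index, $key) = @_;
    my $off = $index->{$key};
    return undef unless defined $off;
    return substr $shell, $off, $RECLEN;
}
```

Building the offset index is a single sequential pass, so it fits the "shell updates, then reports must start immediately" constraint better than loading a full database would.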

      The way the data is currently stored there is only ever one key... the security identifier. mmap would probably make sense, though that is only C, yes? Or is there a way to do it in perl....

      If only C, then I would probably leave it as an option to look at if all Perl fails.

                      - Ant
                      - Some of my best work - (1 2 3)

        As graff says, using mmap from Perl is straightforward using the Mmap module, which will map the file to a Perl string. You do have to be careful how you access the data; some operations will cause Perl to make a copy of the string, which is a bit of a problem with a 1GB string. substr is pretty safe, and probably most other things as long as you're careful not to write anything to the string.

        I wrote up a sample grep implementation using mmap here: Re: anyone have some good mmap examples?


        You'll probably find a few nodes at PM that discuss this module.

Re: Speeding up data lookups
by scmason (Monk) on Sep 19, 2005 at 18:47 UTC
    Have you considered offloading some of the more CPU-intensive portions of the application to C? Please don't mod me down for this, but when your program has to do a lot in a short amount of time, Perl is not generally the best solution.

    "Never take yourself too seriously, because everyone knows that fat birds dont fly" -FLC
      Yes, and if we must, we must... the programmers who maintain this are not overly proficient in C, I do not believe... hell, I'm no C wizard, but I can get by.

      For maintainability and consistency we are first approaching the Perl side.

                      - Ant
                      - Some of my best work - (1 2 3)

        Inline::C is a great way to throw in just a little bit of C code. If DProf finds that there are a few routines using up most of your time, this makes it easy to rewrite just those routines in C while keeping the rest in Perl. But to get the tenfold speed you're looking for, you'll almost certainly need algorithm changes (such as using indexes), not just simple optimizations like these.
Re: Speeding up data lookups
by sauoq (Abbot) on Sep 20, 2005 at 01:42 UTC

    I'm going to go against the grain a bit here... I don't believe an RDBMS would be right for you. You don't seem to need flexibility. You just need speed. Moving to an RDBMS now is also likely to be costly in terms of code changes.

    I do think that you should be indexing your "shell" data though. I did something similar years ago by using an in-memory BerkeleyDB (DB_HASH). (It just so happens that I was working for a financial management company too.) Although what I did was all in C, I think you could actually do this pretty easily using BerkeleyDB from Perl and get all the real benefits. Keep in mind that, even though a 700MB file really isn't a big deal when you've got 16GB of RAM, Perl's representation of that data is going to include a lot of overhead, so it's best if you can skip it.
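The resulting report loop is just "index once, then one lookup per holding". A plain hash is shown below to keep the sketch self-contained; tying it with DB_File (DB_HASH, with an undef filename for an in-memory database) is the variant that sidesteps Perl's per-entry overhead on a 700MB shell. The record shapes are hypothetical:

```perl
use strict;
use warnings;

# Index the shell by security identifier, then make a single pass
# over the holdings doing constant-time lookups: O(N + M) instead of
# a search per holding.
sub run_report {
    my ($shell_records, $holdings) = @_;    # [[key, data], ...], [key, ...]
    my %by_key = map { $_->[0] => $_->[1] } @$shell_records;
    my @out;
    for my $want (@$holdings) {             # one pass, one lookup each
        push @out, [ $want, $by_key{$want} ]
            if exists $by_key{$want};
    }
    return \@out;
}
```

With DB_File the only change is a `tie %by_key, 'DB_File', undef, ...` before loading; the lookup code stays identical.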

    Doing it that way should have a pretty minimal effect on your code, all things considered. In fact, you might find it simplifies things. You won't need any heuristics to decide how to search your "shell"... you'll just iterate through your holdings doing lookups. I think that boils down to O(N) vs. O(N*M) compared to your current situation.

    That's the direction I'd be looking anyway.

    "My two cents aren't worth a dime.";
      This was kind of my thought... the system has been designed so heavily toward this paradigm that it would require a major rework of not only this process but many other processes in the system, and it's not my system. I'm working on a deadline, and the people I am passing it off to need to be able to maintain it.

      I would love to shove this off into a database, cuz I love databases, but unfortunately I don't see this as being the way to go.

      The thing I am playing with right now is preprocessing all the holdings files ahead of time and then traversing the shell once, populating all the reports in one big hit... I think it will probably yield the most favorable results without completely disrupting their entire system.
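A minimal sketch of that plan, with made-up data shapes: invert the holdings files into a key-to-reports map up front, then stream the shell once, handing each record to every report that wants it.

```perl
use strict;
use warnings;

# Preprocess all holdings into key => [report ids], then feed every
# report from a single traversal of the shell.
sub run_all_reports {
    my ($shell, $holdings_by_report) = @_;   # shell: [[key, data], ...]
    my %wanted;                              # key => [reports needing it]
    while (my ($report, $keys) = each %$holdings_by_report) {
        push @{ $wanted{$_} }, $report for @$keys;
    }
    my %report_data;                         # report => [matching records]
    for my $rec (@$shell) {                  # single pass over the big file
        my $subscribers = $wanted{ $rec->[0] } or next;
        push @{ $report_data{$_} }, $rec for @$subscribers;
    }
    return \%report_data;
}
```

The shell is touched exactly once regardless of how many reports there are, which is where the win over per-report searching comes from.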

                      - Ant
                      - Some of my best work - (1 2 3)

Re: Speeding up data lookups
by elwarren (Curate) on Sep 19, 2005 at 20:00 UTC
    Using a database (properly) will no doubt speed up your searches. The database is there to sort, index, and cache your information. Everyone has pretty much covered this already. The only performance penalty I see will depend on how often you need to load new shell files into the database and how long it takes to index them. It might not be the fastest solution if they are completely replaced by new files a few times a day. On the other hand, if you have time to load and index that 700MB file once, you'll see a big improvement in running reports against it...

Re: Speeding up data lookups
by BerntB (Deacon) on Sep 19, 2005 at 23:23 UTC
    I am thinking that if the identifier list I am working from is also in order, there is a lot of opportunity for speeding up a binary search with some custom code. For instance if I see identifier 8000, I know that no further identifiers will be below 8000, so I can search only from there forward. I could also probably compare the two keys and guess how far forward I should set my midpoint to try to shorten my search.
    Comparing ID strings or numbers is fast. If you do ten compares with binary search, you have found the 1/1000th part of your data which contain what you look for.

    Can you really guess so well that you get to within 0.1% of where you wanted, in code that is faster than ten compares? Maybe. The people recommending profiling were certainly right. Speed is often not intuitive. (Their advice to find the slow part first and optimize that is certainly good.)

    Otherwise, you know the quote?

    "Almost all programming can be viewed as an exercise in caching"

    Your idea to just pull it into memory (if it fits) sounded good. But it is hard to give specific advice without more specific info.

    Any solution to your problem will probably be based on doing some early handling of your data (or something else) that depends upon some details. You need to give more info, I believe.

Re: Speeding up data lookups
by sgifford (Prior) on Sep 20, 2005 at 02:37 UTC
    Another thought: if you have a large number of queries working against the exact same data, you could consider indexing the queries instead of the data, then make a single pass over the data to see which records match which queries, and send those records to the appropriate place. This is an idea from streaming databases, but it might work for you.

    For example, if your queries are all for ranges of records in your file, you could index which queries start and stop at which records, then inspect each record, see if any queries should start or stop on that record, then send it to all active queries. I'm sure real life won't be so simple, but you get the idea.
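A toy version of that streaming idea, assuming queries really are contiguous record ranges of the form [id, first_record, last_record]:

```perl
use strict;
use warnings;

# Index the queries by where they start and stop, then sweep the
# records once, keeping the set of currently-active queries and
# delivering each record to all of them.
sub sweep {
    my ($nrecords, $queries) = @_;
    my (%start, %stop);
    for my $q (@$queries) {
        my ($id, $first, $last) = @$q;
        push @{ $start{$first} }, $id;
        push @{ $stop{$last} },  $id;
    }
    my (%active, %hits);
    for my $rec (0 .. $nrecords - 1) {
        $active{$_} = 1 for @{ $start{$rec} || [] };
        push @{ $hits{$_} }, $rec for keys %active;   # deliver record
        delete @active{ @{ $stop{$rec} || [] } };     # retire finished
    }
    return \%hits;                                     # id => [records]
}
```

The work is one pass over the data plus one delivery per match, independent of how many queries exist, which is the property that matters when there are a thousand reports.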

What do the data look like?
by pdcawley (Hermit) on Sep 20, 2005 at 08:36 UTC
    Until we know what your data look like, we're whistling in the dark.
Re: Speeding up data lookups
by Anonymous Monk on Sep 20, 2005 at 12:46 UTC
    Would Memoize help?
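For reference, Memoize (a core module) caches a function's return values keyed on its arguments, so it only pays off if the same lookups actually repeat. A self-contained sketch with a made-up lookup function:

```perl
use strict;
use warnings;
use Memoize;

# Count real (uncached) invocations to show the caching effect.
my $calls = 0;
sub lookup_security {
    my ($id) = @_;
    $calls++;                    # only incremented on a cache miss
    return "record-for-$id";     # pretend this was a binary search
}
memoize('lookup_security');

lookup_security(8000) for 1 .. 3;   # only the first call does work
```

If most identifiers are looked up once per run, this buys nothing and costs memory; it only helps when the access pattern has repeats.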
Re: Speeding up data lookups
by raafschild (Novice) on Sep 21, 2005 at 16:44 UTC
    Some thoughts:

    Are the processes CPU-bound or IO-bound? If IO-bound, you might consider making the file system cache bigger (as long as you're not paging). If you can also arrange that all reports that use the same shell run after each other (or 8 at a time, to use all procs), you can be pretty sure that all reports will find the data cached in memory. You will still have the overhead of going through the OS for each read.

    Is it true that a holdings file defines a subset of the shell file? If so, is it possible to split up the shell file into several subset files, and then run all the reports on the appropriate subset files? The report processes then don't have to worry about iterating through the data or using a binary search; they need all of it.
    This is most efficient if several reports need the same subset.
    You will need at least twice the disk space, depending on how much overlap there is between the holdings files.
    It should be possible to create all the subset files with a single pass through the shell file, as long as you don't run into the maximum number of open files per process. That depends on the number of subset files you need to create.
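The single-pass split might be sketched as below. To keep the sketch self-contained it writes to in-memory strings, but `open($fh, '>', $path)` on disk is the same shape; the one-key-to-one-subset map is a simplification (a key could belong to several subsets), and the line format is made up.

```perl
use strict;
use warnings;

# One pass over the shell, fanning each record out to the subset
# file that needs it.  Filehandles are opened once and cached.
sub split_shell {
    my ($shell_lines, $subset_of) = @_;   # $subset_of: key => subset name
    my (%fh, %out);
    for my $line (@$shell_lines) {
        my ($key) = split /\s+/, $line;
        my $subset = $subset_of->{$key} or next;
        unless ($fh{$subset}) {           # open each subset file once
            open $fh{$subset}, '>', \$out{$subset} or die $!;
        }
        print { $fh{$subset} } $line, "\n";
    }
    close $_ for values %fh;
    return \%out;                         # subset name => contents
}
```

The cached-filehandle hash is also where the per-process open-file limit would bite: with very many subsets you'd need to close and reopen in append mode, or raise the limit.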


Node Type: perlquestion [id://493212]
Approved by kirbyk
Front-paged by friedo