Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

A profiling surprise ...

by chexmix (Hermit)
on May 23, 2008 at 18:13 UTC ( #688185=perlquestion: print w/replies, xml ) Need Help??

chexmix has asked for the wisdom of the Perl Monks concerning the following question:

Good pre-long-weekend, Monks -

I recently reached a point with my 2nd big Perl project where it seemed to work okay, so I ran it on the whole dataset I needed to chew through.

It took 19 minutes to complete.

Now, since the scientist I'm working with at one point said "It can run all night if it needs to" I figure 19 minutes may be acceptable - BUT: 19 minutes is pretty much forever anyway, so I decided to look into profiling and optimizing it, since I didn't know much about either.

The code for this piece is a few hundred lines long so I don't want to post the whole thing - I hope it will be sufficient for purposes here to DESCRIBE it, and give a sample of some of the profiling output I'm seeing.

I am interacting with a huge database of astronomical observation info. My program takes as input a list of 'observation numbers.'

For each distinct observation number, there many be 1-7 individual 1024x1024 px CCD detectors involved in the observation. So the first thing the program does is query the database to see which detectors are involved.

Then, for each detector that IS involved, it creates a 'virtual CCD' or histogram of the detector in memory by once again querying the database and getting information about each 'hit' on the detector for that observation number.

Whew ... THEN the fun part begins (or so I thought). The program then crunches over the 'virtual CCDs' looking for anomalous 'hit pileups,' #s of 'hits' that go over a certain predefined threshold and _may_ indicate some sort of technical problem with that detector (but which at least deserve closer scrutiny). It does this by passing a sliding window a certain # of px wide over the virtual CCD, in both the x and y axes, and reporting back the # of hits within the window, comparing the count, then, to the established threshold.

Now for the (to me anyway) mystery. When I initially took this on it seemed that what was wanted was something that would hit the db again and again and again with a sliding window to build up the histogram/virtual CCD ... when it was demonstrated to me that this would put an enormous strain on the database, I decided to build the picture of 'hits' in memory first and then iterate over it (e.g. putting the burden on Perl as opposed to the database). I was still hitting the database multiple times for each observation number, but nowhere near as often as I would have done.

Cough. Okay. So I ran the program on a small set (13) of observation numbers, with Devel::Profile and took a look at the results. I was surprised. Here are the first few lines of prof.out:

time elapsed (wall): 23.7359 time running program: 23.7131 (99.90%) time profiling (est.): 0.0227 (0.10%) number of calls: 1572 %Time Sec. #calls sec/call F name 81.65 19.3624 51 0.379655 DBI::st::execute 6.88 1.6309 1 1.630945 Term::ReadKey::ReadLine 5.15 1.2221 63 0.019399 main::process_x 4.79 1.1370 60 0.018949 main::process_y 0.27 0.0629 0 0.062879 * <other> 0.19 0.0446 1 0.044613 <anon>:sliding_bin:23

... so it is STILL the db work that is making the program slow. Is this just in the nature of the beast, or do these prof.out lines indicate I have dones something horribly wrong (or are these top lines not suff to show that)?

Note, the entire dataset is over 500 observation numbers long. So that is a lot of db calls, and a lot of processing of the results of each call.

Replies are listed 'Best First'.
Re: A profiling surprise ...
by Limbic~Region (Chancellor) on May 23, 2008 at 19:17 UTC
    chexmix,
    IO tends to be an overshadowing bottle neck. This is the nature of the beast. You may want to read:

    I disagree with the other monk in the thread that said to move as much as possible away from the DB. In general, you want to reduce the number of times that you ask the database to do work for you. The work that you do leave for the DB should be work that the DB is good at. For instance, don't sort your records in perl land if it makes sense to use an "order by" clause.

    You should also recognize that DBI is a pretty heavy API even though great work continue to goes into optimizations. The abstraction layer between you and the DB which makes you able to write programs faster also means the program runs slower.

    Without seeing the code nor data, it is hard to provide more than general advice. The code you indicated sounds like it would easily fit into <readmore> tags. If you want specific help - share the code.

    Cheers - L~R

      Limbic~Region,

      Thanks. I will read those linked documents and see how they simmer.

      Silly question regarding the posting of code ... I suppose I need to ask internally where I work, but has there ever been a case of someone posting code here and getting in hot water at work for doing so? I'm probably just being paranoid, but visions of sugar-plum non-disclosure statements are dancing in my head.

Re: A profiling surprise ...
by Narveson (Chaplain) on May 23, 2008 at 18:28 UTC

    Yes, the bottleneck is database interaction.

    As you continue to write Perl programs that interact with a database, this will cease to surprise you.

    You're on the right track: keep thinking of ways to shift the burden away from the database.

      More specifically, shift the IO burden away from the connection between your program and the database.

      • Never ask for more data from the database than your code will actually use. Let the DB winnow it down and return just what is needed. That means using proper queries with WHERE clauses and, if applicable, LIMIT clauses.
      • Don't return unsorted information then sort it, as databases are good at sorting quickly.
      • Use placeholders instead of building new queries from scratch by concatenation, as that helps the DB's execution engine minimize its work (more importantly, using placeholders can make SQL injection attacks much less likely anyway).
      • If you're going to search on a particular column a lot, index it. If you're not searching on it much but you're going to sort on it fairly often, index it anyway.
      • Select a row by primary key if you know it and want just that row.

      I'm sure there are other monks with even better advice, but these will help quite a bit.

      UPDATE: fixed thinko s/WHEN/WHERE/ (thanks, kyle, for pointing that out in a msg).

        For the last I'd rather say, select a row by the clustered index if you know it, it's unique and you want just that row. Not sure other databases use the same terminology as MSSQL, but in MSSQL the clustered index is the index that controls the order in which the data are actually stored so it's the fastest one to use. And it doesn't have to be the primary key.

        I agree completely with the rest!

Re: A profiling surprise ...
by graff (Chancellor) on May 23, 2008 at 21:59 UTC
    For each distinct observation number, there many be 1-7 individual 1024x1024 px CCD detectors involved in the observation ... the entire dataset is over 500 observation numbers long...

    If I understand you correctly, you have a list of 500 data sets, where each set comprises 1 to 7 matrices of 1K x 1K values (bytes? ints?); your test run of 13 "observation numbers" yielded 51 query executions, so maybe there was an average of about 3 matrices per observation? (13 queries to get the lists of detectors per observation, plus 13 * 3 queries to get the matrices for three detectors, would come to 52 queries -- am I on the right track here?)

    You might want to consider separating the data fetching from the actual computation, especially if you plan to be chewing over this particular set of 500 observations for a while (e.g. trying different statistical summaries, grouping or sorting things in different ways, etc).

    Local disk i/o on whatever machine is running the perl script will be a lot faster, and a lot less wear-and-tear on the DB server, than (re)fetching ... what is it? 500 * (3 or 4) * 1K * 1K * (whatever byte count per CCD pixel), so somewhere between 1.5 GB and 4 GB, or something like that?

    Try just storing all those matrices as local data files, and just do that once. Then write a perl script to do stuff with the local data files. You'll have a lot more flexibility in terms of playing with different strategies for computation and organization that way, and completion time for any given approach will be faster.

      "If I understand you correctly, you have a list of 500 data sets, where each set comprises 1 to 7 matrices of 1K x 1K values (bytes? ints?); your test run of 13 "observation numbers" yielded 51 query executions, so maybe there was an average of about 3 matrices per observation? (13 queries to get the lists of detectors per observation, plus 13 * 3 queries to get the matrices for three detectors, would come to 52 queries -- am I on the right track here?)"

      Sortakinda. The main diff is that each chip/matrix/CCD might only have a few 'hits' on it. So what my two DB queries do is:

      1. for each observation number in the list, return the chips/CCDs that actually have hits;
      2. for each chip/CCD involved in that observation number, return the x and y value of each hit on the chip/CCD.

      Hope that makes more sense. The crunching that follows iterates over the result sets ... but is decoupled from the DB calls.

Re: A profiling surprise ...
by runrig (Abbot) on May 23, 2008 at 19:46 UTC
    51 calls to execute() takes 19 seconds? (Without much data, I'm making wild conjectures ...) that seems long, but it sort of makes sense if the database returns all of the results to the client upon execute, and there are a lot of results, and fetch() just iterates through those results. And maybe you need to look at the queries and the indexes on the tables.
Re: A profiling surprise ...
by dragonchild (Archbishop) on May 23, 2008 at 20:35 UTC
    There is an art to making DB-bound applications run efficiently. The short of it is "Index, index, index".

    Now, another solution would be to construct your data such that the hit pileups can be detected as part of a SQL query. This isn't as hard as it sounds. Take a look at the MySQL function COALESCE().


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      Hmm ... walking to and from the barbershop about an hour ago, I was indeed thinking that "perhaps something could be done data-wise to minimize the number of database calls."

      I'll take a look at that MySQL feature ... am stuck with Sybase here, I'm afraid.

      At least I got a haircut. :D

        The key isn't to reduce the number of database calls. It's to reduce the overall impact of those calls. Let's say that it takes 5 seconds to calculate one of these things, of which 4 seconds is finding and retrieving data. Now, let's say you could rewrite how you retrieve the data to take 3 seconds to retrieve and only .5 seconds to work with. So, 5 seconds drops to 3.5 seconds. Now, let's say that you could do 10 of these in the same 3.5 seconds.

        This isn't just idle speculation. Over the years, I've done 99% and even 99.99% improvements in DB-bound programs. A lot of it has to do with how you store the data. Just because you collect it in one form doesn't mean you search it in that form.


        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: A profiling surprise ...
by brian_d_foy (Abbot) on May 24, 2008 at 10:15 UTC

    Now that you know that the database is sucking up a lot of time, profile DBI. See the docs for DBI::Profile or the chapter of profiling in Mastering Perl which also talks about DBI profiling.

    Good luck! :)

    --
    brian d foy <brian@stonehenge.com>
    Subscribe to The Perl Review
Re: A profiling surprise ...
by mpeppler (Vicar) on May 27, 2008 at 10:07 UTC
    The first thing you should do is to check that the DB queries you issue are efficient. With Sybase this means setting the SHOWPLAN and STATISTICS IO options, and checking to make sure that all accesses are done via indexes (no table scans for actual physical tables), and that the query plans make sense.

    The second thing is to see if the data fetch algorithm can be improved to reduce the amount of work needed to find the appropriate data. DB engines are optimized (and incredibly efficient) at finding data given a set of criteria and proper indexes/joins between tables - making use of that power can improve overall speed by several orders of magnitude (just as dragonchild mentioned above)

    Michael

      Mein Gott, I have so much to learn! I waver between being excited about that and being overwhelmed by that ... which is normal, I suppose.

      Thanks!

Re: A profiling surprise ...
by chrism01 (Friar) on May 26, 2008 at 01:21 UTC
    Just a general comment re my experiences with similar issues.
    I discovered that just because a DB has built-in fns to do handy stuff like eg sum(), group by for example, it doesn't necessarily mean they are efficiently coded.
    I actually improved a prog by order of magnitude iirc, by in fact doing relatively simple SELECTS instead, and doing those fns in Perl.
    Its worth testing stuff like that instead of assuming.
    I'd also go with checking/setting indexes as advised above.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://688185]
Approved by superfrink
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (4)
As of 2020-11-28 11:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?