Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change

Re: Need to Improve Performance

by roboticus (Chancellor)
on Jan 03, 2014 at 12:40 UTC ( #1069135=note: print w/replies, xml ) Need Help??

in reply to Need to Improve Performance


I'd suggest learning SQL a little better so you can make the database do the part it's good at. In particular, start thinking about how to use join and think about data retrieval as a set operation rather than a procedural operation. If you do so, you could execute a single SQL statement and process it something like this (untested):

# Get all combinations of (seg_dtl_id,pan_id) for seg_id=1 $sql = <<EOSQL; select seg.seg_dtl_id, dtl.pan_id from segmentation seg join seg_dtl_panelist dtl on dtl.seg_dtl_id=seg.seg_dtl_id where seg.seg_id='1' EOSQL $rs = $db->query($sql); # Remember previous dtl_id so we can tell to start a new line my $prev_dtl_id; foreach (@$rs) { my ($dtl_id, $pan_id) = @$_; if ($dtl_id ne $prev_dtl_id) { # Only start a new line if we're not on first dtl_id print FILE "\n" if defined $prev_dtl_id; print FILE "$dtl_id: "; $prev_dtl_id = $dtl_id; } print FILE "$pan_id|"; }

You can get fancier and get the SQL server to give you counts, averages, max/min values, subtotals, etc. If you spend a bit of time learning SQL, you'll find a lot of these sorts of operations easier and faster.

Why is your code inefficient? Primarily because you're executing more queries than you need to, leading to:

  • The SQL server needs to parse a query for each data element
  • Perl has to communicate with the SQL server for each query and result set, leading to *multiple* round-trips over the network for each data element.
  • The SQL server which is optimized for the task of data retrieval and joining tables is being ignored, and you're doing the work, yourself, in perl.

Doing it as a single query gives you one round-trip to parse/execute the query, and since the query returns all your results, it can bundle multiple rows per network packet, greatly reducing the amount of network traffic. (That may be a benefit for your other systems, as well!)

Update: Just in case you want to see how much network traffic could potentially be saved, here are some fake numbers, loosely based in reality.

For this calculation, I'm assuming it takes one round-trip for executing the query, and one round-trip for fetching a single result. Since you're fetching back only a single value from the database each time, I'm going to further assume that each result set could hold values at a time.

 # Dtl IDs# PAN IDs/DtlIDYour I/OsMy I/Os
Case 1101020221
Case 1100102,011201
Case 1101002,002201
Case 110010020,0112,001

So with these assumptions, it looks like you'll have 10 times the network traffic with your code. The dominant factor is the number of data values returned per network transmission, which I conservatively put at 10. I expect that it would be greater than that, with a corresponding increase in efficiency between the two methods.

Someone may want to check the calculations and/or method. For the "My I/Os" column, I have 1 round trip for the query, and (# elements per row)*(# dtl IDs)*(# PAN IDs)/(# values per packet). For the "Your I/Os" column, I come up with 2 round trips round trips when # Dtl IDs is 10, and 11 when # Dtl IDs is 100 and I add 2*(# Dtl IDs)*(# PAN IDs) to it (where 2 is 1 round trip for the query, and one for getting the result).


When your only tool is a hammer, all problems look like your thumb.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1069135]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (9)
As of 2018-01-18 20:08 GMT
Find Nodes?
    Voting Booth?
    How did you see in the new year?

    Results (214 votes). Check out past polls.