http://www.perlmonks.org?node_id=1069093

Naveen_learn has asked for the wisdom of the Perl Monks concerning the following question:

I have a requirement to print the first record of $sql_pr for each different value of "seg_dtl_id", so that when the while loop ends I have a file with the pan_ids belonging to each "seg_dtl_id" printed in its own column, separated by "|", like below:

    8000|6000|7000
    8001|6001|7001
    8002|6002|7002
    8003|6003|7003
    8004|6004|7004

Here 8000, 8001, 8002, 8003 and 8004 are pan_ids belonging to a single seg_dtl_id, and so forth. I can do it with the code below, but a problem arises with performance, because the code executes the same SQL multiple times. If the number of records is 1 million, the SQL has to run 1 million times, causing performance issues. Please note I am using a basic version of perl (perl5/lib/5.00503); please guide me in reconstructing the code to overcome this problem.

open(file, "> Segmentation.txt");
my $index = 0;
my $n = 0;
my $sql_ct = "select max(count(pan_id)) from seg_dtl_panelist where seg_id = '1' group by seg_dtl_id, seg_id";
my $rs_ct = $db->query($sql_ct);
while ($n < $$rs_ct[0][0]) {
    $sql = "select seg_dtl_id from segmentation where seg_id = '1'";
    $rs = $db->query($sql);
    foreach (@$rs) {
        my $dtl_id = $$_[0];
        $sql_pr = "select pan_id from seg_dtl_panelist where seg_dtl_id = '$dtl_id' and seg_id = '1'";
        $rs_pr = $db->query($sql_pr);   # note: the original post ran $db->query($sql) here, re-running the wrong statement
        print file "$$rs_pr[$index][0]|";
    }
    print file "\n";
    $index++;
    $n++;
}
close(file);

Replies are listed 'Best First'.
Re: Need to Improve Performance
by moritz (Cardinal) on Jan 03, 2014 at 08:10 UTC
    $sql = "select seg_dtl_id from segmentation where seg_id = '1'"; $rs = $db->query($sql);

    Since this doesn't depend on any data, there's no need to do it all over again in the loop. Do it once outside the loop. Also figure out a way to write a single SQL query that does everything you need (shouldn't be hard with a JOIN), so that you don't have to talk to the database so often.
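    Moritz's first point (hoisting the loop-invariant query out of the loop) can be sketched as below. The `MockDB` package is a hypothetical stand-in for the OP's `$db` handle, added only so the sketch runs standalone and can count how often `query` is actually called:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the OP's $db object: same query() interface,
# but it just counts calls and returns canned rows.
package MockDB;
sub new   { bless { calls => 0 }, shift }
sub query { my $self = shift; $self->{calls}++; return [ [1], [2], [3] ] }
sub calls { $_[0]{calls} }

package main;
my $db = MockDB->new;

# Loop-invariant query: run it ONCE, before the loop.
my $rs = $db->query("select seg_dtl_id from segmentation where seg_id = '1'");

for my $n (1 .. 5) {
    # reuse $rs here; no need to re-run the same query on every pass
    for my $row (@$rs) {
        # ... process $$row[0] ...
    }
}

print "queries executed: ", $db->calls, "\n";   # 1 instead of 5
```

    The same structure applies unchanged to the OP's real `$db` handle: any query whose text does not change inside the loop belongs before it.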

Re: Need to Improve Performance
by roboticus (Chancellor) on Jan 03, 2014 at 12:40 UTC

    Naveen_learn:

    I'd suggest learning SQL a little better so you can make the database do the part it's good at. In particular, start thinking about how to use join and think about data retrieval as a set operation rather than a procedural operation. If you do so, you could execute a single SQL statement and process it something like this (untested):

    # Get all combinations of (seg_dtl_id, pan_id) for seg_id=1
    $sql = <<EOSQL;
    select seg.seg_dtl_id, dtl.pan_id
    from segmentation seg
    join seg_dtl_panelist dtl on dtl.seg_dtl_id=seg.seg_dtl_id
    where seg.seg_id='1'
    EOSQL
    $rs = $db->query($sql);

    # Remember the previous dtl_id so we can tell when to start a new line
    my $prev_dtl_id;
    foreach (@$rs) {
        my ($dtl_id, $pan_id) = @$_;
        if (!defined($prev_dtl_id) || $dtl_id ne $prev_dtl_id) {
            # Only start a new line if we're not on the first dtl_id
            print FILE "\n" if defined $prev_dtl_id;
            print FILE "$dtl_id: ";
            $prev_dtl_id = $dtl_id;
        }
        print FILE "$pan_id|";
    }
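    Note that the snippet above prints one line per dtl_id, whereas the OP's sample output has one *column* per dtl_id. If that exact layout is needed, the rows from the single joined query can be pivoted in Perl first. A minimal sketch, with a hypothetical result set hard-coded in place of `$db->query($sql)` so it runs standalone (the `defined ? :` form is used instead of `//`, which doesn't exist in the OP's perl 5.005):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fake (seg_dtl_id, pan_id) rows standing in for the joined query's result.
my $rs = [
    [ 'A', 8000 ], [ 'A', 8001 ], [ 'A', 8002 ],
    [ 'B', 6000 ], [ 'B', 6001 ], [ 'B', 6002 ],
    [ 'C', 7000 ], [ 'C', 7001 ], [ 'C', 7002 ],
];

# Pivot: collect one array of pan_ids per seg_dtl_id, remembering column order.
my (%col, @order);
for my $row (@$rs) {
    my ($dtl_id, $pan_id) = @$row;
    push @order, $dtl_id unless exists $col{$dtl_id};
    push @{ $col{$dtl_id} }, $pan_id;
}

# Emit row i of every column, '|'-separated, as in the OP's sample output.
my ($max) = sort { $b <=> $a } map { scalar @$_ } values %col;
my @lines;
for my $i (0 .. $max - 1) {
    push @lines, join '|', map { defined $col{$_}[$i] ? $col{$_}[$i] : '' } @order;
}
print "$_\n" for @lines;   # 8000|6000|7000 / 8001|6001|7001 / 8002|6002|7002
```

    The trade-off of the pivot is that all pan_ids are held in memory at once; with millions of rows, the line-per-dtl_id layout above streams and is cheaper.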

    You can get fancier and get the SQL server to give you counts, averages, max/min values, subtotals, etc. If you spend a bit of time learning SQL, you'll find a lot of these sorts of operations easier and faster.

    Why is your code inefficient? Primarily because you're executing more queries than you need to, leading to:

    • The SQL server needs to parse a query for each data element
    • Perl has to communicate with the SQL server for each query and result set, leading to *multiple* round-trips over the network for each data element.
    • The SQL server which is optimized for the task of data retrieval and joining tables is being ignored, and you're doing the work, yourself, in perl.

    Doing it as a single query gives you one round-trip to parse/execute the query, and since the query returns all your results, it can bundle multiple rows per network packet, greatly reducing the amount of network traffic. (That may be a benefit for your other systems, as well!)

    Update: Just in case you want to see how much network traffic could potentially be saved, here are some fake numbers, loosely based in reality.

    For this calculation, I'm assuming it takes one round-trip to execute the query, and one round-trip to fetch a single result. Since you're fetching back only a single value from the database each time, I'm going to further assume that each result set could hold 10 values at a time.

             # Dtl IDs   # PAN IDs/Dtl ID   Your I/Os   My I/Os
    Case 1       10             10              202         21
    Case 2      100             10            2,011        201
    Case 3       10            100            2,002        201
    Case 4      100            100           20,011      2,001

    So with these assumptions, it looks like you'll have 10 times the network traffic with your code. The dominant factor is the number of data values returned per network transmission, which I conservatively put at 10. I expect that it would be greater than that, with a corresponding increase in efficiency between the two methods.

    Someone may want to check the calculations and/or method. For the "My I/Os" column, I have 1 round trip for the query, plus (# elements per row)*(# Dtl IDs)*(# PAN IDs)/(# values per packet). For the "Your I/Os" column, I count 2 round trips when # Dtl IDs is 10, and 11 when # Dtl IDs is 100, and add 2*(# Dtl IDs)*(# PAN IDs) to it (the 2 being 1 round trip for the query and one for getting the result).
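    Those formulas can be checked mechanically. A short sketch that recomputes the table above under the same stated assumptions (10 values per result packet, 2 values per joined row; the two helper subs are my names, not anything from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $per_packet = 10;   # assumed values per network packet, as in the text

# OP's method: 1 round trip + 1 packet-fetch per dtl-ID list query,
# then 2 round trips (execute + fetch) for every (dtl, pan) pair.
sub your_ios {
    my ($dtl, $pan) = @_;
    my $setup = 1 + $dtl / $per_packet;   # 2 when dtl=10, 11 when dtl=100
    return $setup + 2 * $dtl * $pan;
}

# Single-query method: 1 round trip to execute, then batched row fetches.
sub my_ios {
    my ($dtl, $pan) = @_;
    return 1 + 2 * $dtl * $pan / $per_packet;
}

for my $case ([10, 10], [100, 10], [10, 100], [100, 100]) {
    my ($dtl, $pan) = @$case;
    printf "%4d dtl x %4d pan: yours=%6d, mine=%5d\n",
        $dtl, $pan, your_ios($dtl, $pan), my_ios($dtl, $pan);
}
```

    This reproduces the 202/21, 2,011/201, 2,002/201 and 20,011/2,001 figures in the table, and makes the roughly 10x ratio visible directly.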

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Need to Improve Performance
by Anonymous Monk on Jan 03, 2014 at 08:39 UTC

    I haven't worked with SQL for a while, but are you sure this is correct?

    select max(count(pan_id)) from seg_dtl_panelist where seg_id ='1' group by seg_dtl_id, seg_id

    This seems wrong: max(count(pan_id)), as count will return a single value.
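    For what it's worth, with a GROUP BY clause count(pan_id) returns one value *per group*, and some databases (Oracle, for one) accept the nested aggregate max(count(...)) as "largest per-group count"; most others would need a subquery instead. What the query computes, written out in plain Perl over hypothetical sample rows:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical (seg_dtl_id, pan_id) rows for seg_id = '1'.
my @rows = (
    [ 'A', 8000 ], [ 'A', 8001 ], [ 'A', 8002 ],
    [ 'B', 6000 ], [ 'B', 6001 ],
    [ 'C', 7000 ],
);

# count(pan_id) ... group by seg_dtl_id: one count per group.
my %count;
$count{ $_->[0] }++ for @rows;

# max(...) over those per-group counts.
my ($max) = sort { $b <=> $a } values %count;
print "max group size: $max\n";   # 3 (group 'A' has three pan_ids)
```

    So the query is meaningful on databases that allow nested aggregates; the OP uses it to learn how many lines the output file needs.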

Re: Need to Improve Performance
by Anonymous Monk on Jan 03, 2014 at 10:25 UTC

    What modules are you using?