$sql = "select seg_dtl_id from segmentation where seg_id = '1'";
$rs = $db->query($sql);
Since this doesn't depend on any data, there's no need to do it all over again in the loop. Do it once outside the loop. Also figure out a way to write a single SQL query that does everything you need (shouldn't be hard with a JOIN), so that you don't have to talk to the database so often.
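A minimal sketch of both points (hoisting the query out of the loop and combining the lookups with a JOIN), assuming DBI and the table names used elsewhere in this thread; the connection details are placeholders, and this is untested:

```perl
use strict;
use warnings;
use DBI;

# Connection details here are placeholders -- substitute your own.
my $dbh = DBI->connect('dbi:mysql:database=mydb', 'user', 'pass',
                       { RaiseError => 1 });

# Prepare once, outside any loop, with a placeholder for seg_id.
my $sth = $dbh->prepare(q{
    select seg.seg_dtl_id, dtl.pan_id
    from segmentation seg
    join seg_dtl_panelist dtl on dtl.seg_dtl_id = seg.seg_dtl_id
    where seg.seg_id = ?
});

$sth->execute('1');
while (my ($dtl_id, $pan_id) = $sth->fetchrow_array) {
    # ... process each row here ...
}
```

Preparing once and executing with a bound value also lets the server reuse the parsed statement if you ever do need to run it for several seg_ids.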
Naveen_learn:
I'd suggest learning SQL a little better so you can make the database do the part it's good at. In particular, start thinking about how to use join and think about data retrieval as a set operation rather than a procedural operation. If you do so, you could execute a single SQL statement and process it something like this (untested):
# Get all combinations of (seg_dtl_id,pan_id) for seg_id=1
$sql = <<EOSQL;
select seg.seg_dtl_id, dtl.pan_id
from segmentation seg
join seg_dtl_panelist dtl on dtl.seg_dtl_id=seg.seg_dtl_id
where seg.seg_id='1'
EOSQL
$rs = $db->query($sql);
# Remember the previous dtl_id so we can tell when to start a new line
my $prev_dtl_id;
foreach (@$rs) {
    my ($dtl_id, $pan_id) = @$_;
    if (!defined $prev_dtl_id or $dtl_id ne $prev_dtl_id) {
        # Only start a new line if we're not on the first dtl_id
        print FILE "\n" if defined $prev_dtl_id;
        print FILE "$dtl_id: ";
        $prev_dtl_id = $dtl_id;
    }
    print FILE "$pan_id|";
}
You can get fancier and get the SQL server to give you counts, averages, max/min values, subtotals, etc. If you spend a bit of time learning SQL, you'll find a lot of these sorts of operations easier and faster.
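For example, the panelist count per detail group can be computed entirely on the server instead of counting rows in Perl. An untested sketch, using the column and table names from the original post:

```sql
-- Count of pan_ids per seg_dtl_id for seg_id 1, largest group first
select seg_dtl_id, count(pan_id) as pan_count
from seg_dtl_panelist
where seg_id = '1'
group by seg_dtl_id
order by pan_count desc
```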
Why is your code inefficient? Primarily because you're executing more queries than you need to, leading to:
- The SQL server needs to parse a query for each data element.
- Perl has to communicate with the SQL server for each query and result set, leading to *multiple* round trips over the network for each data element.
- The SQL server, which is optimized for the task of data retrieval and joining tables, is being ignored; you're doing the work yourself in Perl.
Doing it as a single query gives you one round-trip to parse/execute the query, and since the query returns all your results, it can bundle multiple rows per network packet, greatly reducing the amount of network traffic. (That may be a benefit for your other systems, as well!)
Update: Just in case you want to see how much network traffic could potentially be saved, here are some fake numbers, loosely based in reality.
For this calculation, I'm assuming it takes one round trip to execute the query, and one round trip to fetch a result. Since you're fetching back only a single value from the database each time, I'm going to further assume that each result set could hold 10 values at a time.
|        | # Dtl IDs | # PAN IDs/Dtl ID | Your I/Os | My I/Os |
| Case 1 |        10 |               10 |       202 |      21 |
| Case 2 |       100 |               10 |     2,011 |     201 |
| Case 3 |        10 |              100 |     2,002 |     201 |
| Case 4 |       100 |              100 |    20,011 |   2,001 |
So with these assumptions, it looks like you'll have 10 times the network traffic with your code. The dominant factor is the number of data values returned per network transmission, which I conservatively put at 10. I expect that it would be greater than that, with a corresponding increase in efficiency between the two methods.
Someone may want to check the calculations and/or method. For the "My I/Os" column, I have 1 round trip for the query, plus (# elements per row)*(# Dtl IDs)*(# PAN IDs)/(# values per packet) round trips for the results. For the "Your I/Os" column, I start with 2 round trips when # Dtl IDs is 10 (1 for the query and 1 to fetch the list of Dtl IDs) and 11 when # Dtl IDs is 100, and add 2*(# Dtl IDs)*(# PAN IDs) to it (where 2 is 1 round trip for the query, and 1 for getting the result).
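Those numbers are easy to reproduce. A small script, using the same assumptions (10 values per packet, 2 values per result row):

```perl
use strict;
use warnings;

my $per_packet = 10;    # assumed values per network round trip
for my $dtls (10, 100) {
    for my $pans (10, 100) {
        # Original code: 1 query + enough fetches for the dtl IDs,
        # then (query + fetch) for every (dtl, pan) pair.
        my $yours = 1 + $dtls / $per_packet + 2 * $dtls * $pans;
        # Single-query version: 1 query, then 2 values per row,
        # $per_packet values per round trip.
        my $mine = 1 + 2 * $dtls * $pans / $per_packet;
        printf "%3d dtl x %3d pan: yours=%6d mine=%5d\n",
            $dtls, $pans, $yours, $mine;
    }
}
```

Running it reproduces the 202/21, 2,011/201, 2,002/201, and 20,011/2,001 figures from the table.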
...roboticus
When your only tool is a hammer, all problems look like your thumb.
I haven't worked with SQL for a while, but are you sure this is correct?
select max(count(pan_id)) from seg_dtl_panelist where seg_id ='1' group by seg_dtl_id, seg_id
This seems wrong: max(count(pan_id)), as count will already return a single value.
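For what it's worth, count returns one value per group here because of the GROUP BY, and some databases (Oracle, notably) do accept the nested max(count(...)) form. If yours doesn't, a derived-table subquery is a more portable way to express the same thing (untested sketch):

```sql
-- Largest panelist count among the detail groups for seg_id 1
select max(pan_count)
from (
    select count(pan_id) as pan_count
    from seg_dtl_panelist
    where seg_id = '1'
    group by seg_dtl_id
) counts
```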