http://www.perlmonks.org?node_id=689265
chexmix has asked for the wisdom of the Perl Monks concerning the following question:

UPDATE FIRST:

I reworked this program and significantly improved performance. There were some mysterious discrepancies in the result set between the old version and the new on one run, but I believe I have those 'figured out.'

Partial profiles of new and old follow. I am cautiously considering this a success:

First version of program:

time elapsed (wall):   1473.9343
time running program:  1473.2193  (99.95%)
time profiling (est.): 0.7150  (0.05%)
number of calls:       59722

%Time    Sec.     #calls   sec/call  F  name
92.33 1360.2230     2427   0.560454     DBI::st::execute
 3.64   53.5727     2027   0.026430     main::process_x
 3.58   52.7029     2007   0.026260     main::process_y
 0.15    2.2193        1   2.219282     Term::ReadKey::ReadLine
 0.10    1.4189        0   1.418933  *  <other>
 0.06    0.8885    24294   0.000037     DBI::st::fetchrow_array

Revised program:

time elapsed (wall):   408.6156
time running program:  408.2747  (99.92%)
time profiling (est.): 0.3409  (0.08%)
number of calls:       32883

%Time    Sec.     #calls   sec/call  F  name
70.21  286.6553      510   0.562069     DBI::st::execute
24.20   98.7912     4034   0.024490     main::process
 4.79   19.5629        1  19.562895     Term::ReadKey::ReadLine
 0.27    1.1126        0   1.112580  *  <other>
 0.16    0.6666    20460   0.000033     DBI::st::fetchrow_array

NOW ON TO THE ORIGINAL POST ...

Good morning Monks -

The poet Charles Olson once wrote, memorably:

I have had to learn the simplest things
last. Which made for difficulties.

This kind of sums up my situation vis-a-vis Perl, I think. I have been flummoxed for the past few days: my lack of substantive CS background has (once again) been chewing a hole in my ... er, back.

This post is in a sense a followup to my earlier post about profiling, and yet isn't about DBI at all, but more about data structures.

I have found that I can essentially grab ALL the data I need to process (for the task outlined in the previous post) with ONE database call per line of input. What comes down from that series of calls looks like this:

21 DET-2 896.657564735788 678.83860967799
21 DET-3 32.0939023018969 621.656550474314
21 DET-3 42.0741462550974 834.842294892622
21 DET-3 218.814294809857 450.606540154849
21 DET-3 228.88830316475 625.939190221948
21 DET-3 630.472705847461 220.839350101088
21 DET-5 152.988115061449 156.31861287082
21 DET-5 730.997702224652 507.421683707195
21 DET-6 506.364456847517 587.275663167673
21 DET-6 573.109998216762 116.126667780714
21 DET-6 885.306844616344 411.352928714465
21 DET-6 959.150025915228 845.316911114704
21 DET-7 62.7170088137102 593.424801945024
21 DET-7 110.245168119381 788.219885220784
21 DET-7 159.254569896235 386.365906980404
21 DET-7 377.53529067825 163.659365696494
21 DET-7 736.734267414092 129.235251032426
21 DET-7 836.081539763363 401.860540038111
21 DET-8 736.566372536132 247.410290038796
47 DET-7 189.488040387042 500.316501378612
47 DET-7 251.972954527148 519.649226713148
71 DET-7 188.133043154801 499.94217650742
71 DET-7 251.06636137579 519.007465693828
88 DET-0 0.70684189743067 391.883292824418
88 DET-0 114.871177986263 212.959076023136
88 DET-0 219.421725079137 710.314439572696
88 DET-0 257.837516726887 594.376577764894
88 DET-1 119.630462310966 260.433234269099
...

In each line, the first value is an "observation number," the second a "detector number" and the third and fourth values are the x and y coordinates of actual "hits" on the detectors.
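To make the field layout concrete, here is a minimal sketch that splits one record from the sample above into its four parts. The variable names ($obs, $det, $x, $y) are only illustrative and are not taken from the original program.

```perl
use strict;
use warnings;

# One record, copied from the sample data above.
my $line = '21 DET-2 896.657564735788 678.83860967799';

# split ' ' splits on runs of whitespace and ignores leading whitespace.
# Field order: observation number, detector, x coordinate, y coordinate.
my ($obs, $det, $x, $y) = split ' ', $line;

printf "observation %d, detector %s, hit at (%.3f, %.3f)\n",
    $obs, $det, $x, $y;
```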

I have trimmed this sample from the roughly 19,000 lines, but wanted to leave enough to show that:

  1. There are multiple lines where the first item (the "observation number") is the same;
  2. For each observation number, there are multiple lines where the second item (the "detector number") is the same.

So I have been facing the roaring Godzilla that is my lack of experience with data structures, and trying to figure out what might be the best structure I could put this in for processing ...

My first attempt was a hash of arrays, which yielded something like this ...

21 => DET-2, 896.657564735788, 678.83860967799, DET-3, 32.0939023018969, 621.656550474314, DET-3, 42.0741462550974, 834.842294892622, DET-3, 87.5412177704422, 684.850417188863, DET-3, 92.9823463716063, 216.339020594075, DET-3, 175.151394732114, 525.441189179707, DET-3, 218.814294809857, 450.606540154849, DET-3, 228.88830316475, 625.939190221948, DET-3, 630.472705847461, 220.839350101088, DET-5, 152.988115061449, 156.31861287082, DET-5, 730.997702224652, 507.421683707195, DET-6, 784.608063532865, 688.699410601935, DET-6, 885.306844616344, 411.352928714465, DET-6, 959.150025915228, 845.316911114704, DET-7, 62.7170088137102, 593.424801945024,
47 => DET-7, 189.488040387042, 500.316501378612, DET-7, 251.972954527148, 519.649226713148,
71 => DET-7, 188.133043154801, 499.94217650742, DET-7, 251.06636137579, 519.007465693828,

... note: this data may not quite agree with the sample above; I am cutting for clarity, and this is mostly for illustration.
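The code that produced this dump is not shown, so the following is only a guess at its shape: a hash keyed on observation number whose values are flat arrays of detector, x, y triples. The name %hoa and the handful of input lines are assumptions for illustration.

```perl
use strict;
use warnings;

# A few records copied from the sample data.
my @lines = (
    '21 DET-2 896.657564735788 678.83860967799',
    '21 DET-3 32.0939023018969 621.656550474314',
    '47 DET-7 189.488040387042 500.316501378612',
    '47 DET-7 251.972954527148 519.649226713148',
);

# Hash of arrays: observation number => flat list of (detector, x, y).
my %hoa;
for my $line (@lines) {
    my ($obs, $det, $x, $y) = split ' ', $line;
    push @{ $hoa{$obs} }, $det, $x, $y;
}

# Reading it back means stepping through each array three items at a time,
# and the detector name is repeated for every hit.
for my $obs (sort { $a <=> $b } keys %hoa) {
    my @flat = @{ $hoa{$obs} };
    for (my $i = 0; $i < @flat; $i += 3) {
        my ($det, $x, $y) = @flat[ $i .. $i + 2 ];
        print "$obs $det $x $y\n";
    }
}
```

The stride-of-three loop is what makes this layout feel "not processed enough": nothing in the structure itself groups the hits by detector.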

But it at least looks like this is not processed enough, because of those repeated "DET" values, and that what I really want is to "deepen" the structure one more level, to "pull out" as it were the detector numbers. And it is here that I get stuck, both in terms of "what would be best" and "how do I do that?"

Even perldsc only goes so far in terms of complexity.

At first I thought "it must be a hash of hashes of arrays that I want," and I uncovered this node showing how to create such a thing. BUT to be quite honest, I didn't or couldn't or can't or currently am not able to truly grok the solutions presented at that node. And IS this the best structure for me?
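For reference, one way such a structure could be built is sketched below. It is a sketch under assumptions, not a claim that this is the best layout: the name %hits is hypothetical, and each hit is stored as a small [x, y] array reference, so strictly speaking this is a hash of hashes of arrays of arrays.

```perl
use strict;
use warnings;

# A few records copied from the sample data.
my @lines = (
    '21 DET-2 896.657564735788 678.83860967799',
    '21 DET-3 32.0939023018969 621.656550474314',
    '21 DET-3 42.0741462550974 834.842294892622',
    '47 DET-7 189.488040387042 500.316501378612',
    '47 DET-7 251.972954527148 519.649226713148',
);

# observation => detector => list of [x, y] pairs.
# Perl autovivifies the inner hash and array the first time they are used,
# so no explicit initialisation is needed.
my %hits;
for my $line (@lines) {
    my ($obs, $det, $x, $y) = split ' ', $line;
    push @{ $hits{$obs}{$det} }, [ $x, $y ];
}

# Walk the structure: every detector within every observation.
for my $obs (sort { $a <=> $b } keys %hits) {
    for my $det (sort keys %{ $hits{$obs} }) {
        my $points = $hits{$obs}{$det};    # array ref of [x, y] pairs
        printf "obs %s, %s: %d hit(s)\n", $obs, $det, scalar @$points;
    }
}
```

With this layout, the x and y points for one detector in one observation are reachable directly, for example as @{ $hits{21}{'DET-3'} }, without scanning past repeated detector names.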

So my questions, I am afraid, are three, which is perhaps a function of the lack of clarity in my thinking:

  1. Given that I want to study the distribution of these x and y points on each detector in each observation, what would be the best structure for this data?
  2. What do I study so I can better see such things (e.g. how do I get there from here)? Are there general books on data structures, or will this knowledge just come with experience?
  3. Is it okay (I know, "according to whom?") to utilize solutions/tools/schemata one does not yet truly understand, and hope for enlightenment to come later?
  4. There is no number 4. Note I am studiedly avoiding asking "so, how do I create the structure suggested by (1)?"

Apologies for the length of this post. I hope there is something of interest in it. I am, once again, feeling stuck and frustrated. I know it's no one's responsibility to help me out of my thought ditch, but if anyone has any maps to recommend, I would be grateful.

Regards,

An extremely humble Monk