http://www.perlmonks.org?node_id=689341


in reply to structuring data: aka walk first, grok later

I've read this post through five times (as I did the previous one) and I get the same feeling each time. You're asking "What's the best data structure?", but until we know how you are going to process the data, that's an impossible question to answer. The right data structure for a given task always depends upon the details of the task.

In A profiling surprise ... you said:

The program then crunches over the 'virtual CCDs' looking for anomalous 'hit pileups,' #s of 'hits' that go over a certain predefined threshold and _may_ indicate some sort of technical problem with that detector (but which at least deserve closer scrutiny). It does this by passing a sliding window a certain # of px wide over the virtual CCD, in both the x and y axes, and reporting back the # of hits within the window, comparing the count, then, to the established threshold.

But that, especially when combined with the sample data above, leaves so many unanswered (and as yet, unasked) questions about the nature of this data, and the processing you need to perform on it, that it makes me think that the replies so far are entirely premature.

Some questions that I think need answering before any good reply can be made.

  1. What are the datapoints in the sample data? What do they represent?

    For example: As far as I know, CCD means 'charge-coupled device'. These are (basically, very simplified for discussion purposes only), XxY grids of cells that accumulate a charge proportional to the number of photons that hit them during exposure. These (analogue) charges can then be 'read' from the grid, cell by cell, via an n-bit a/d converter to produce an XxY array of values.

    You mention that XxY is 1024x1024 pixels. On a digital device you cannot have partial pixels. So how come your dataset has X:Y pairs like: 896.657564735788 678.83860967799?

    And where are the charge values?

  2. Why are you building a data structure in the first place?

    Might sound like a dumb question, but from your description (and ignoring the decimal pixels anomaly for now), it sounds like you ought to be able to process the data line-by-line to produce the output you need, and thereby avoid having to "build a data structure from the input data" at all.

    If that is the case (and we'd need a far better description of the actual processing required to say for sure), then the more important problem to solve is: what data structure is required to accumulate and present the results of the processing.

    For example (again, ignoring the decimal pixels anomaly), let's assume that each of your data points represents a 'hit' on a particular (integer) pixel of a given detector during a given observation, and that the purpose of your task is to count the hits, per pixel, per detector, over the accumulated observations.

    In this case, the data structure needed to accumulate that information is fairly obvious. You need an XxY (1024x1024) array of counts, per device (you say 7 detectors in the earlier post, but show 9 (DET-0 thru DET-8) in the sample data).

    A first pass would suggest an AoAoA (7 (or 9) x 1024 x 1024), but as the pixel data seems to be quite sparse, you'd probably save space by using hashes instead of arrays for the X & Y components. And as detector IDs are textual, we can save a bit of effort by using a hash at the base level also.

    So that gives us a HoHoH. This structure makes it easy to accumulate the counts:

    my %dets;
    while( my( $obs, $det, $x, $y ) = getNextRow() ) {
        $dets{ $det }{ int $x }{ int $y }++;
    }

    And that would be pretty much it. You could now perform your 2-dimensional sliding window processing using a few nested loops. But, any example would be pointless, as there is too much speculation in the above as to what you are actually trying to achieve.
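For illustration only, here is a rough sketch of what that sliding-window pass might look like along the x axis for a single detector. Everything here is assumed: the sample hit counts, the window width, and the threshold are made up, and the per-column collapse is only one guess at how the window should aggregate the other axis.

```perl
use strict;
use warnings;

# Hypothetical per-pixel hit counts for ONE detector, as a sparse HoH:
# $counts{x}{y} = number of hits at that pixel (values made up).
my %counts = (
    10 => { 5 => 3, 6 => 2 },
    11 => { 5 => 4 },
    12 => { 7 => 1 },
    50 => { 9 => 1 },
);

my $WIDTH     = 5;    # window width in pixels (assumed)
my $THRESHOLD = 8;    # pileup threshold (assumed)
my $SIZE      = 1024; # detector dimension

# Collapse the 2-D counts to per-column totals for the x-axis pass
# (the y-axis pass would be symmetrical, using per-row totals).
my @colTotal = (0) x $SIZE;
for my $x ( keys %counts ) {
    $colTotal[$x] += $_ for values %{ $counts{$x} };
}

# Slide the window along x, maintaining a running sum of the counts
# it covers; record each window start whose total exceeds the threshold.
my @pileups;
my $sum = 0;
$sum += $colTotal[$_] for 0 .. $WIDTH - 1;
for my $x ( 0 .. $SIZE - $WIDTH ) {
    push @pileups, $x if $sum > $THRESHOLD;
    last if $x + $WIDTH >= $SIZE;
    $sum += $colTotal[ $x + $WIDTH ] - $colTotal[$x];
}

# With the made-up data above, the windows starting at x = 7..10
# (each covering the cluster at x = 10..12) exceed the threshold.
print "window starts over threshold: @pileups\n";
```

The running sum avoids re-counting the whole window at each step, so the pass is O(SIZE) rather than O(SIZE x WIDTH). Whether the window should collapse the full other axis like this, or slide in two dimensions at once, is one of the unanswered questions above.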

Conclusion: I think you are asking the wrong questions and not supplying enough information for us to guess what answers will really help you. I think you are worrying about how to store the data read in, when you may not need to store the data at all. You probably should be more concerned with how to accumulate and store your results, but it's impossible to make good suggestions on the basis of the limited information about the processing you are trying to do, and the nature of that raw data you are starting with.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: structuring data: aka walk first, grok later
by chexmix (Hermit) on May 30, 2008 at 21:40 UTC
    "Might sound like a dumb question, but from your description (and ignoring the decimal pixels anomaly for now), it sounds like you ought to be able to process the data line-by-line to produce the output you need, and thereby avoid having to "build a data structure from the input data" at all."

    And ... yeah. I started to think this today, too. :^| It all could be a case of needless 'complexifying' on my part.

Re^2: structuring data: aka walk first, grok later
by chexmix (Hermit) on May 30, 2008 at 21:37 UTC
    Understood, and thanks. I think I get frozen in the headlights, get all bunched-up mentally, and then post ... well, semi-incoherently. Then thanks to some Monk-ly patience I will get some advice, walk away, keep hammering, and eventually get there ... and often regret the semi-incoherent posts later.

    As is the case here.

    There are other factors (particulars re: my job), but I'll leave it there for now, and ponder how to be more specific and focused with posts in future. :)

Re^2: structuring data: aka walk first, grok later
by chexmix (Hermit) on Jun 05, 2008 at 19:19 UTC
    I'm gonna try to keep this short b/c it seems I get into trouble when I "yammer on".

    In re: the non-integer values for x and y. The full explanation is mathematically forbidding but has to do with the nature of the detecting instrument: coordinate systems are actually transformed somewhat during data processing.

    But to cut it short: in my first version of the program I converted these values to integers anyway (a sanctioned move - I didn't just decide to do that on my own). I muddied the issue here by posting the full precision values in this post.

    In re: what I am trying to do. Essentially I am trying to find places on the detectors where 'hits' represented by the pixel values seem to "bunch up." These 'hits' represent places on the detectors where photons have struck. A number of "hits" in x or y that goes over a predetermined value _may_ indicate something that needs to be looked at more closely (e.g. by human eyes).

    The first version of the program took a list of observation sessions, represented by numbers, as input. For each observation session, it did a database call to find out which of the seven detectors/CCDs were involved.

    THEN, for each detector in that observation, it did a database call to pull in the data for the "hits", populating an array for the x axis and one for the y axis of the detector.

    Then it iterated over those built-up arrays for x and y, kind of doing a histogram in memory (repeat for each detector, then move on to the next observation) ...

    I must emphasize: this approach worked. But it's apparently inefficient, especially in terms of time (total run time: 19 minutes) spent doing db calls. So I figured out how to pull all the data in first. This takes only 2 minutes.

    All the lines of the lump are like this:

    $observation, $detector, $x_coord, $y_coord

    Now I keep getting stuck trying to get the big lump to do what I want:

    ... to give me an array of the x values and an array of the y values for a SINGLE detector in a SINGLE observation. And so on, through the lump, until I am done. I need to examine the DISTRIBUTION of values in x and y axes of each detector, in each observation, individually.
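    One hash-of-hashes sketch that would let you address exactly those subsets (the rows here are made up stand-ins for your real data, and the key names are only illustrative):

```perl
use strict;
use warnings;

# Hypothetical rows of ($observation, $detector, $x, $y), as described.
my @rows = (
    [ 1001, 'DET-0', 896.65, 678.83 ],
    [ 1001, 'DET-0', 12.40,  99.01  ],
    [ 1001, 'DET-3', 510.20, 510.90 ],
    [ 1002, 'DET-0', 3.10,   4.70   ],
);

# $data{$obs}{$det} holds parallel arrays of the x and y values
# for that single detector in that single observation.
my %data;
for my $row (@rows) {
    my ( $obs, $det, $x, $y ) = @$row;
    push @{ $data{$obs}{$det}{x} }, $x;
    push @{ $data{$obs}{$det}{y} }, $y;
}

# Each (observation, detector) subset can now be addressed directly,
# ready for a per-detector distribution/histogram pass.
for my $obs ( sort keys %data ) {
    for my $det ( sort keys %{ $data{$obs} } ) {
        my @xs = @{ $data{$obs}{$det}{x} };
        printf "%s / %s: %d hits\n", $obs, $det, scalar @xs;
    }
}
```

    A single pass over the "lump" builds the structure, and the nested loops then visit each detector of each observation individually, which is the grouping described above.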

    Maybe I should be satisfied with my 19 minute runtime, and leave the data munging / structures alone until I am more experienced ... ? I don't know.

    Do I need a data structure? I don't know that either. It feels like I do, because without one I don't know how to "address" subsets of the lump of data.

    I hope that's clearer, anyway. I don't know why I am so stuck, and I am sorry I am.

      If I understand you correctly, then each datapoint (O,D,X,Y) represents one photon hitting a pixel (X,Y) of a detector (D) during an observation (O). And that pixel may be struck zero, one or many times during a given observation. If it is hit more than once, then there will be multiple, identical (O,D,X,Y) datapoints in the dataset for that detector/observation pairing?

      Is that correct?


        No. Sorry again for my lack of clarity. I was told that each (x, y) pair is something that the system has recorded as a positive id, i.e. it represents a 'thing' that has been recorded as having been observed.

        But due to the nature of the instrument, some or many of these may be spurious: "streaks" on the detector, for example. Such things will show up as a "pileup" of points within a given window (so many pixels wide) in x or y.