Bucketing, Slicing and Reporting data across multiple dimensions
by Voronich (Hermit)
on Aug 17, 2011 at 15:20 UTC
Voronich has asked for the
wisdom of the Perl Monks concerning the following question:
tye: this isn't usenet. It doesn't have to be "perl specific". (Careful what you ask for ;))
I've got some data, and I'm having some trouble figuring out how to effectively report on it.
The source data itself:
I have a dataset (let's call the file "dataset.dat".) All numbers and data have been falsified. Here's a slice:
What this represents is two giant test analytics runs. The first and second columns represent the test inputs. The third and fourth are the outputs of the baseline run and the "test" run. The fifth is the simple difference between them.
In the real data there are 15,000 foos permuted into 550 IDs. For our purposes the list of Foos is precisely the same between runs (i.e. differences have been slurped out of the file.)
The problem I'm trying to solve:
The first question was: Which Foos show impact between the two runs?
That's all well and good.
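For readers following along, that first question can be sketched roughly as below. This is only an illustration, not the search actually used: it assumes whitespace-separated rows of foo, ID, baseline, test and difference, and both the `impacted_foos` name and the sample rows are made up.

```python
import io

def impacted_foos(lines):
    """Return the sorted Foos having at least one row with a nonzero difference."""
    hits = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        foo, _id, _baseline, _test, diff = fields
        if float(diff) != 0.0:
            hits.add(foo)
    return sorted(hits)

# Made-up rows for illustration only: foo, id, baseline, test, difference
SAMPLE = """\
foo1 IDa 10.0 10.0 0.0
foo2 IDa 10.0 12.0 2.0
foo2 IDb 20.0 25.0 5.0
foo3 IDa 5.0 5.0 0.0
foo3 IDd 8.0 6.0 -2.0
"""

print(impacted_foos(io.StringIO(SAMPLE)))  # ['foo2', 'foo3']
```

For the real file, the same function would be given `open("dataset.dat")` in place of the `StringIO` object.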
What is misleading about the result of this search is that there's no way at this level to distinguish between the "foo3" with a one-shot impact in "IDd" and "foo2" which has impact everywhere it appears.
So there are two additional dimensions of analysis which are important.
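The post does not spell those dimensions out. One plausible reading, given the foo3 versus foo2 contrast, is "how many of a Foo's IDs are impacted at all". A minimal sketch of that per-Foo summary follows, assuming the rows are already parsed into tuples; `impact_spread` and the sample rows are hypothetical.

```python
from collections import defaultdict

def impact_spread(rows):
    """Map each Foo to (impacted_ids, total_ids, fraction_impacted)."""
    total = defaultdict(int)
    impacted = defaultdict(int)
    for foo, _id, _baseline, _test, diff in rows:
        total[foo] += 1
        if diff != 0:
            impacted[foo] += 1
    return {foo: (impacted[foo], n, impacted[foo] / n) for foo, n in total.items()}

# Made-up rows for illustration only: (foo, id, baseline, test, difference)
ROWS = [
    ("foo2", "IDa", 10.0, 12.0, 2.0),
    ("foo2", "IDb", 20.0, 25.0, 5.0),
    ("foo3", "IDa", 5.0, 5.0, 0.0),
    ("foo3", "IDb", 7.0, 7.0, 0.0),
    ("foo3", "IDc", 9.0, 9.0, 0.0),
    ("foo3", "IDd", 8.0, 6.0, -2.0),
]

for foo, (hit, n, frac) in sorted(impact_spread(ROWS).items()):
    print(f"{foo}: {hit}/{n} IDs impacted ({frac:.0%})")
```

Sorting the result by the fraction would separate the one-shot Foos from the ones that are impacted everywhere they appear.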
I can see a grid with buckets of percent impact across (say, 20 columns of 5% slices), then percent buckets of "percentages of IDs thusly impacted." But my concern is that it then becomes too abstract to be useful.
I'm just lost in the mire of this stuff.
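To make the grid idea above concrete, here is a rough sketch under several assumptions that are not in the post: "percent impact" is taken as |difference| / |baseline|, impacts of 100% or more share the last column, rows with a zero baseline are skipped, and each cell counts Foos. The names `bucket` and `impact_grid` are made up for illustration.

```python
from collections import Counter, defaultdict

BUCKETS = 20             # 20 slices of 5% each
WIDTH = 100 / BUCKETS    # width of one slice, in percent

def bucket(pct):
    """Index of the 5% slice holding pct; 100% or more lands in the last slice."""
    return min(int(pct // WIDTH), BUCKETS - 1)

def impact_grid(rows):
    """grid[r][c] = number of Foos with an r-slice share of their IDs impacted by a c-slice percent."""
    total_ids = Counter()
    per_foo = defaultdict(Counter)
    for foo, _id, baseline, _test, diff in rows:
        total_ids[foo] += 1
        if diff == 0 or baseline == 0:
            continue  # assumption: unimpacted and zero-baseline rows are left out
        pct = 100.0 * abs(diff) / abs(baseline)
        per_foo[foo][bucket(pct)] += 1
    grid = [[0] * BUCKETS for _ in range(BUCKETS)]
    for foo, cols in per_foo.items():
        for col, n in cols.items():
            share = 100.0 * n / total_ids[foo]  # percent of this Foo's IDs in this column
            grid[bucket(share)][col] += 1
    return grid

# Made-up rows for illustration only: (foo, id, baseline, test, difference)
ROWS = [
    ("foo2", "IDa", 10.0, 12.0, 2.0),
    ("foo2", "IDb", 20.0, 25.0, 5.0),
    ("foo3", "IDa", 5.0, 5.0, 0.0),
    ("foo3", "IDb", 7.0, 7.0, 0.0),
    ("foo3", "IDc", 9.0, 9.0, 0.0),
    ("foo3", "IDd", 8.0, 6.0, -2.0),
]

grid = impact_grid(ROWS)
for r, row in enumerate(grid):
    for c, count in enumerate(row):
        if count:
            print(f"{count} Foo(s): impact slice {c}, share-of-IDs slice {r}")
```

Printing only the non-empty cells, as here, may help with the "too abstract" worry: with 15,000 Foos most of the 400 cells are likely to be empty or tiny, so the populated ones are the ones to drill into.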
EPILOGUE: I did end up going with a heavily permuted version of blue_cowdog's solution (thanks again o/ ) since the data itself isn't really continuous enough for 'clustering' that would be revealed by a graphic solution to make much sense. (Though I'm morally obligated as a nerd to noodle around with roboticus and pvaldes' ideas. Thanks for those too o/.)
What ended up happening is this: Friday night at 5:30 I was working from home, running one more cross-section required for audit verification of the release that was already under way, when I suddenly couldn't find the data. (I had been working in a local workspace and went back to the server for a couple more gigs of data to sift through.)
Turns out, a n00b in Houston decided that the errors he was getting while running his reports were due to disc space. So he, without so much as a peep, deleted everything... just nuked the whole tree.
The upshot of this is I actually have to start from scratch.