
Bucketing, Slicing and Reporting data across multiple dimensions

by Voronich (Hermit)
on Aug 17, 2011 at 15:20 UTC ( #920714=perlquestion )
Voronich has asked for the wisdom of the Perl Monks concerning the following question:

tye: this isn't usenet. It doesn't have to be "perl specific".
(Careful what you ask for ;))

I've got some data, and I'm having some trouble figuring out how to effectively report on it.

The source data itself:

I have a dataset (let's call the file "dataset.dat".) All numbers and data have been falsified. Here's a slice:

IDa,foo1,100,100,0
IDa,foo2,200,301,101
IDa,foo3,300,300,0
IDa,foo4,400,501,101
IDb,foo1,100,100,0
IDb,foo2,200,301,101
IDb,foo3,300,300,0
IDb,foo4,400,500,100
IDc,foo1,200,200,0
IDc,foo2,200,301,101
IDc,foo3,300,300,0
IDc,foo4,400,500,100
IDd,foo1,900,900,0
IDd,foo2,200,301,101
IDd,foo3,300,301,1
IDd,foo4,400,400,0

What this represents is two giant test analytics runs. The first and second columns represent the test inputs. The third and fourth are the outputs of the baseline run and the "test" run. The fifth is the simple difference between them.

In the real data there are 15,000 foos permuted into 550 IDs. For our purposes the list of Foos is precisely the same between runs (i.e. differences have been slurped out of the file.)
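For concreteness, here's a minimal sketch of parsing one such record; the field names (id, foo, baseline, test, diff) are just my labels for the columns described above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one line of dataset.dat into its five fields. The names
# (id, foo, baseline, test, diff) are my own labels for the columns.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    my ($id, $foo, $baseline, $test, $diff) = split /,/, $line;
    return { id => $id, foo => $foo,
             baseline => $baseline, test => $test, diff => $diff };
}

my $rec = parse_record("IDa,foo2,200,301,101");
printf "%s/%s differs by %d\n", $rec->{id}, $rec->{foo}, $rec->{diff};
# prints: IDa/foo2 differs by 101
```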

The problem I'm trying to solve:

The first question was: which Foos show impact between the two runs?

That's trivial:

grep -v ",0$" dataset.dat | awk -F',' '{print $2}' | sort | uniq
foo2
foo3
foo4

That's all well and good.

What is misleading about this result is that there's no way at this level to distinguish between "foo3", which has a one-shot impact in "IDd", and "foo2", which has impact everywhere it appears.
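Counting, per foo, how many IDs show a non-zero diff separates those two cases. A quick sketch against a few hand-picked rows of the sample data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few rows from the sample slice above.
my @rows = (
    "IDa,foo2,200,301,101",
    "IDb,foo2,200,301,101",
    "IDd,foo3,300,301,1",
    "IDd,foo4,400,400,0",
);

# Count, for each foo, how many IDs show a non-zero diff.
my %impacted;
for my $row (@rows) {
    my ($id, $foo, $base, $test, $diff) = split /,/, $row;
    $impacted{$foo}++ if $diff != 0;
}

printf "%s impacts %d ID(s)\n", $_, $impacted{$_} for sort keys %impacted;
# prints:
# foo2 impacts 2 ID(s)
# foo3 impacts 1 ID(s)
```

Run against the full file, a foo with a count near 550 is pervasive; a count of 1 is a one-shot like foo3.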

So there are two additional dimensions of analysis which are important.

  1. What is the distribution of the number of impacts of Foos across IDs.
  2. What is the distribution of the impacts themselves (by percentage buckets) Across Foos.

I can see a grid with buckets of percent impact across the top (say, 20 columns of 5% slices), then buckets of the percentage of IDs thus impacted down the side. But my concern is that it then becomes too abstract to be useful.
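A rough sketch of one axis of that grid, bucketing each foo's per-ID percent impact into 5% slices. Note the percent-impact definition here (diff/baseline x 100) is my reading, not stated above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample rows (id, foo, baseline, test, diff).
my @rows = (
    "IDa,foo2,200,301,101",
    "IDb,foo2,200,301,101",
    "IDd,foo3,300,301,1",
    "IDa,foo3,300,300,0",
);

# Bucket each impacted ID's percent impact (diff/baseline * 100,
# my assumed definition) into 5% slices, per foo.
my %grid;    # $grid{$foo}{$bucket} = number of IDs in that slice
for my $row (@rows) {
    my ($id, $foo, $base, $test, $diff) = split /,/, $row;
    next if $diff == 0;                  # only bucket impacted IDs
    my $pct    = 100 * $diff / $base;
    my $bucket = 5 * int($pct / 5);      # 0-5% -> 0, 5-10% -> 5, ...
    $grid{$foo}{$bucket}++;
}

for my $foo (sort keys %grid) {
    for my $bucket (sort { $a <=> $b } keys %{ $grid{$foo} }) {
        printf "%s: %d%%-%d%%: %d ID(s)\n",
            $foo, $bucket, $bucket + 5, $grid{$foo}{$bucket};
    }
}
```

The second dimension (what fraction of IDs each foo impacts at all) drops out of the same %grid by summing each foo's buckets.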

I'm just lost in the mire of this stuff.

Any ideas?

EPILOGUE: I did end up going with a heavily permuted version of blue_cowdawg's solution (thanks again o/ ), since the data itself isn't really continuous enough for the 'clustering' a graphic solution would reveal to make much sense. (Though I'm morally obligated as a nerd to noodle around with roboticus' and pvaldes' ideas. Thanks for those too o/.)

What ended up happening is this: Friday night at 5:30 I was working from home, running one more cross-section required for audit verification of the release that was already under way, when I suddenly couldn't find the data. (I had been working in a local workspace and went back to the server for a couple more gigs of data to sift through.)

Turns out, a n00b in Houston decided that the errors he was getting running his reports were due to disk space. So, without so much as a peep, he deleted everything... just nuked the whole tree.

The upshot of this is I actually have to start from scratch.



Replies are listed 'Best First'.
Re: Bucketing, Slicing and Reporting data across multiple dimensions
by blue_cowdawg (Monsignor) on Aug 17, 2011 at 15:43 UTC

    This may be an oversimplification, but here is one idea:

    #!/usr/bin/perl -w
    use strict;

    open FIN, "<dataset.dat" or die "dataset.dat: $!";
    my $buckets = {};
    while (my $line = <FIN>) {
        chomp $line;
        my @f = split(",", $line);
        unless (exists($buckets->{$f[1]})) {
            $buckets->{$f[1]} = { count => 0, where => [] };
        }
        if ($f[4] > 0) {
            $buckets->{$f[1]}->{count}++;
            push @{$buckets->{$f[1]}}, $f[0];
        }
    }
    close FIN;
    foreach my $key (sort keys %$buckets) {
        next unless $buckets->{$key}->{count} > 0;
        printf "%s\t%d\t%s\n",
            $key,
            $buckets->{$key}->{count},
            join(",", sort @{$buckets->{$key}->{where}});
    }
    # ####################################################### done

    Caveat: not tested.. do not use for the control of a nuclear missile launch system.

    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
      Thanks very much! This looks remarkably close to what I'm looking for. Debugging now. (It's complaining about that "push" not referencing an array ref.)

        Doh!!! It should read:

        push @{$buckets->{$f[1]}->{where}},$f[0];

        Peter L. Berghold -- Unix Professional
        Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
Re: Bucketing, Slicing and Reporting data across multiple dimensions
by roboticus (Chancellor) on Aug 17, 2011 at 17:53 UTC


    I'd try making a 15000H x 550W bitmap (+ space for rulers on the borders) where the pixel values would show the variation: e.g., use red to represent positive and green for negative variation. That'll give you a quick feel for the data, and you can use the ruler to locate the dataset(s) of interest.
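One way to sketch that bitmap idea with core Perl only: emit a plain-text PGM grayscale image, one row per foo and one column per ID, with brightness proportional to |diff|. (The red/green coloring suggested above would need an image module such as GD, which I'm only assuming is available; grayscale PGM keeps the sketch self-contained.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample rows (id, foo, baseline, test, diff).
my @rows = (
    "IDa,foo1,100,100,0",
    "IDa,foo2,200,301,101",
    "IDb,foo2,200,301,101",
    "IDd,foo3,300,301,1",
);

# Collect |diff| per (foo, id) and the axes of the bitmap.
my (%diff, %ids, %foos);
for my $row (@rows) {
    my ($id, $foo, $base, $test, $d) = split /,/, $row;
    $diff{$foo}{$id} = abs $d;
    $ids{$id}   = 1;
    $foos{$foo} = 1;
}
my @ids  = sort keys %ids;
my @foos = sort keys %foos;
my ($max) = sort { $b <=> $a } map { values %$_ } values %diff;
$max ||= 1;    # avoid dividing by zero when nothing differs

# Plain-text PGM: one row per foo, one column per ID,
# brightness scaled so the largest diff maps to 255.
my $pgm = sprintf "P2\n%d %d\n255\n", scalar @ids, scalar @foos;
for my $foo (@foos) {
    $pgm .= join(' ',
        map { int(255 * ($diff{$foo}{$_} // 0) / $max) } @ids) . "\n";
}

open my $out, '>', 'impact.pgm' or die "impact.pgm: $!";
print {$out} $pgm;
close $out;
```

Any image viewer that understands PGM can display the result; the bright cells are the (ID, foo) pairs with the biggest variation.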


    When your only tool is a hammer, all problems look like your thumb.

      Talk about visual!! That's great!!!

      Peter L. Berghold -- Unix Professional
      Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
        Still too abstract for me, but I think that by "impact quantification" you mean the number in the last column.
        use strict;
        use warnings;
        use GraphViz;

        open DATA, "<", "/my/data" or die $!;

        my $plot = GraphViz->new(
            layout  => 'dot',       # <-- or neato, or circo, etc.
            node    => {
                height   => '0.05',
                shape    => 'box',
                fontsize => '11',
                fontname => 'Times-Roman',
                style    => 'filled',
                color    => 'lightgray',
            },
            bgcolor => 'white',
            center  => 'true',
            dpi     => '1800',      # resolution of the graph; higher = slower
            # ... more general parameters following here ...
        );

        while (<DATA>) {
            # split each record into its fields
            my ($idname, $fooname, $blah, $blah2, $impact) = split /,/, $_, 5;

            # no impact between tests, so we discard this id->foo node
            next if $impact == 0;

            $plot->add_node($idname, label => "$idname");   # create parent node

            # trace an arrow from each id to its foo. I don't remember if
            # "size" is a valid tag; you could instead use color => $mycolor.
            # The goal is that you can easily build a customized %colorhash
            # (key = impact range, value = a custom color: hotter, bluer,
            # redder... for any impact range).
            $plot->add_edge($idname => $fooname, size => $impact);
        }   # close while loop

        # and we print the plot to an svg file, or gv, or png, or txt...
        # an svg file looks reasonably good in a browser and you can zoom it a lot
        $plot->as_svg("my_big_plot.svg");
        system("iceweasel my_big_plot.svg");

        Something like this. I warn you that in real life it's not so simple; this is only a skeleton and WILL fail for sure until you work on it and tune it a little. But you can explore the idea in any case, especially if you feel comfortable with dot files.

      That's... sick and delightful. I probably won't go that route. But I'm going to keep that idea in my bag of tricks as it will almost certainly come up before too terribly long.

Node Type: perlquestion [id://920714]
Approved by blue_cowdawg
Front-paged by chrestomanci