Help munging tabular data

c4onastick has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I've got some data in tab-separated values that comes in the form:

Data Set 1		Data Set 2		Data Set 3
X units	Y units	X units	Y units	X units	Y units
x1	y1	x1	y1	x1	y1
x2	y2	x2	y2	x2	y2
x3	y3	x3	y3	x3	y3
...	...	...	...	...	...

I'd like to get that in to either a hash or an array by reading the file in from STDIN. I need to be able to query the data asking things like, "What's Data Set 1's y value at Data Set 3's x3 value?" Wherein lies the problem, I need to keep the "Data Set" labels associated with the data. I started writing it like this:

use warnings;

my %data_sets;

while(<>){
   # I did the $line thing because there's some garbage at the top of 
+the file that I don't want
   my $line++;
   if($line == 2){
      while(/\G([^\t]+)/g){
         $data_sets{$_} = {};
      }
   }
   if($line > 3){
      $data_sets{}; #<- ran into the problem right here! The script do
+esn't know what data set its on!
   }
[download]

So my question is: what would be the appropriate complex internal data structure to parse the data into (I was going with a hash of hashes, maybe a hash of arrays?) I should mention that the reason I have this problem is that Data Set 1's x1 does not necessarily equal Data Set 2's x1.

I've looked at Data::Table, and for the analysis that I need to do on the data, it doesn't quite seem to fit my needs.
Thanks in advance for your wisdom.

Comment on Help munging tabular data Download Code

Replies are listed 'Best First'.
Re: Help munging tabular data by BrowserUk (Patriarch) on Aug 02, 2007 at 00:13 UTC
Several structures are possible. I've used a HoHoA here. You might want to strip spaces from the hash keys if you have to make many references to them in your code. #! perl -slw use strict; use Data::Dump qw[ pp ]; my %data; chomp( my @L1Headers = split "\t", scalar <DATA> ); chomp( my @L2Headers = split "\t", scalar <DATA> ); @data{ @L1Headers } = map{ +{ map{ $_ => undef } @L2Headers[ 0, 1 ] } } 1 .. @L1Headers; while( <DATA> ) { chomp; my @values = split "\t"; for my $l1 ( @L1Headers ) { for my $l2 ( keys %{ $data{ $l1 } } ) { push @{ $data{ $l1 }{ $l2 } }, shift @values; } } } print pp \%data; ## "What's Data Set 1's y value at Data Set 3's x3 value?" for ( 0 .. $#{ $data{ 'Data Set 1' }{ 'X units' } } ) { $data{ 'Data Set 3' }{ 'X units' }[ $_ ] eq 'x3' and print "DS3-X == x3, DS1-Y == ", $data{ 'Data Set 1' }{ 'Y units' }[ $_ ]; } __DATA__ Data Set 1 Data Set 2 Data Set 3 X units Y units X units Y units X units Y units x1 y1 x1 y1 x1 y1 x2 y2 x2 y2 x2 y2 x3 y3 x3 y3 x3 y3 [download] Outputs `c:\test>junk2 { "Data Set 1" =>{ "X units" =>["x1","x2","x3"], "Y units" =>["y1","y2 +","y3"] }, "Data Set 2" =>{ "X units" =>["x1","x2","x3"], "Y units" =>["y1","y2 +","y3"] }, "Data Set 3" =>{ "X units" =>["x1","x2","x3"], "Y units" =>["y1","y2 +","y3"] }, } DS3-X == x3, DS1-Y == y3` [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re: Help munging tabular data by GrandFather (Saint) on Aug 02, 2007 at 00:58 UTC
Depends how you want to access it more than anything else. Either a HoHoA, a HoAoA or a HoAoH would fit the bill. This is a HoHoA variant: use strict; use warnings; my %data_sets; my @dataHeaders = split /,/, ($_ = <DATA>, chomp, $_); my @setHeaders = split /,/, ($_ = <DATA>, chomp, $_); splice @setHeaders, 2; # Assume paired columns and the same for all da +ta sets while (<DATA>) { chomp; my @fields = split /,/; Outer: for my $dataSet (@dataHeaders) { for my $field (@setHeaders) { last Outer unless @fields; push @{$data_sets{$dataSet}{$field}}, shift @fields; } } } my $lu = $data_sets{'Data Set 3'}{'X units'}[2]; # Data Set 3's x3 val +ue my @matches = grep {$data_sets{'Data Set 1'}{'X units'}[$_] eq $lu} 0 .. $#{$data_sets{'Data Set 1'}{'X units'}}; for my $index (@matches) { print "Data Set 1 element X" . ($index + 1) . "'s Y = $data_sets{'Data Set 1'}{'Y units'}[$in +dex]\n"; } __DATA__ Data Set 1,Data Set 2,Data Set 3 X units,Y units,X units,Y units,X units,Y units 1,y1,x1,y1,2,y1 2,y2,x2,y2,3,y2 3,y3,x3,y3,1,y3 [download] Prints: `Data Set 1 element X1's Y = y1` [download] The s/\t/,/g was to avoid unnoticed editor foibles. DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re^2: Help munging tabular data by aquarium (Curate) on Aug 02, 2007 at 04:21 UTC
the above are excellent answers if you're after a single way to query the data; with a database/SQL implementation, you can filter the results many ways, including filtering on particular values of fields etc. the hardest line to type correctly is: stty erase ^H	[reply]
Re: Help munging tabular data by aquarium (Curate) on Aug 02, 2007 at 00:13 UTC
Try DBD::CSV module...so you can run SQL queries against the data set. btw..i would name the fields (columns) as follows for purposes of making the table more SQL friendly: x1,y1,x2,y2,x3,y3. Whilst this is not a normalized table, it would do the job..if you don't plan to add more tables. if you do plan to add more tables and functionality, then it would be best to normalize the table(s). This table normalized would become two tables, something like: table: Dataset fields: dataset_id, description primary key(s): dataset_id table: Point fields: dataset_id, sequence_id,x,y primary key(s): dataset_id + sequence_id Then the SQL for you question becomes a simple join and with a "where" clause. the hardest line to type correctly is: stty erase ^H	[reply]
Re: Help munging tabular data by Cristoforo (Curate) on Aug 04, 2007 at 03:42 UTC
Along the lines of what aquarium suggested, here is a small program to create a database approach using DBD::SQLite. Its been a while since I've seen database coding and there are some things that I wasn't able to do like setting a dual primary key on one of the tables. Plus, if you declare a type to be INTEGER PRIMARY KEY, in SQLite it will autoincrement that column for each INSERT INTO statement. But I haven't got that worked out either At the end of the code are queries for each of the 2 tables (showing all their contents). Chris #!/usr/bin/perl use strict; use warnings; use DBI; my @data; chomp(my @sets = split "\t", <DATA>); <DATA>; while (<DATA>) { my @xys = split; for my $dataset (0..@xys/2-1) { push @{ $data[$dataset] }, [splice @xys, 0, 2]; } } my $dbh = DBI->connect("dbi:SQLite:dbname=datapoints_01.lite","","") o +r die $!; $dbh->do(qq{ CREATE TABLE data_sets (id INTEGER, name TEXT) }); my $sth = $dbh->prepare(q{INSERT INTO data_sets VALUES(?, ?)}) or die +$dbh->errstr; my $id; for my $name (@sets) { $sth->execute(++$id, $name) or die $dbh->errstr; } $dbh->do(qq{ CREATE TABLE data_pts (set_id INTEGER, id INTEGER, x INTEGER, y INTEGER) }); $sth = $dbh->prepare(q{INSERT INTO data_pts VALUES(?, ?, ?, ?)}) or die $dbh->errstr; for my $set_id (0..$#data) { for my $point_id (0..$#{ $data[$set_id] }) { $sth->execute($set_id+1, $point_id+1, @{ $data[$set_id][$point +_id] }) or die $dbh->errstr; } } $sth->finish; $dbh->disconnect or die $dbh->errstr; __DATA__ Data Set 1 Data Set 2 Data Set 3 X units Y units X units Y units X units Y units 1 2 10 20 100 200 3 4 30 40 300 400 5 6 50 60 500 600 __END__ C:\perlp>sqlite3 datapoints_01.lite SQLite version 3.3.8 Enter ".help" for instructions sqlite> select * from data_sets; 1\|Data Set 1 2\|Data Set 2 3\|Data Set 3 sqlite> select * from data_pts; 1\|1\|1\|2 1\|2\|3\|4 1\|3\|5\|6 2\|1\|10\|20 2\|2\|30\|40 2\|3\|50\|60 3\|1\|100\|200 3\|2\|300\|400 3\|3\|500\|600 sqlite> [download]	[reply] [d/l]


No such thing as a small change
	PerlMonks