Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW

Help munging tabular data

by c4onastick (Friar)
on Aug 01, 2007 at 23:28 UTC ( #630186=perlquestion: print w/replies, xml ) Need Help??
c4onastick has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I've got some data in tab-separated values that comes in the form:
Data Set 1Data Set 2Data Set 3
X unitsY unitsX unitsY unitsX unitsY units
I'd like to get that in to either a hash or an array by reading the file in from STDIN. I need to be able to query the data asking things like, "What's Data Set 1's y value at Data Set 3's x3 value?" Wherein lies the problem, I need to keep the "Data Set" labels associated with the data. I started writing it like this:
use warnings; my %data_sets; while(<>){ # I did the $line thing because there's some garbage at the top of +the file that I don't want my $line++; if($line == 2){ while(/\G([^\t]+)/g){ $data_sets{$_} = {}; } } if($line > 3){ $data_sets{}; #<- ran into the problem right here! The script do +esn't know what data set its on! }

So my question is: what would be the appropriate complex internal data structure to parse the data into (I was going with a hash of hashes, maybe a hash of arrays?) I should mention that the reason I have this problem is that Data Set 1's x1 does not necessarily equal Data Set 2's x1.

I've looked at Data::Table, and for the analysis that I need to do on the data, it doesn't quite seem to fit my needs.
Thanks in advance for your wisdom.

Replies are listed 'Best First'.
Re: Help munging tabular data
by BrowserUk (Pope) on Aug 02, 2007 at 00:13 UTC

    Several structures are possible. I've used a HoHoA here. You might want to strip spaces from the hash keys if you have to make many references to them in your code.

    #! perl -slw use strict; use Data::Dump qw[ pp ]; my %data; chomp( my @L1Headers = split "\t", scalar <DATA> ); chomp( my @L2Headers = split "\t", scalar <DATA> ); @data{ @L1Headers } = map{ +{ map{ $_ => undef } @L2Headers[ 0, 1 ] } } 1 .. @L1Headers; while( <DATA> ) { chomp; my @values = split "\t"; for my $l1 ( @L1Headers ) { for my $l2 ( keys %{ $data{ $l1 } } ) { push @{ $data{ $l1 }{ $l2 } }, shift @values; } } } print pp \%data; ## "What's Data Set 1's y value at Data Set 3's x3 value?" for ( 0 .. $#{ $data{ 'Data Set 1' }{ 'X units' } } ) { $data{ 'Data Set 3' }{ 'X units' }[ $_ ] eq 'x3' and print "DS3-X == x3, DS1-Y == ", $data{ 'Data Set 1' }{ 'Y units' }[ $_ ]; } __DATA__ Data Set 1 Data Set 2 Data Set 3 X units Y units X units Y units X units Y units x1 y1 x1 y1 x1 y1 x2 y2 x2 y2 x2 y2 x3 y3 x3 y3 x3 y3


    c:\test>junk2 { "Data Set 1" =>{ "X units" =>["x1","x2","x3"], "Y units" =>["y1","y2 +","y3"] }, "Data Set 2" =>{ "X units" =>["x1","x2","x3"], "Y units" =>["y1","y2 +","y3"] }, "Data Set 3" =>{ "X units" =>["x1","x2","x3"], "Y units" =>["y1","y2 +","y3"] }, } DS3-X == x3, DS1-Y == y3

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Help munging tabular data
by GrandFather (Sage) on Aug 02, 2007 at 00:58 UTC

    Depends how you want to access it more than anything else. Either a HoHoA, a HoAoA or a HoAoH would fit the bill. This is a HoHoA variant:

    use strict; use warnings; my %data_sets; my @dataHeaders = split /,/, ($_ = <DATA>, chomp, $_); my @setHeaders = split /,/, ($_ = <DATA>, chomp, $_); splice @setHeaders, 2; # Assume paired columns and the same for all da +ta sets while (<DATA>) { chomp; my @fields = split /,/; Outer: for my $dataSet (@dataHeaders) { for my $field (@setHeaders) { last Outer unless @fields; push @{$data_sets{$dataSet}{$field}}, shift @fields; } } } my $lu = $data_sets{'Data Set 3'}{'X units'}[2]; # Data Set 3's x3 val +ue my @matches = grep {$data_sets{'Data Set 1'}{'X units'}[$_] eq $lu} 0 .. $#{$data_sets{'Data Set 1'}{'X units'}}; for my $index (@matches) { print "Data Set 1 element X" . ($index + 1) . "'s Y = $data_sets{'Data Set 1'}{'Y units'}[$in +dex]\n"; } __DATA__ Data Set 1,Data Set 2,Data Set 3 X units,Y units,X units,Y units,X units,Y units 1,y1,x1,y1,2,y1 2,y2,x2,y2,3,y2 3,y3,x3,y3,1,y3


    Data Set 1 element X1's Y = y1

    The s/\t/,/g was to avoid unnoticed editor foibles.

    DWIM is Perl's answer to Gödel
      the above are excellent answers if you're after a single way to query the data; with a database/SQL implementation, you can filter the results many ways, including filtering on particular values of fields etc.
      the hardest line to type correctly is: stty erase ^H
Re: Help munging tabular data
by aquarium (Curate) on Aug 02, 2007 at 00:13 UTC
    Try DBD::CSV you can run SQL queries against the data set. btw..i would name the fields (columns) as follows for purposes of making the table more SQL friendly: x1,y1,x2,y2,x3,y3. Whilst this is not a normalized table, it would do the job..if you don't plan to add more tables. if you do plan to add more tables and functionality, then it would be best to normalize the table(s).
    This table normalized would become two tables, something like:
    table: Dataset
    fields: dataset_id, description
    primary key(s): dataset_id

    table: Point
    fields: dataset_id, sequence_id,x,y
    primary key(s): dataset_id + sequence_id

    Then the SQL for you question becomes a simple join and with a "where" clause.
    the hardest line to type correctly is: stty erase ^H
Re: Help munging tabular data
by Cristoforo (Curate) on Aug 04, 2007 at 03:42 UTC
    Along the lines of what aquarium suggested, here is a small program to create a database approach using DBD::SQLite. Its been a while since I've seen database coding and there are some things that I wasn't able to do like setting a dual primary key on one of the tables. Plus, if you declare a type to be INTEGER PRIMARY KEY, in SQLite it will autoincrement that column for each INSERT INTO statement. But I haven't got that worked out either

    At the end of the code are queries for each of the 2 tables (showing all their contents).

    #!/usr/bin/perl use strict; use warnings; use DBI; my @data; chomp(my @sets = split "\t", <DATA>); <DATA>; while (<DATA>) { my @xys = split; for my $dataset (0..@xys/2-1) { push @{ $data[$dataset] }, [splice @xys, 0, 2]; } } my $dbh = DBI->connect("dbi:SQLite:dbname=datapoints_01.lite","","") o +r die $!; $dbh->do(qq{ CREATE TABLE data_sets (id INTEGER, name TEXT) }); my $sth = $dbh->prepare(q{INSERT INTO data_sets VALUES(?, ?)}) or die +$dbh->errstr; my $id; for my $name (@sets) { $sth->execute(++$id, $name) or die $dbh->errstr; } $dbh->do(qq{ CREATE TABLE data_pts (set_id INTEGER, id INTEGER, x INTEGER, y INTEGER) }); $sth = $dbh->prepare(q{INSERT INTO data_pts VALUES(?, ?, ?, ?)}) or die $dbh->errstr; for my $set_id (0..$#data) { for my $point_id (0..$#{ $data[$set_id] }) { $sth->execute($set_id+1, $point_id+1, @{ $data[$set_id][$point +_id] }) or die $dbh->errstr; } } $sth->finish; $dbh->disconnect or die $dbh->errstr; __DATA__ Data Set 1 Data Set 2 Data Set 3 X units Y units X units Y units X units Y units 1 2 10 20 100 200 3 4 30 40 300 400 5 6 50 60 500 600 __END__ C:\perlp>sqlite3 datapoints_01.lite SQLite version 3.3.8 Enter ".help" for instructions sqlite> select * from data_sets; 1|Data Set 1 2|Data Set 2 3|Data Set 3 sqlite> select * from data_pts; 1|1|1|2 1|2|3|4 1|3|5|6 2|1|10|20 2|2|30|40 2|3|50|60 3|1|100|200 3|2|300|400 3|3|500|600 sqlite>

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://630186]
Approved by GrandFather
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (5)
As of 2017-12-14 06:45 GMT
Find Nodes?
    Voting Booth?
    What programming language do you hate the most?

    Results (384 votes). Check out past polls.