Consolidating biological data into occurance of each biological unit per sample in table

ejohnston7 has asked for the wisdom of the Perl Monks concerning the following question:

The first one who helps me with this to the point where I can get it to work on my dataset can get a $15 paypal payment. Or you can have me donate it to PerlMonks, or a charity, or you could let me keep the money. Either way, this is very important to me and a deadline is coming up, so I am throwing that offer out there. This is my first time posting here so please bear with me. I am new to programming and have been successfully manipulating my data in perl until now. My data is in the following format:

occurence1    A    a__bear;c__black
occurence2    B    a__wolf;c__grey
occurence3    A    a__wolf;c__white
occurence4    A    a__bear;c__
occurence5    C    a__wolf;c__grey
occurence6    C    a__bear;c__brown
occurence7    A    a__wolf;c__
occurence8    B    a__wolf;c__
occurence9    C    a__bear;c__black
occurence10    C    a__wolf;c__
occurence11    A    a__wolf;c__red
occurence12    B    a__wolf;c__grey
occurence13    C    a__wolf;c__grey
occurence14    C    a__wolf;c__grey
occurence15    B    a__bear;c__brown
occurence16    C    a__bear;c__brown
occurence17    A    a__bear;c__
occurence18    A    a__bear;c__brown
occurence19    C    a__wolf;c__white
occurence20    B    a__wolf;c__grey
occurence21    B    a__bear;c__
occurence22    B    a__wolf;c__grey
occurence23    A    a__wolf;c__grey
occurence24    A    a__bear;c__brown
occurence25    C    a__bear;c__brown
occurence26    A    a__bear;c__brown
occurence27    C    a__bear;c__
occurence28    C    a__bear;c__brown
occurence29    B    a__wolf;c__red
occurence30    B    a__wolf;c__grey
[download]

Data is tab delimited. In the real dataset, instead of 30 occurrences (data points/lines) , there are ~5 million. There are also A LOT more animals in my data than just wolves and bears (I just included those two for simplification), so it would be unhelpful to create a script that contained if statements that specifically use words like 'bear' and 'wolf', because then we would need to make if statements for the hundreds of possible animals that can be spotted. Similarly, there are hundreds of possible colors for each animal. However, if you show me how to get to my desired end with this data then I'm sure I can apply it to my real data set. The data contained within the first column is not important, it just means that each row is a unique data point, that information doesn't need to exist in my desired output file. The numbers in the second column represent the sample ID. The third column is (you guessed it!) the animal ID. 'a__' is a broad observation of the animal and 'c__' is more specific information about the observed animal. For example, for the first data point, I observed a black bear in sample A. Note that for some data points, the color of the animal was not obtained. I am trying to generate a consolidated table that gives the number of occurrences of each animal per sample. I am imagining a table where the column headers are Sample IDs A, B, C, etc. and the rows represent the animal. The table will be filled with values (integers) that represent the number of times the animal occurred in that specific sample. It should look something like this....

    A    B    C
wolf    4    7    5
bear    6    2    6
[download]

I also want a table that gives similar information, but the rows represent the possible colors of the animal, like this

    A    B    C
black    1    0    1
grey    1    5    3
white    1    0    1
brown    3    1    4
red    1    1    0
[download]

Note that for the second table, if an occurrence (data point) did not contain a color, than that data should not be considered. Notice how the sum of all of the points in the second table is less than the sum of all data points in the second table. Comments throughout the script that let me know whats being done at each step would be super helpful, so I can alter it in the future for related data manipulations. Thanks in advance to anyone who tries to help me with this. I have been reading through forums for many hours trying to get answers.

Comment on Consolidating biological data into occurance of each biological unit per sample in table Select or Download Code

Replies are listed 'Best First'.

Re: Consolidating biological data into occurance of each biological unit per sample in table
by 2teez (Vicar) on Jan 13, 2013 at 04:00 UTC

Hi ejohnston7,

Hash to the rescue... Using the data given by the OP. One can do like so:

use warnings;
use strict;

my %analyse;

while (<DATA>) {
    chomp;
    if ( my ( $animal_type, $animal_name, $animal_color ) =
        m/.+?([A-Z]).+?__(.+?);.+?__(.+?)?$/ )
    {
        $analyse{$animal_name}{$animal_type}++;
        $analyse{$animal_color}{$animal_name}{$animal_type}++
          if defined $animal_color;
    }

}

use Data::Dump;
dd \%analyse;

__DATA__
occurence1    A    a__bear;c__black
occurence2    B    a__wolf;c__grey
occurence3    A    a__wolf;c__white
occurence4    A    a__bear;c__
occurence5    C    a__wolf;c__grey
occurence6    C    a__bear;c__brown
occurence7    A    a__wolf;c__
occurence8    B    a__wolf;c__
occurence9    C    a__bear;c__black
occurence10    C    a__wolf;c__
occurence11    A    a__wolf;c__red
occurence12    B    a__wolf;c__grey
occurence13    C    a__wolf;c__grey
occurence14    C    a__wolf;c__grey
occurence15    B    a__bear;c__brown
occurence16    C    a__bear;c__brown
occurence17    A    a__bear;c__
occurence18    A    a__bear;c__brown
occurence19    C    a__wolf;c__white
occurence20    B    a__wolf;c__grey
occurence21    B    a__bear;c__
occurence22    B    a__wolf;c__grey
occurence23    A    a__wolf;c__grey
occurence24    A    a__bear;c__brown
occurence25    C    a__bear;c__brown
occurence26    A    a__bear;c__brown
occurence27    C    a__bear;c__
occurence28    C    a__bear;c__brown
occurence29    B    a__wolf;c__red
occurence30    B    a__wolf;c__grey
[download]

Output from the code above:

{
  bear  => { A => 6, B => 2, C => 6 },
  black => { bear => { A => 1, C => 1 } },
  brown => { bear => { A => 3, B => 1, C => 4 } },
  grey  => { wolf => { A => 1, B => 5, C => 3 } },
  red   => { wolf => { A => 1, B => 1 } },
  white => { wolf => { A => 1, C => 1 } },
  wolf  => { A => 4, B => 7, C => 5 },
}
[download]

Update:

$analyse{$animal_color}{$animal_name}{$animal_type}++
          if defined $animal_color;
[download]

$analyse{$animal_color}{$animal_type}++
          if defined $animal_color;
[download]

{
  bear  => { A => 6, B => 2, C => 6 },
  black => { A => 1, C => 1 },
  brown => { A => 3, B => 1, C => 4 },
  grey  => { A => 1, B => 5, C => 3 },
  red   => { A => 1, B => 1 },
  white => { A => 1, C => 1 },
  wolf  => { A => 4, B => 7, C => 5 },
}
[download]

If you tell me, I'll forget.
If you show me, I'll remember.
if you involve me, I'll understand.
--- Author unknown to me

[reply]
[d/l]
[select]

Re: Consolidating biological data into occurance of each biological unit per sample in table
by Cristoforo (Curate) on Jan 13, 2013 at 03:49 UTC

#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;

my (%animal, %color);
while (<DATA>) {
    chomp;
    my (undef, $sample, $animal, $color) = split /[\t;]/;
    s/^\w__// for $animal, $color;
    
    $animal{$sample}{$animal}++;
    $color{$sample}{$color}++;
}

for my $entity (\%animal, \%color) {
    my @samples = sort keys %$entity;
    my $tb = Text::Table->new( map {title => $_}, " ", @samples);

    my %seen;
    my @keys = grep $_ && !$seen{$_}++, map keys %$_, values %$entity;
    
    for my $key (@keys) {
        $tb->load( [$key, map $entity->{$_}{$key} || 0, @samples] );
    }
    
    print $tb;    
    print "\n\n";
}
[download]

Update:

Two possible solutions I can think of would be to create a comma separated values file with the program output to be read by a spreadsheet program like Excel. Excel allows you to freeze the headers or first column, (animal or color), so you can scroll through the output and still keep the categories in sight.

The other possibility would be to load the processed data into a database like SQLite and then query the database for output.

The output from the above program is:

C:\Old_Data\perlp>perl t4.pl
     A B C
bear 6 2 6
wolf 4 7 5


      A B C
white 1 0 1
black 1 0 1
brown 3 1 4
red   1 1 0
grey  1 5 3
[download]

[reply]
[d/l]
[select]

Re: Consolidating biological data into occurance of each biological unit per sample in table
by CountZero (Bishop) on Jan 13, 2013 at 09:39 UTC

split

use Modern::Perl;

while (<DATA>) {
    chomp;
    my ($id, $sample_id, undef, $animal, undef, $color) = split /[\t_;
+]+/;
    say "$id ($sample_id) is a $animal colored $color";
}

__DATA__
occurence1    A    a__bear;c__black
occurence2    B    a__wolf;c__grey
occurence3    A    a__wolf;c__white
occurence4    A    a__bear;c__
occurence5    C    a__wolf;c__grey
occurence6    C    a__bear;c__brown
[download]

occurence1 (A) is a bear colored black
occurence2 (B) is a wolf colored grey
occurence3 (A) is a wolf colored white
occurence4 (A) is a bear colored 
occurence5 (C) is a wolf colored grey
occurence6 (C) is a bear colored brown
[download]

use Modern::Perl;
use DBI;

my $dbh = DBI->connect(
"dbi:SQLite:dbname=c:/data/strawberry-perl/perl/script-chrome/animals.
+sqlite",
    "", ""
) or die "Could not open database";
my $statement =
q{CREATE TABLE IF NOT EXISTS "main"."sightings" ("id" TEXT PRIMARY KEY
+ NOT NULL, "sampleid" TEXT, "animal" TEXT, "color" TEXT)};
my $rows = $dbh->do($statement) or die $dbh->errstr;

$dbh->{AutoCommit} = 0;
$dbh->do('BEGIN TRANSACTION');
my $sth = $dbh->prepare(
    q{INSERT INTO sightings (id, sampleid, animal, color) VALUES (?, ?
+, ?, ?)})
  or die "Could not prepare INSERT statement";
while (<DATA>) {
    chomp;
    my ( $id, $sample_id, undef, $animal, undef, $color ) = split /[\t
+_;]+/;
    $color ||= 'unknown';
    $sth->execute( $id, $sample_id, $animal, $color );
}
$dbh->commit;
[download]

SELECT sampleid, animal, count(*) FROM sightings GROUP BY sampleid, an
+imal
[download]

"A","bear","6"
"A","wolf","4"
"B","bear","2"
"B","wolf","7"
"C","bear","6"
"C","wolf","5"
[download]

SELECT sampleid, color, count(*) FROM sightings GROUP BY sampleid, col
+or
[download]

"A","black","1"
"A","brown","3"
"A","grey","1"
"A","red","1"
"A","unknown","3"
"A","white","1"
"B","brown","1"
"B","grey","5"
"B","red","1"
"B","unknown","2"
"C","black","1"
"C","brown","4"
"C","grey","3"
"C","unknown","2"
"C","white","1"
[download]

CountZero

A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics

[reply]
[d/l]
[select]

Re: Consolidating biological data into occurance of each biological unit per sample in table
by Anonymous Monk on Jan 13, 2013 at 02:06 UTC

pivot table is pivot view

Use Text::CSV or Text::xSV

Examples at

Merge two files with similar column entries
Opening multiple files ( csvpaste.pl )
Open multiple file handles?
X, Y Table structure
How to extract the particular residues from PDB files(text csv split hash bioperl)
Select only desired features from a text (text csv split hash bioperl)
Sort on Table headers (text csv split hash bioperl)
I need help joining tab-delimited files/tables!
split then join based on common value in field
Smart way to read a file vertically?
reformatting tab delimited file
edit a CSV and "in-place" replacement
Re: Reversing a mysql table, transpose-tsv -- invert a tab-delimited table

Text::CSV

examples/csv2xls Script to onvert CSV files to M$Excel

examples/csv-check Script to check a CSV file/stream

examples/csvdiff Script to shoff diff between sorted CSV files

examples/parser-xs.pl Parse CSV stream, be forgiving on bad lines

[reply]

Re: Consolidating biological data into occurance of each biological unit per sample in table
by Cristoforo (Curate) on Jan 14, 2013 at 16:11 UTC

I am new to programming and have been successfully manipulating my data in perl until now.

ejohnston7

Considering your remark, (above), about being new to programming, it seems like the code I gave may have had some new and unfamiliar things in it that you may not understand. I know when I started Perl, I didn't get some of the common idioms that experienced programmers in Perl had been using. In this case, the use of map, (apply a change to a list to get back a new list with the changes), and the Hash of Hash data structure and how to access it.

Here is the comma separated value approach, (below), and I'll try to follow it up with some explanation. If you have any questions, do come back and someone will try to explain what you ask about.

#!/usr/bin/perl
use strict;
use warnings;

my %data;
while (<DATA>) {
    chomp;
    my (undef, $sample, @fields) = split /[\t;]/;
    
    for (@fields) {
        my ($type, $value) = split /__/;
        $data{$type}{$sample}{$value}++ if $value;    
    }
}

for my $type (keys %data) {
    my $entity = $data{$type};
    my @samples = sort keys %$entity;
    
    my %seen;
    my @keys = grep !$seen{$_}++, map keys %$_, values %$entity;
    
    open my $fh, '>', "$type.csv" or die "Unable to create '$type.csv'
+. $!";
    
    print $fh join(",", ' ', @samples), "\n";
    for my $key (@keys) {
        print $fh join(",", $key, map $entity->{$_}{$key} || 0, @sampl
+es), "\n";
    }
    
    close $fh or die "Unable to close '$type.csv'. $!";
}

__DATA__
occurence1    A    a__bear;c__black
occurence2    B    a__wolf;c__grey
occurence3    A    a__wolf;c__white
occurence4    A    a__bear;c__
occurence5    C    a__wolf;c__grey
occurence6    C    a__bear;c__brown
occurence7    A    a__wolf;c__
occurence8    B    a__wolf;c__
occurence9    C    a__bear;c__black
occurence10    C    a__wolf;c__
occurence11    A    a__wolf;c__red
occurence12    B    a__wolf;c__grey
occurence13    C    a__wolf;c__grey
occurence14    C    a__wolf;c__grey
occurence15    B    a__bear;c__brown
occurence16    C    a__bear;c__brown
occurence17    A    a__bear;c__
occurence18    A    a__bear;c__brown
occurence19    C    a__wolf;c__white
occurence20    B    a__wolf;c__grey
occurence21    B    a__bear;c__
occurence22    B    a__wolf;c__grey
occurence23    A    a__wolf;c__grey
occurence24    A    a__bear;c__brown
occurence25    C    a__bear;c__brown
occurence26    A    a__bear;c__brown
occurence27    C    a__bear;c__
occurence28    C    a__bear;c__brown
occurence29    B    a__wolf;c__red
occurence30    B    a__wolf;c__grey
[download]

C:\Old_Data\perlp>type a.csv
 ,A,B,C
bear,6,2,6
wolf,4,7,5

C:\Old_Data\perlp>type c.csv
 ,A,B,C
white,1,0,1
black,1,0,1
brown,3,1,4
red,1,1,0
grey,1,5,3
[download]

The first thing I'd like to do is provide a picture of what the %data hash contains using Data::Dumper. (I got this by placing the statement use Data::Dumper; print Dumper \%data; right after the while loop and before the for loop. I use Data::Dumper alot to see what exactly is in a data structure I created to see if everything is allright.

C:\Old_Data\perlp>perl t5.pl
$VAR1 = {
          'c' => {
                   'A' => {
                            'white' => 1,
                            'black' => 1,
                            'brown' => 3,
                            'red' => 1,
                            'grey' => 1
                          },
                   'C' => {
                            'white' => 1,
                            'black' => 1,
                            'brown' => 4,
                            'grey' => 3
                          },
                   'B' => {
                            'red' => 1,
                            'brown' => 1,
                            'grey' => 5
                          }
                 },
          'a' => {
                   'A' => {
                            'bear' => 6,
                            'wolf' => 4
                          },
                   'C' => {
                            'bear' => 6,
                            'wolf' => 5
                          },
                   'B' => {
                            'bear' => 2,
                            'wolf' => 7
                          }
                 }
        };
[download]

$data{$type}{$sample}{$value}++ if $value;

$type could be 'a' or 'c', (from your sample data). $sample is 'A', 'B' or 'C' and $value would be the name of the animal or the color. (Note that the statement ends with if $value;. In your explanation of the problem, you didn't want to count values that had no name.
occurence7 A a__wolf;c__

There is no color here so it wouldn't be added to the hash.

while (<DATA>) is shorthand for while (defined $_ = <DATA>).

chomp with no argument chomps $_ by default.

Likewise, split without an argument operates on $_ as well, split /[\t;]/.

In the for loop, for (@fields), each element of the array being iterated over is assigned to $_, not the same $_ from the while loop but $_ localized to the for loop. They do not clash.

Thats just some of the explanation, but enough to help you begin to understand hopefully. I have to leave now, but ask any questions about what you don't understand.

Hope this explains a little for you.

[reply]
[d/l]
[select]


Pathologically Eclectic Rubbish Lister
	PerlMonks