Beefy Boxes and Bandwidth Generously Provided by pair Networks Cowboy Neal with Hat
The stupid question is the question not asked
 
PerlMonks  

Consolidating biological data into occurance of each biological unit per sample in table

by ejohnston7 (Initiate)
on Jan 13, 2013 at 01:08 UTC ( #1013076=perlquestion: print w/ replies, xml ) Need Help??
ejohnston7 has asked for the wisdom of the Perl Monks concerning the following question:

The first one who helps me with this to the point where I can get it to work on my dataset can get a $15 paypal payment. Or you can have me donate it to PerlMonks, or a charity, or you could let me keep the money. Either way, this is very important to me and a deadline is coming up, so I am throwing that offer out there. This is my first time posting here so please bear with me. I am new to programming and have been successfully manipulating my data in perl until now. My data is in the following format:

occurence1 A a__bear;c__black occurence2 B a__wolf;c__grey occurence3 A a__wolf;c__white occurence4 A a__bear;c__ occurence5 C a__wolf;c__grey occurence6 C a__bear;c__brown occurence7 A a__wolf;c__ occurence8 B a__wolf;c__ occurence9 C a__bear;c__black occurence10 C a__wolf;c__ occurence11 A a__wolf;c__red occurence12 B a__wolf;c__grey occurence13 C a__wolf;c__grey occurence14 C a__wolf;c__grey occurence15 B a__bear;c__brown occurence16 C a__bear;c__brown occurence17 A a__bear;c__ occurence18 A a__bear;c__brown occurence19 C a__wolf;c__white occurence20 B a__wolf;c__grey occurence21 B a__bear;c__ occurence22 B a__wolf;c__grey occurence23 A a__wolf;c__grey occurence24 A a__bear;c__brown occurence25 C a__bear;c__brown occurence26 A a__bear;c__brown occurence27 C a__bear;c__ occurence28 C a__bear;c__brown occurence29 B a__wolf;c__red occurence30 B a__wolf;c__grey

Data is tab delimited. In the real dataset, instead of 30 occurrences (data points/lines) , there are ~5 million. There are also A LOT more animals in my data than just wolves and bears (I just included those two for simplification), so it would be unhelpful to create a script that contained if statements that specifically use words like 'bear' and 'wolf', because then we would need to make if statements for the hundreds of possible animals that can be spotted. Similarly, there are hundreds of possible colors for each animal. However, if you show me how to get to my desired end with this data then I'm sure I can apply it to my real data set. The data contained within the first column is not important, it just means that each row is a unique data point, that information doesn't need to exist in my desired output file. The numbers in the second column represent the sample ID. The third column is (you guessed it!) the animal ID. 'a__' is a broad observation of the animal and 'c__' is more specific information about the observed animal. For example, for the first data point, I observed a black bear in sample A. Note that for some data points, the color of the animal was not obtained. I am trying to generate a consolidated table that gives the number of occurrences of each animal per sample. I am imagining a table where the column headers are Sample IDs A, B, C, etc. and the rows represent the animal. The table will be filled with values (integers) that represent the number of times the animal occurred in that specific sample. It should look something like this....

A B C wolf 4 7 5 bear 6 2 6

I also want a table that gives similar information, but the rows represent the possible colors of the animal, like this

A B C black 1 0 1 grey 1 5 3 white 1 0 1 brown 3 1 4 red 1 1 0

Note that for the second table, if an occurrence (data point) did not contain a color, than that data should not be considered. Notice how the sum of all of the points in the second table is less than the sum of all data points in the second table. Comments throughout the script that let me know whats being done at each step would be super helpful, so I can alter it in the future for related data manipulations. Thanks in advance to anyone who tries to help me with this. I have been reading through forums for many hours trying to get answers.

Comment on Consolidating biological data into occurance of each biological unit per sample in table
Select or Download Code
Re: Consolidating biological data into occurance of each biological unit per sample in table
by Anonymous Monk on Jan 13, 2013 at 02:06 UTC
Re: Consolidating biological data into occurance of each biological unit per sample in table
by Cristoforo (Chaplain) on Jan 13, 2013 at 03:49 UTC
    This would apply only to the data you provided - an animal name and its color, coded as they are.
    #!/usr/bin/perl use strict; use warnings; use Text::Table; my (%animal, %color); while (<DATA>) { chomp; my (undef, $sample, $animal, $color) = split /[\t;]/; s/^\w__// for $animal, $color; $animal{$sample}{$animal}++; $color{$sample}{$color}++; } for my $entity (\%animal, \%color) { my @samples = sort keys %$entity; my $tb = Text::Table->new( map {title => $_}, " ", @samples); my %seen; my @keys = grep $_ && !$seen{$_}++, map keys %$_, values %$entity; for my $key (@keys) { $tb->load( [$key, map $entity->{$_}{$key} || 0, @samples] ); } print $tb; print "\n\n"; }
    Update: Upon reflection, several points occured to me. For one thing, this code doesn't deal with any possible additional fields other than 'animal' and 'color'. That could be fixed with some minor changes to the code. And using Text::Table for the presentation is probably the wrong choice given that there may be 100's of animals/colors (as well as more than a dozen or so 'samples', (A to Z??). I think Text::Tables is more suited to smaller datasets that you can see on 1 or 2 pages of output.

    Two possible solutions I can think of would be to create a comma separated values file with the program output to be read by a spreadsheet program like Excel. Excel allows you to freeze the headers or first column, (animal or color), so you can scroll through the output and still keep the categories in sight.

    The other possibility would be to load the processed data into a database like SQLite and then query the database for output.

    The output from the above program is:

    C:\Old_Data\perlp>perl t4.pl A B C bear 6 2 6 wolf 4 7 5 A B C white 1 0 1 black 1 0 1 brown 3 1 4 red 1 1 0 grey 1 5 3
Re: Consolidating biological data into occurance of each biological unit per sample in table
by 2teez (Priest) on Jan 13, 2013 at 04:00 UTC

    Hi ejohnston7,

    Hash to the rescue... Using the data given by the OP. One can do like so:

    use warnings; use strict; my %analyse; while (<DATA>) { chomp; if ( my ( $animal_type, $animal_name, $animal_color ) = m/.+?([A-Z]).+?__(.+?);.+?__(.+?)?$/ ) { $analyse{$animal_name}{$animal_type}++; $analyse{$animal_color}{$animal_name}{$animal_type}++ if defined $animal_color; } } use Data::Dump; dd \%analyse; __DATA__ occurence1 A a__bear;c__black occurence2 B a__wolf;c__grey occurence3 A a__wolf;c__white occurence4 A a__bear;c__ occurence5 C a__wolf;c__grey occurence6 C a__bear;c__brown occurence7 A a__wolf;c__ occurence8 B a__wolf;c__ occurence9 C a__bear;c__black occurence10 C a__wolf;c__ occurence11 A a__wolf;c__red occurence12 B a__wolf;c__grey occurence13 C a__wolf;c__grey occurence14 C a__wolf;c__grey occurence15 B a__bear;c__brown occurence16 C a__bear;c__brown occurence17 A a__bear;c__ occurence18 A a__bear;c__brown occurence19 C a__wolf;c__white occurence20 B a__wolf;c__grey occurence21 B a__bear;c__ occurence22 B a__wolf;c__grey occurence23 A a__wolf;c__grey occurence24 A a__bear;c__brown occurence25 C a__bear;c__brown occurence26 A a__bear;c__brown occurence27 C a__bear;c__ occurence28 C a__bear;c__brown occurence29 B a__wolf;c__red occurence30 B a__wolf;c__grey
    The output shows the two tables together. The desired output is left for the OP to make happen.
    Output from the code above:
    { bear => { A => 6, B => 2, C => 6 }, black => { bear => { A => 1, C => 1 } }, brown => { bear => { A => 3, B => 1, C => 4 } }, grey => { wolf => { A => 1, B => 5, C => 3 } }, red => { wolf => { A => 1, B => 1 } }, white => { wolf => { A => 1, C => 1 } }, wolf => { A => 4, B => 7, C => 5 }, }
    Update:
    And if you just care about just getting the color, with the frequency of type, then you can change this:
    $analyse{$animal_color}{$animal_name}{$animal_type}++ if defined $animal_color;
    to this
    $analyse{$animal_color}{$animal_type}++ if defined $animal_color;
    then your output would be:
    { bear => { A => 6, B => 2, C => 6 }, black => { A => 1, C => 1 }, brown => { A => 3, B => 1, C => 4 }, grey => { A => 1, B => 5, C => 3 }, red => { A => 1, B => 1 }, white => { A => 1, C => 1 }, wolf => { A => 4, B => 7, C => 5 }, }

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me
Re: Consolidating biological data into occurance of each biological unit per sample in table
by CountZero (Chancellor) on Jan 13, 2013 at 09:39 UTC
    One split can get you your data:
    use Modern::Perl; while (<DATA>) { chomp; my ($id, $sample_id, undef, $animal, undef, $color) = split /[\t_; +]+/; say "$id ($sample_id) is a $animal colored $color"; } __DATA__ occurence1 A a__bear;c__black occurence2 B a__wolf;c__grey occurence3 A a__wolf;c__white occurence4 A a__bear;c__ occurence5 C a__wolf;c__grey occurence6 C a__bear;c__brown
    Output:
    occurence1 (A) is a bear colored black occurence2 (B) is a wolf colored grey occurence3 (A) is a wolf colored white occurence4 (A) is a bear colored occurence5 (C) is a wolf colored grey occurence6 (C) is a bear colored brown
    Now that we have the data, why not putting it into a database such as SQLite?
    use Modern::Perl; use DBI; my $dbh = DBI->connect( "dbi:SQLite:dbname=c:/data/strawberry-perl/perl/script-chrome/animals. +sqlite", "", "" ) or die "Could not open database"; my $statement = q{CREATE TABLE IF NOT EXISTS "main"."sightings" ("id" TEXT PRIMARY KEY + NOT NULL, "sampleid" TEXT, "animal" TEXT, "color" TEXT)}; my $rows = $dbh->do($statement) or die $dbh->errstr; $dbh->{AutoCommit} = 0; $dbh->do('BEGIN TRANSACTION'); my $sth = $dbh->prepare( q{INSERT INTO sightings (id, sampleid, animal, color) VALUES (?, ? +, ?, ?)}) or die "Could not prepare INSERT statement"; while (<DATA>) { chomp; my ( $id, $sample_id, undef, $animal, undef, $color ) = split /[\t +_;]+/; $color ||= 'unknown'; $sth->execute( $id, $sample_id, $animal, $color ); } $dbh->commit;
    Now we can interrogate the database and use aggregate functions to get your answers:
    SELECT sampleid, animal, count(*) FROM sightings GROUP BY sampleid, an +imal
    will return
    "A","bear","6" "A","wolf","4" "B","bear","2" "B","wolf","7" "C","bear","6" "C","wolf","5"
    And
    SELECT sampleid, color, count(*) FROM sightings GROUP BY sampleid, col +or
    will give you
    "A","black","1" "A","brown","3" "A","grey","1" "A","red","1" "A","unknown","3" "A","white","1" "B","brown","1" "B","grey","5" "B","red","1" "B","unknown","2" "C","black","1" "C","brown","4" "C","grey","3" "C","unknown","2" "C","white","1"

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Consolidating biological data into occurance of each biological unit per sample in table
by Cristoforo (Chaplain) on Jan 14, 2013 at 16:11 UTC
    I am new to programming and have been successfully manipulating my data in perl until now.
    Hello again ejohnston7!

    Considering your remark, (above), about being new to programming, it seems like the code I gave may have had some new and unfamiliar things in it that you may not understand. I know when I started Perl, I didn't get some of the common idioms that experienced programmers in Perl had been using. In this case, the use of map, (apply a change to a list to get back a new list with the changes), and the Hash of Hash data structure and how to access it.

    Here is the comma separated value approach, (below), and I'll try to follow it up with some explanation. If you have any questions, do come back and someone will try to explain what you ask about.

    #!/usr/bin/perl use strict; use warnings; my %data; while (<DATA>) { chomp; my (undef, $sample, @fields) = split /[\t;]/; for (@fields) { my ($type, $value) = split /__/; $data{$type}{$sample}{$value}++ if $value; } } for my $type (keys %data) { my $entity = $data{$type}; my @samples = sort keys %$entity; my %seen; my @keys = grep !$seen{$_}++, map keys %$_, values %$entity; open my $fh, '>', "$type.csv" or die "Unable to create '$type.csv' +. $!"; print $fh join(",", ' ', @samples), "\n"; for my $key (@keys) { print $fh join(",", $key, map $entity->{$_}{$key} || 0, @sampl +es), "\n"; } close $fh or die "Unable to close '$type.csv'. $!"; } __DATA__ occurence1 A a__bear;c__black occurence2 B a__wolf;c__grey occurence3 A a__wolf;c__white occurence4 A a__bear;c__ occurence5 C a__wolf;c__grey occurence6 C a__bear;c__brown occurence7 A a__wolf;c__ occurence8 B a__wolf;c__ occurence9 C a__bear;c__black occurence10 C a__wolf;c__ occurence11 A a__wolf;c__red occurence12 B a__wolf;c__grey occurence13 C a__wolf;c__grey occurence14 C a__wolf;c__grey occurence15 B a__bear;c__brown occurence16 C a__bear;c__brown occurence17 A a__bear;c__ occurence18 A a__bear;c__brown occurence19 C a__wolf;c__white occurence20 B a__wolf;c__grey occurence21 B a__bear;c__ occurence22 B a__wolf;c__grey occurence23 A a__wolf;c__grey occurence24 A a__bear;c__brown occurence25 C a__bear;c__brown occurence26 A a__bear;c__brown occurence27 C a__bear;c__ occurence28 C a__bear;c__brown occurence29 B a__wolf;c__red occurence30 B a__wolf;c__grey
    Files created by the program above and readable by Excel are:
    C:\Old_Data\perlp>type a.csv ,A,B,C bear,6,2,6 wolf,4,7,5 C:\Old_Data\perlp>type c.csv ,A,B,C white,1,0,1 black,1,0,1 brown,3,1,4 red,1,1,0 grey,1,5,3

    The first thing I'd like to do is provide a picture of what the %data hash contains using Data::Dumper. (I got this by placing the statement use Data::Dumper; print Dumper \%data; right after the while loop and before the for loop. I use Data::Dumper alot to see what exactly is in a data structure I created to see if everything is allright.

    C:\Old_Data\perlp>perl t5.pl $VAR1 = { 'c' => { 'A' => { 'white' => 1, 'black' => 1, 'brown' => 3, 'red' => 1, 'grey' => 1 }, 'C' => { 'white' => 1, 'black' => 1, 'brown' => 4, 'grey' => 3 }, 'B' => { 'red' => 1, 'brown' => 1, 'grey' => 5 } }, 'a' => { 'A' => { 'bear' => 6, 'wolf' => 4 }, 'C' => { 'bear' => 6, 'wolf' => 5 }, 'B' => { 'bear' => 2, 'wolf' => 7 } } };
    I created a hash of a hash of a hash, (with this statement, $data{$type}{$sample}{$value}++ if $value;).

    $type could be 'a' or 'c', (from your sample data). $sample is 'A', 'B' or 'C' and $value would be the name of the animal or the color. (Note that the statement ends with if $value;. In your explanation of the problem, you didn't want to count values that had no name.
    occurence7    A    a__wolf;c__

    There is no color here so it wouldn't be added to the hash.

    while (<DATA>) is shorthand for while (defined $_ = <DATA>).

    chomp with no argument chomps $_ by default.

    Likewise, split without an argument operates on $_ as well, split /[\t;]/.

    In the for loop, for (@fields), each element of the array being iterated over is assigned to $_, not the same $_ from the while loop but $_ localized to the for loop. They do not clash.

    Thats just some of the explanation, but enough to help you begin to understand hopefully. I have to leave now, but ask any questions about what you don't understand.

    Hope this explains a little for you.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1013076]
Approved by 2teez
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (6)
As of 2014-04-19 11:01 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (480 votes), past polls