Re: Consolidating biological data into occurance of each biological unit per sample in table

I am new to programming and have been successfully manipulating my data in perl until now.

Hello again ejohnston7!

Considering your remark (above) about being new to programming, the code I gave earlier probably contained some new and unfamiliar things. When I started with Perl, I didn't get some of the common idioms that experienced Perl programmers use either. In this case, those are map (which applies an expression to each element of a list and returns a new list of the results) and the hash of hashes data structure, nested here to three levels, and how to access it.
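As a small illustration of map on its own, here is a sketch with a made-up list of names (not taken from your data):

```perl
use strict;
use warnings;

# Hypothetical input list, just for illustration.
my @names = ('bear', 'wolf', 'fox');

# map evaluates the block once per element, with the element in $_,
# and returns a new list of the results. @names itself is unchanged.
my @upper = map { uc $_ } @names;

print join(',', @upper), "\n";
```

The program further down uses the same idea with an expression instead of a block.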

Here is the comma-separated value approach (below), followed by some explanation. If you have any questions, do come back and someone will try to answer them.

#!/usr/bin/perl
use strict;
use warnings;

my %data;
while (<DATA>) {
    chomp;
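    # Drop the occurrence ID; keep the sample name and the type__value fields.
    my (undef, $sample, @fields) = split /[\t;]/;

    for (@fields) {
        my ($type, $value) = split /__/;
        # Count only fields that actually carry a value.
        $data{$type}{$sample}{$value}++ if $value;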
    }
}

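# Write one CSV file per type ('a', 'c', ...).
for my $type (keys %data) {
    my $entity = $data{$type};
    my @samples = sort keys %$entity;

    # Each distinct value once, across all samples (not sorted).
    my %seen;
    my @keys = grep !$seen{$_}++, map keys %$_, values %$entity;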
    
    open my $fh, '>', "$type.csv" or die "Unable to create '$type.csv'. $!";
    
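    # Header row: a blank corner cell followed by the sample names.
    print $fh join(",", ' ', @samples), "\n";
    # One row per value: its count in each sample, or 0 if absent.
    for my $key (@keys) {
        print $fh join(",", $key, map $entity->{$_}{$key} || 0, @samples), "\n";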
    }
    
    close $fh or die "Unable to close '$type.csv'. $!";
}

__DATA__
occurence1    A    a__bear;c__black
occurence2    B    a__wolf;c__grey
occurence3    A    a__wolf;c__white
occurence4    A    a__bear;c__
occurence5    C    a__wolf;c__grey
occurence6    C    a__bear;c__brown
occurence7    A    a__wolf;c__
occurence8    B    a__wolf;c__
occurence9    C    a__bear;c__black
occurence10    C    a__wolf;c__
occurence11    A    a__wolf;c__red
occurence12    B    a__wolf;c__grey
occurence13    C    a__wolf;c__grey
occurence14    C    a__wolf;c__grey
occurence15    B    a__bear;c__brown
occurence16    C    a__bear;c__brown
occurence17    A    a__bear;c__
occurence18    A    a__bear;c__brown
occurence19    C    a__wolf;c__white
occurence20    B    a__wolf;c__grey
occurence21    B    a__bear;c__
occurence22    B    a__wolf;c__grey
occurence23    A    a__wolf;c__grey
occurence24    A    a__bear;c__brown
occurence25    C    a__bear;c__brown
occurence26    A    a__bear;c__brown
occurence27    C    a__bear;c__
occurence28    C    a__bear;c__brown
occurence29    B    a__wolf;c__red
occurence30    B    a__wolf;c__grey

Files created by the program above and readable by Excel are:

C:\Old_Data\perlp>type a.csv
 ,A,B,C
bear,6,2,6
wolf,4,7,5

C:\Old_Data\perlp>type c.csv
 ,A,B,C
white,1,0,1
black,1,0,1
brown,3,1,4
red,1,1,0
grey,1,5,3

The first thing I'd like to do is show what the %data hash contains, using Data::Dumper. (I got this by placing the statements use Data::Dumper; print Dumper \%data; right after the while loop and before the for loop.) I use Data::Dumper a lot to see exactly what is in a data structure I have created and to check that everything is all right.

C:\Old_Data\perlp>perl t5.pl
$VAR1 = {
          'c' => {
                   'A' => {
                            'white' => 1,
                            'black' => 1,
                            'brown' => 3,
                            'red' => 1,
                            'grey' => 1
                          },
                   'C' => {
                            'white' => 1,
                            'black' => 1,
                            'brown' => 4,
                            'grey' => 3
                          },
                   'B' => {
                            'red' => 1,
                            'brown' => 1,
                            'grey' => 5
                          }
                 },
          'a' => {
                   'A' => {
                            'bear' => 6,
                            'wolf' => 4
                          },
                   'C' => {
                            'bear' => 6,
                            'wolf' => 5
                          },
                   'B' => {
                            'bear' => 2,
                            'wolf' => 7
                          }
                 }
        };
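If you want to try the Data::Dumper step on its own first, here is a minimal sketch using a small hand-built hash instead of your real data. Setting $Data::Dumper::Sortkeys is optional; it only makes the key order the same on every run.

```perl
use strict;
use warnings;
use Data::Dumper;

# Optional: print hash keys in sorted order so the output is repeatable.
$Data::Dumper::Sortkeys = 1;

# Hand-built stand-in for %data (hypothetical values).
my %data = (
    a => { A => { bear => 2, wolf => 1 } },
);

# Pass a reference (\%data) so Dumper shows one nested structure.
my $dump = Dumper(\%data);
print $dump;
```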

I created a hash of a hash of a hash with this statement: $data{$type}{$sample}{$value}++ if $value;
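To show how such a nested hash gets filled and read back, here is a small sketch with a few hard-coded records (made up for the example). Perl creates the inner hashes automatically the first time you assign through them.

```perl
use strict;
use warnings;

my %data;

# Each record is [type, sample, value]; hypothetical values.
for my $rec (['a', 'A', 'bear'], ['a', 'A', 'bear'], ['a', 'B', 'wolf']) {
    my ($type, $sample, $value) = @$rec;
    $data{$type}{$sample}{$value}++;   # inner hashes spring into existence
}

# Read a single count by walking the three levels of keys.
my $bears_in_A = $data{a}{A}{bear};

# List the samples recorded for type 'a'.
my @samples = sort keys %{ $data{a} };

print "bear in A: $bears_in_A\n";
print "samples: @samples\n";
```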

$type can be 'a' or 'c' (from your sample data), $sample is 'A', 'B' or 'C', and $value is the name of the animal or the color. Note that the statement ends with if $value;. In your explanation of the problem, you didn't want to count values that had no name. For example:
occurence7 A a__wolf;c__

There is no color here, so nothing is added to the hash for it.
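To see why the guard works: splitting 'c__' on /__/ gives back only the type, because split discards trailing empty fields, so $value ends up undefined and the count is skipped. A small sketch with two fields taken from the line above:

```perl
use strict;
use warnings;

my %count;
for my $field ('a__wolf', 'c__') {
    # 'c__' splits into just ('c'): split drops trailing empty fields,
    # so $value is undef for that field.
    my ($type, $value) = split /__/, $field;
    $count{$type}{$value}++ if $value;
}

my @types = sort keys %count;
print "types counted: @types\n";
```

Only type 'a' ends up in the hash; nothing at all is created for 'c'.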

while (<DATA>) is shorthand for while (defined($_ = <DATA>)).
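Here is a sketch of the long form in action. It reads from a string through an in-memory filehandle (my choice for the example, so no file is needed). The defined test is what keeps the loop going when a line is something false-looking such as "0".

```perl
use strict;
use warnings;

my $text = "first\n0\nlast\n";

# Open a read handle on a string so the example needs no file.
open my $in, '<', \$text or die "Cannot open in-memory handle: $!";

my @lines;
# Written out in full; while (<$in>) means the same thing.
while (defined($_ = <$in>)) {
    chomp;
    push @lines, $_;
}
close $in or die "Cannot close in-memory handle: $!";

print scalar(@lines), " lines read\n";
```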

chomp with no argument chomps $_ by default.

Likewise, split given only a pattern and no string to split, as in split /[\t;]/, operates on $_.
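A quick sketch of both defaults side by side. I set $_ by hand here to a line shaped like your data, with real tab characters between the first three columns:

```perl
use strict;
use warnings;

# Put a tab-separated line into $_ by hand (shaped like the sample data).
local $_ = "occurence1\tA\ta__bear;c__black\n";

chomp;                              # same as chomp($_)
my @implicit = split /[\t;]/;       # same as split /[\t;]/, $_
my @explicit = split /[\t;]/, $_;   # the spelled-out form

print join('|', @implicit), "\n";
```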

In the for loop, for (@fields), each element of the array is aliased to $_ in turn. This is not the same $_ as in the while loop: for localizes $_ for the duration of the loop and restores the outer value afterwards, so the two do not clash.
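You can check this for yourself with a sketch like the following, where I set the outer $_ by hand to stand in for the line read by the while loop:

```perl
use strict;
use warnings;

# Stand-in for the $_ that the while loop would have set.
local $_ = 'line from the while loop';

my @fields = ('a__bear', 'c__black');
my @inner;
for (@fields) {
    push @inner, $_;    # here $_ is the current element of @fields
}

# After the loop, $_ holds the outer value again.
my $after = $_;

print "inside: @inner\n";
print "after:  $after\n";
```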

That's just some of the explanation, but hopefully enough to help you begin to understand. I have to leave now, but do ask about anything you don't understand.

Hope this explains a little for you.

In Section Seekers of Perl Wisdom