PerlMonks  

Statistics on Tab Delimited File

by Paragod28 (Novice)
on Dec 10, 2009 at 20:06 UTC ([id://812285]=perlquestion)

Paragod28 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to compute statistics on certain columns of a tab-delimited file. Here is an example:
contig1 test1 1e-28 28 55
contig1 test2 1e-10 22 54
contig2 test1 1e-10 24 78
contig3 test2 10 78 57
contig4 test3 1e-5 200 55
contig4 test2 10 100 43

I want to parse the file using column 2 as the main key and find the median of columns 3, 4, and 5 across the rows that match each column 2 entry. So for each value in column 2 I need to find all matching rows in the file and store their frequency, along with the median values for columns 3, 4, and 5. I have attempted to use hashes, but that just confused me. Any help would be appreciated. Here is my code for frequency and median, but I need to combine them with the rest of the data so I can output the result back to a tab-delimited file.

use strict;
use warnings;
use Cwd;
use List::Compare;
use List::AllUtils qw(:all);

my $ref_filelist = $ARGV[0];
my ($lable1, $line, $i, $contig, $accession, $organism, $eval, $con_length, $map_length);
open(FILELIST, $ref_filelist )
    or die "Could not open Reference filelist...($!)";
my %count;
while (<FILELIST>){
    my @tab_list =( $contig, $accession, $organism, $eval, $con_length, $map_length ) = split ( '\t',);
    ++$count{$organism};
}
foreach (keys(%count)) {
    print "$_: $count{$_}\n";

#####Sub not used in code yet#####
sub medianeval{
    my $ref_filelist_1 = $ARGV[0];
    my ($lable1, $line, $i, $contig, $accession, $organism, $eval, $con_length, $map_length);
    open(FILELIST1, $ref_filelist_1 )
        or die "Could not open Reference filelist...($!)";
    my %table;
    while ($line = <FILELIST1>) {
        chomp $line;
        my @tab_list = ( $contig, $accession, $organism, $eval, $con_length, $map_length ) = split ( '\t', $line );
        $table{$organism} = [] unless exists $table{$organism};
        push @{$table{$organism}}, $eval;
    }
    foreach $organism (sort keys %table) {
        print "$organism: ";
        my @eval = @{$table{$organism}};
        my @eval_ref = sort {$a <=> $b } @eval;
        my $median = $eval_ref[($#eval_ref / 2)];
        print "$median\n";
    }
}
I know this is a mess! I am still learning. Thanks

Replies are listed 'Best First'.
Re: Statistics on Tab Delimited File
by toolic (Bishop) on Dec 10, 2009 at 20:51 UTC
    The following will calculate and print out the median value of each column (3-5), for each 'test' (column 2). This loops through your input file once (it looks like you loop twice). If this is not what you are looking for, please also include your expected output. The median function is from Acme::Tools. See also perldsc.
    use strict;
    use warnings;
    use Acme::Tools;

    my %data;
    while (<DATA>) {
        my ($test, @vals) = (split)[1..4];
        my $col = 3;
        for my $val (@vals) {
            push @{ $data{$test}{$col} }, $val;
            $col++;
        }
    }

    #use Data::Dumper; print Dumper(\%data);

    for my $test (sort keys %data) {
        for my $col (sort keys %{ $data{$test} }) {
            my $med = median(@{ $data{$test}{$col} });
            print "test=$test, col=$col, med=$med\n";
        }
    }

    __DATA__
    contig1 test1 1e-28 28 55
    contig1 test2 1e-10 22 54
    contig2 test1 1e-10 24 78
    contig3 test2 10 78 57
    contig4 test3 1e-5 200 55
    contig4 test2 10 100 43
    Prints out:
    test=test1, col=3, med=5e-11
    test=test1, col=4, med=26
    test=test1, col=5, med=66.5
    test=test2, col=3, med=10
    test=test2, col=4, med=78
    test=test2, col=5, med=54
    test=test3, col=3, med=1e-5
    test=test3, col=4, med=200
    test=test3, col=5, med=55
    I know this is a mess!
    Let perltidy clean it up for you. It also points out that your code has compile errors. You probably didn't intend to place the medianeval sub inside that foreach loop.

      Thank you so much! I have been working on a solution all week. That is near what I was looking for but I did not explain it well. That is my fault.

      __DATA__
      Contig Organism Eval Length MappedLength
      contig1 test1 1e-28 28 55
      contig1 test2 1e-10 22 54
      contig2 test1 1e-10 24 78
      contig3 test2 10 78 57
      contig4 test3 1e-5 200 55
      contig4 test2 10 100 43
      I am trying for this output (math may not be correct for median but frequency is correct):
      Organism Frequency EvalMedian LengthMedian MappedMedian
      test2 3 5 38 47
      test1 2 1e-10 24 54
      test3 1 1e-5 200 55

      The "Frequency" is how many times I see the organism in the file. When an organism appears multiple times, I take all of its values and find the median of each column for that particular match (test1, test2, etc.). If the "Organism" appears only once, the median values are the same as the values found.

      I see that I did not get column one[0] in order but that does not matter for the final output. "test1" will actually be long scientific names.

      Thanks
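A minimal sketch of how the table requested above might be produced, building on the hash-of-arrays idea from toolic's reply. The `summarize` helper, the hash layout, and the in-memory `@rows` list are assumptions made for illustration, not code from the thread. The `median` sub here is hand-written so the example runs without Acme::Tools.

```perl
use strict;
use warnings;

# Median of a list of numbers; averages the two middle values
# when the list has an even number of items.
sub median {
    my @sorted = sort { $a <=> $b } @_;
    return unless @sorted;
    my $mid = int( @sorted / 2 );
    return @sorted % 2
        ? $sorted[$mid]
        : ( $sorted[ $mid - 1 ] + $sorted[$mid] ) / 2;
}

# Hypothetical helper: group tab-delimited rows by organism (column 2),
# counting rows and collecting the three numeric columns per organism.
sub summarize {
    my @rows = @_;
    my %stats;
    for my $row (@rows) {
        chomp $row;
        my ( undef, $organism, @vals ) = split /\t/, $row;
        $stats{$organism}{count}++;
        push @{ $stats{$organism}{cols}[$_] }, $vals[$_] for 0 .. 2;
    }
    return \%stats;
}

# Sample rows from the thread, held in memory for the sake of the example.
my @rows = (
    "Contig\tOrganism\tEval\tLength\tMappedLength",
    "contig1\ttest1\t1e-28\t28\t55",
    "contig1\ttest2\t1e-10\t22\t54",
    "contig2\ttest1\t1e-10\t24\t78",
    "contig3\ttest2\t10\t78\t57",
    "contig4\ttest3\t1e-5\t200\t55",
    "contig4\ttest2\t10\t100\t43",
);
shift @rows;    # drop the header line
my $stats = summarize(@rows);

# Most frequent organism first; ties broken alphabetically.
my @order = sort {
    $stats->{$b}{count} <=> $stats->{$a}{count} || $a cmp $b
} keys %$stats;

print join( "\t", qw(Organism Frequency EvalMedian LengthMedian MappedMedian) ), "\n";
for my $organism (@order) {
    my @medians = map { median(@$_) } @{ $stats->{$organism}{cols} };
    print join( "\t", $organism, $stats->{$organism}{count}, @medians ), "\n";
}
```

For a real file, the rows would come from a filehandle instead of `@rows`. The frequencies match the requested output; the medians follow the standard definition, so they differ from the sample figures, which the post itself says may not be correct.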
Re: Statistics on Tab Delimited File
by eye (Chaplain) on Dec 10, 2009 at 22:31 UTC
    Using libraries (like Acme::Tools) is not only efficient for coding, it can be a good way to gain the benefit of technical expertise you may not have. I mention this because the OP's code to compute median is fundamentally flawed (median is computed differently for even and odd numbers of items).

    For more details on median, see MathWorld.
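To illustrate the even/odd distinction, here is a small sketch comparing the OP's indexing approach with the conventional median. The sub names `lower_middle` and `median` are chosen for this example only.

```perl
use strict;
use warnings;

# The OP's approach: take the element at index $#sorted / 2.
# Perl truncates a fractional array index, so an even-sized list
# yields the lower of the two middle values.
sub lower_middle {
    my @sorted = sort { $a <=> $b } @_;
    return $sorted[ $#sorted / 2 ];
}

# Conventional median: the middle value for an odd-sized list,
# the mean of the two middle values for an even-sized list.
sub median {
    my @sorted = sort { $a <=> $b } @_;
    return unless @sorted;
    my $mid = int( @sorted / 2 );
    return @sorted % 2
        ? $sorted[$mid]
        : ( $sorted[ $mid - 1 ] + $sorted[$mid] ) / 2;
}

for my $list ( [ 3, 1, 2 ], [ 1, 2, 3, 4 ] ) {
    printf "list (%s): lower_middle=%s median=%s\n",
        join( ',', @$list ), lower_middle(@$list), median(@$list);
}
```

The two subs agree on the odd-sized list and disagree on the even-sized one, where the indexing approach returns 2 and the conventional median is 2.5.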

      Yes, I know the median routine is flawed. I was not receiving appropriate outputs for values such as 1e-33, so I decided to sort the list, cut it in half, and take that number. It is not accurate but close enough for my purposes. I am just trying to get an idea of the organism distribution in sequence samples. If anyone knows a way to output the median of values in scientific notation (e-values), it would greatly help. Thanks again.
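On the scientific-notation question: Perl converts strings such as `1e-33` to numbers whenever they are used in numeric context, so a numeric sort orders e-values correctly. The thread does not say what went wrong in the earlier attempt; one possible cause (an assumption) is a default string sort, which this sketch contrasts with a numeric one. The sample values are made up for illustration.

```perl
use strict;
use warnings;

# Illustrative e-values, stored as strings the way split() returns them.
my @evals = qw(1e-5 10 1e-28 1e-10 3e-33);

# Default sort compares as strings, which misorders numbers.
my @as_strings = sort @evals;

# The <=> operator compares numerically; "1e-28" is read as 1e-28.
my @as_numbers = sort { $a <=> $b } @evals;

print "string sort:  @as_strings\n";
print "numeric sort: @as_numbers\n";

# With an odd number of items, the median is the middle element
# of the numerically sorted list.
my $median = $as_numbers[ int( @as_numbers / 2 ) ];
print "median: $median\n";
```

Because the sort only compares the values and does not modify them, the median prints in the same notation it had in the input.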
