Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

compare values while value on other column remains the same

by linseyr (Acolyte)
on Sep 21, 2012 at 19:07 UTC ( #994977=perlquestion: print w/ replies, xml ) Need Help??
linseyr has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I wrote a script to compare values on two columns, but I only want to do this while the id on the 4th column remains the same. The file looks like this:
chr1 15865 15915 cg13869341 908 913 chr1 18827 18877 cg14008030 688 776 chr1 29407 29457 cg12045430 43 70 chr1 29407 29457 cg12045430 43 88 chr1 29407 29457 cg12045430 43 16 chr1 29425 29475 cg20826792 57 70 chr1 29425 29475 cg20826792 57 88 chr1 29425 29475 cg20826792 57 16 chr1 29435 29485 cg00381604 33 70 chr1 29435 29485 cg00381604 33 88 chr1 29435 29485 cg00381604 33 16 chr1 68849 68899 cg20253340 560 593 chr1 69591 69641 cg21870274 791 809 chr1 91550 91600 cg03130891 55 84
So for example I want to compare line 3,4,5 because they have the same id in column 4. My code looks like this now:
open(intersectMethyl,'<',"intersect148methyl.bed") or die $!; my @methylData = <intersectMethyl>; close(intersectMethyl); my @methylScore; my @ids; for my $line(@methylData){ my($chr,$start,$end,$id,$scoreNormal,$scoreTumor) = split("\t",$li +ne); push(@ids, $id); for ($i = 1; $i<eof; $i++){ if (@ids[$i]==@ids[$i-1] && (($scoreTumor/1000) - ($scoreNormal +/1000) >= 0.2)){ push(@methylscore, "$chr $start $end $id $scoreNor +mal $scoreTumor") } } } open(RESULTS,'>','methylDifferences.txt') or die $!; foreach(@methylscore){ print RESULTS "$_\n"; } close(RESULTS);
But this doesn't work.. Could somebody help me? Thanks!

Comment on compare values while value on other column remains the same
Select or Download Code
Re: compare values while value on other column remains the same
by choroba (Abbot) on Sep 21, 2012 at 19:39 UTC
    What output do you expect from the given sample data? I do not understand the specification.

    Note that eof returns 0 or 1 if used as a number, so $i < eof will not do what you think.

      Actually, my output would be:
      $chr $begin $end $id $average
      But I just wanted to start with calculating. I would like to calculate the average difference between the 2 columns (5 and 6) for the same id. So for line 3,4,5 I would like to calculate the average difference between column 5 and 6. And for the lines for which only one line has that id, like line 1 I just want to calculate the difference between column 5 and 6, since there is no average to calculate. I only want to output the lines that have a average difference > 0.2
Re: compare values while value on other column remains the same
by Anonymous Monk on Sep 21, 2012 at 21:11 UTC
    Something like this?
    use strict; use warnings; use List::Util qw(sum); my %table; while (defined(my $line = <DATA>)) { my @values = split(/\s+/, $line); push @{$table{$values[3]}}, \@values; } foreach my $key (sort keys %table) { my $avg_diff = sum(map { ($_->[5] / 1000) - ($_->[4] / 1000) } @{$ +table{$key}}) / @{$table{$key}}; foreach my $array_ref (@{$table{$key}}) { if ($avg_diff >= 0.02) { print join("\t", @{$array_ref}), "\n"; } } } __DATA__ chr1 15865 15915 cg13869341 908 913 chr1 18827 18877 cg14008030 688 776 chr1 29407 29457 cg12045430 43 70 chr1 29407 29457 cg12045430 43 88 chr1 29407 29457 cg12045430 43 16 chr1 29425 29475 cg20826792 57 70 chr1 29425 29475 cg20826792 57 88 chr1 29425 29475 cg20826792 57 16 chr1 29435 29485 cg00381604 33 70 chr1 29435 29485 cg00381604 33 88 chr1 29435 29485 cg00381604 33 16 chr1 68849 68899 cg20253340 560 593 chr1 69591 69641 cg21870274 791 809 chr1 91550 91600 cg03130891 55 84
      Hi, Thank you for your answer. I want to do something slightly different now, and Im trying to do it with the code you provided, but since Im all new with this, the hashes and maps are a bit difficult to me.. I want to do it at a different order, because I want to do something else with it. I already combined multiple data files into one which now looks like this:
      cg12045430 43 16 6 44 23 63 24 77 cg12045430 43 16 6 44 23 63 24 80 cg12045430 43 16 6 44 23 63 24 45 cg12045430 43 16 6 44 23 115 24 77 cg12045430 43 16 6 44 23 115 24 80 cg12045430 43 16 6 44 23 115 24 45 cg12045430 43 16 6 44 23 29 24 77 cg12045430 43 16 6 44 23 29 24 80 cg12045430 43 16 6 44 23 29 24 45
      So I have again the same id's on multiple lines and different values on columns 2-9. I want for one ID get the average of columns 2-9. So for that input file that would be:
      cg12045430 43 16 6 44 23 69 24 67
      And afterwards I want to calculate the difference between column 2,3 and 4,5 and 6,7 and 8,9 like before. I tried to do it using your code and for every key calculate the average of the map, but it didn't work. Could you give me some advise? Thanks!
Re: compare values while value on other column remains the same
by aaron_baugher (Deacon) on Sep 22, 2012 at 02:08 UTC

    In your sample, the first three fields always match when the fourth (ID) field matches. If that's always the case, you could concat those fields into your key. If it's not, you might need to save the lines into an array keyed on the ID field, and then print them out. For now, I assumed they'd be identical as in the sample, and just printed out the IDs and averages.

    I also assumed that you wanted to average the absolute differences. In other words, if one was 50 100 and the next was 100 50, those would average a difference of 50, not zero (from averaging 50 and -50). I also assumed from your code that you want to divide the numbers by 1000. When I did that, none of the averages exceeded 0.2, so I printed all greater than 0.02 so I'd have some output. Hopefully this will give you some ideas:

    #!/usr/bin/env perl use Modern::Perl; sub sum { my $t = shift; return $t unless @_; return $t + sum( @_); } my %k; while(<DATA>){ chomp; my @w = split; next unless @w == 6; push @{$k{$w[3]}}, abs($w[4]-$w[5])/1000; } for (keys %k){ my $avg = sum(@{$k{$_}})/@{$k{$_}}; say "$_ $avg" if $avg > 0.02; } __DATA__ chr1 15865 15915 cg13869341 908 913 chr1 18827 18877 cg14008030 688 776 chr1 29407 29457 cg12045430 43 70 chr1 29407 29457 cg12045430 43 88 chr1 29407 29457 cg12045430 43 16 chr1 29425 29475 cg20826792 57 70 chr1 29425 29475 cg20826792 57 88 chr1 29425 29475 cg20826792 57 16 chr1 29435 29485 cg00381604 33 70 chr1 29435 29485 cg00381604 33 88 chr1 29435 29485 cg00381604 33 16 chr1 68849 68899 cg20253340 560 593 chr1 69591 69641 cg21870274 791 809 chr1 91550 91600 cg03130891 55 84

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Re: compare values while value on other column remains the same
by CountZero (Bishop) on Sep 22, 2012 at 12:16 UTC
    I turned your data into a tab-separated CSV file and added a line of headers (fieldnames). Then I could use DBI to attack your problem with SQL.
    use Modern::Perl; use DBI; my $dbh = DBI->connect( "dbi:CSV:", "", "", { f_dir => './DB', csv_sep_char => "\t", } ) or die "Cannot connect: $DBI::errstr"; my $sql = 'SELECT id, count(*) AS number, sum(tumor) AS tumorscore, sum(normal) +AS normalscore, sum(tumor - normal) AS difference, avg(tumor - normal +) AS averagedifference FROM methyl WHERE (tumor - normal) <=20 or (tu +mor - normal) >=20 GROUP BY id ORDER BY id'; my $sth = $dbh->prepare($sql); $sth->execute; while ( my $row = $sth->fetchrow_hashref ) { say "id: $row->{'id'}, Count: $row->{'number'}, Score tumor: $row->{'tumor +score'}, Score normal: $row->{'normalscore'}, Difference: $row->{'dif +ference'}, Average Difference: $row->{'averagedifference'}"; }
    Results:
    id: cg00381604, Count: 3, Score tumor: 99, Score normal: 174, Differen +ce: -75, Average Difference: -25 id: cg03130891, Count: 1, Score tumor: 55, Score normal: 84, Differenc +e: -29, Average Difference: -29 id: cg12045430, Count: 3, Score tumor: 129, Score normal: 174, Differe +nce: -45, Average Difference: -15 id: cg13869341, Count: 1, Score tumor: 908, Score normal: 913, Differe +nce: -5, Average Difference: -5 id: cg14008030, Count: 1, Score tumor: 688, Score normal: 776, Differe +nce: -88, Average Difference: -88 id: cg20253340, Count: 1, Score tumor: 560, Score normal: 593, Differe +nce: -33, Average Difference: -33 id: cg20826792, Count: 3, Score tumor: 171, Score normal: 174, Differe +nce: -3, Average Difference: -1 id: cg21870274, Count: 1, Score tumor: 791, Score normal: 809, Differe +nce: -18, Average Difference: -18
    To exclude as much as possible any rounding-of errors I did not divide the figures by 1000.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://994977]
Approved by bulk88
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (5)
As of 2014-12-21 15:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (106 votes), past polls