BioGeek has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I've been struggling with this one for some time now, but I can't seem to get it to work. I wan't to convert a list of rankings into a list of rankratio's. I have stored into a MySQL database the following information: gene names, and rankings for three different techniques.

+-----------------+-------------+---------------+-------------+ | gene_name | g2d_ranking | pocus_ranking | dgp_ranking | +-----------------+-------------+---------------+-------------+ | ENSG00000179050 | | | 4 | | ENSG00000124183 | 237 | | 2 | | ENSG00000179033 | | | 7 | | ENSG00000124181 | | 13 | 1 | | ENSG00000174306 | | | 5 | | ENSG00000132793 | | | 8 | | ENSG00000183798 | | | 6 | | ENSG00000124177 | | | 3 | | ENSG00000087530 | 153 | | | | ENSG00000185513 | 258 | | | | ENSG00000101052 | 231 | | | | ENSG00000124249 | 154 | 7 | | | ENSG00000101098 | 159 | | | | ENSG00000166913 | 141 | | | +-----------------+-------------+---------------+-------------+
Step 1: for all the collumns, the lowest value should get assigned rank 1, the second lowest value 2, etc.
Step 2: count the number of rankings in each column (is eaqual to 'find the ranking with the highest value').
+-----------------+-------------+---------------+-------------+ | gene_name | g2d_ranking | pocus_ranking | dgp_ranking | +-----------------+-------------+---------------+-------------+ | ENSG00000179050 | | | 4 | | ENSG00000124183 | 6 | | 2 | | ENSG00000179033 | | | 7 | | ENSG00000124181 | | 2 | 1 | | ENSG00000174306 | | | 5 | | ENSG00000132793 | | | 8 | | ENSG00000183798 | | | 6 | | ENSG00000124177 | | | 3 | | ENSG00000087530 | 2 | | | | ENSG00000185513 | 7 | | | | ENSG00000101052 | 5 | | | | ENSG00000124249 | 3 | 1 | | | ENSG00000101098 | 4 | | | | ENSG00000166913 | 1 | | | +-----------------+-------------+---------------+-------------+ count: 7 2 8
Step 3: for each cell, divide the ranking by the 'count of rankings for that column', as to get rankratio's.
+-----------------+-------------+---------------+-------------+ | gene_name | g2d_ranking | pocus_ranking | dgp_ranking | +-----------------+-------------+---------------+-------------+ | ENSG00000179050 | | | 0.500 | | ENSG00000124183 | 0.857 | | 0.250 | | ENSG00000179033 | | | 0.875 | | ENSG00000124181 | | 1.000 | 0.125 | | ENSG00000174306 | | | 0.625 | | ENSG00000132793 | | | 1.000 | | ENSG00000183798 | | | 0.750 | | ENSG00000124177 | | | 0.375 | | ENSG00000087530 | 0.285 | | | | ENSG00000185513 | 1.000 | | | | ENSG00000101052 | 0.714 | | | | ENSG00000124249 | 0.428 | 0.500 | | | ENSG00000101098 | 0.714 | | | | ENSG00000166913 | 0.142 | | | +-----------------+-------------+---------------+-------------+


I've various code snippets lying around that do some part of the problem, but can't manage to get them into one working construct. Either it's a "illegal division by zero problem" (a column can have no elements), or then I get a "Use of uninitialized value" warning because I try to divide an undefined value.

So, I wanted to get some fresh insigths, and see how the other monasterians would tackle this problem. Hope you can put me in the rigth direction.

Update: Fixed typo (s/1.857/1.000/).

Replies are listed 'Best First'.
Re: converting list of rankings into list of rankratio's.
by johnnywang (Priest) on Aug 16, 2004 at 03:20 UTC
    via DBI you can easily get a list of original rankings for a particular column. The following code gives a mapping of the original ranking to the new ratio ranking (by the way, your result in the g2d_ranking column doesn't seem to be correct: there is a ratio of 1.857)
    use strict; my @old = (237,153,258,231,154,159,141); my %unique = map {$_ =>1} @old; #just in case the old rankins are not +unique my @ordered = sort keys %unique; my %map = (); for my $i (0..$#ordered){ $map{$ordered[$i]} = $i+1; } foreach my $i (@old){ print $i, "=>",$map{$i}/scalar(@ordered),"\n"; } __END__ 237=>0.857142857142857 153=>0.285714285714286 258=>1 231=>0.714285714285714 154=>0.428571428571429 159=>0.571428571428571 141=>0.142857142857143
Re: converting list of rankings into list of rankratio's.
by Zaxo (Archbishop) on Aug 16, 2004 at 03:29 UTC

    What data structure do you have the data in? That will determine the details of how to do this.

    Lets suppose you can get a column as a list (not necessarily an array). You can grep to remove db NULL's, which are undef in perl, and use references to the data to do something like this.

    # $data is a hashref, a standin for the data struct you have my @ranks = \(grep {defined} values %$data); my $idx = 0; $$_ = ++$idx for sort {$$a <=> $$b} @ranks; # step 1 & 2 $$_ /= $idx for @ranks; # step 3
    The values are changed in place in the original data structure because we work with references to those locations. Division by zero is avoided because for over an empty list never does a division.

    After Compline,
    Zaxo

Re: converting list of rankings into list of rankratio's.
by BrowserUk (Pope) on Aug 16, 2004 at 03:53 UTC

    If you wanted them back into their original order, you have another step to make.

    #! perl -slw use strict; my @data = map{ chomp; [ split '\s*\|\s*', $_ ] } <DATA>; my @counts; for my $field ( 1 .. 3 ) { my $rank = 0; @data = map { $_->[ $field ] and $_->[ $field ] = ++$rank; $_; } sort { ( $a->[ $field ]||0 ) <=> ( $b->[ $field ]||0 ) } @data; $counts[ $field ] = $rank; } for my $data ( @data ) { map { $data->[ $_ ] and $data->[ $_ ] = sprintf '%5.3f', $data->[ $_ ] /= $counts[ $_ ]; } 1 .. 3; } print join "\t", @$_ for @data; __DATA__ ENSG00000179050 | | | 4 ENSG00000124183 | 237 | | 2 ENSG00000179033 | | | 7 ENSG00000124181 | | 13 | 1 ENSG00000174306 | | | 5 ENSG00000132793 | | | 8 ENSG00000183798 | | | 6 ENSG00000124177 | | | 3 ENSG00000087530 | 153 | | ENSG00000185513 | 258 | | ENSG00000101052 | 231 | | ENSG00000124249 | 154 | 7 | ENSG00000101098 | 159 | | ENSG00000166913 | 141 | | __OUTPUT__ P:\test>383173 ENSG00000166913 0.143 ENSG00000087530 0.286 ENSG00000101098 0.571 ENSG00000101052 0.714 ENSG00000185513 1.000 ENSG00000124249 0.429 0.500 ENSG00000124181 1.000 0.125 ENSG00000124183 0.857 0.250 ENSG00000124177 0.375 ENSG00000179050 0.500 ENSG00000174306 0.625 ENSG00000183798 0.750 ENSG00000179033 0.875 ENSG00000132793 1.000

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: converting list of rankings into list of rankratio's.
by jdporter (Canon) on Aug 16, 2004 at 03:17 UTC
    I'm not sure, but my module, Statistics::SGT (Simple Good-Turing algorithm), might be applicable here.
      I'm going to feed the output to a recursive formula, used in order statistics. (Therefore the rankings: the rank of the values is important for me)

      Update:Upon further reading about your module, it does seem quite interessting. Will give it a try.
Re: converting list of rankings into list of rankratio's.
by hv (Parson) on Aug 16, 2004 at 10:54 UTC

    I'd do it something like this:

    my @genes; # I assume these come from somewhere for my $style (qw/ g2d pocus dgp /) { my($in, $out) = ($style . "_ranking", $style . "_rankratio"); # step 1: locate and sort the relevant genes for this ranking styl +e my @set = sort { $a->{$in} <=> $b->{$in} } grep $_->{$in}, @genes; # step 2: get a count my $total = @set; # step 3: update the located genes with a rankratio for my $index (0 .. $#set) { $set[$index]{$out} = ($index + 1) / $total; } }

    Notes: I select the genes that have a particular ranking specified by checking $_->{$in} in a boolean context, so that it'll be false for zero or undef without warnings; I never store the intermediate re-based ranking as it is implicit in the order of @set; I only loop over the selected genes to update them, so I won't try to update an undef value, nor (since there would be zero loop iterations) will this try to divide by zero.

    This code is untested.

    Hugo