BioGeek has asked for the wisdom of the Perl Monks concerning the following question:

Hey monks,

Updating on some chatter yesterday in the CB: I've -again- a page I'm parsing. The parsing goes wel (with, I admit, a big ugly regex that probably could've been written much shorter), but now I want to assign a rank to each match, ie. the first match should get rank 1, the second match rank 2, ... the nth match should get rank n.

I came up with this:
my $ranking_dgp = 0; do { my $gene = $1; my $chromosone = $2; my $score_dgp = $3; $ranking_dgp = $ranking_dgp + 1; push @gene_score, [ $gene, $chromosone, $score_dgp, $ranking_d +gp ]; } while ( $text =~ m{gene\=(ENSG\d+).*?<TD>\s+(\d+|X|Y)\s+</TD>.*? +<TD>.*?<TD>.*?<TD>.*?<TD>\s(\d.\d+)\s\<\/TD\>}gsm); print Dumper @gene_score;

but it's one off: my first match gets rank 2. When I look at my output, I also see and understand why, but I can't think of a better way to do it.

$VAR1 = [ undef, undef, undef, '1' ]; $VAR2 = [ 'ENSG00000165659', '13', '0.738681', 2 ]; $VAR3 = [ 'ENSG00000184226', '13', '0.627447', 3 ];

Replies are listed 'Best First'.
Re: adding rank to a match in while loop.
by japhy (Canon) on Aug 13, 2004 at 14:05 UTC
    It looks like you're just misusing the do BLOCK while EXPR structure. I think you just need to say while (EXPR) BLOCK and you'll be fine.
    _____________________________________________________
    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: adding rank to a match in while loop.
by fletcher_the_dog (Friar) on Aug 13, 2004 at 14:40 UTC
    Your first segment will get always undefs or whatever happens to be lying around in $1,$2,$3 because the 'do' in a 'do/while' loop is always executed once before the first time the while expression is tested for a match. You probably want to change this into a normal while loop like this:
    my $ranking_dgp = 0; while ( $text =~ m{gene\=(ENSG\d+).*?<TD>\s+(\d+|X|Y)\s+</TD>.*? <TD>.*?<TD>.*?<TD>.*?<TD>\s(\d.\d+)\s\<\/TD\>}gsm){ my $gene = $1; my $chromosone = $2; my $score_dgp = $3; $ranking_dgp = $ranking_dgp + 1; push @gene_score, [ $gene, $chromosone, $score_dgp, $ranking_d +gp ]; } print Dumper @gene_score;
      Thanks, I first tried a while loop like this, but then all my matches got ranking 1. So that's the reason I begun trying other loop structures. But now it works, thanks.
Re: adding rank to a match in while loop.
by Limbic~Region (Chancellor) on Aug 13, 2004 at 14:44 UTC
    BioGeek,
    I really hate coding anything web related. I force myself to though when I get a scratch, so that I can do it when I really need to. Have a look at:
    #!/usr/bin/perl use strict; use warnings; use HTML::TableContentParser; use HTML::TokeParser::Simple; use WWW::Mechanize; use Data::Dumper; my $mech = WWW::Mechanize->new( autocheck => 1 ); $mech->get( 'http://nexus.ugent.be/jeroen/bc.html' ); my @gene_score; my $table = HTML::TableContentParser->new()->parse( $mech->content() ) +; for my $rank ( 1 .. $#{ $table->[0]{rows} } ) { my $cells = $table->[0]{rows}[$rank]{cells}; push @gene_score, { Rank => $rank, Ensembl_ID => Get_Label( $cells->[0]{data} ), RefSeq_ID => $cells->[1]{data}, Gene_Symbol => $cells->[2]{data}, Band => $cells->[3]{data}, Gene_Name => $cells->[4]{data}, Probability => $cells->[5]{data}, Known_Phenotype => $cells->[6]{data}, }; } sub Get_Label { my $link = shift; my $p = HTML::TokeParser::Simple->new( \$link ); while ( my $token = $p->get_token ) { next if ! $token->is_start_tag; return $token->return_attr( 'href' ) =~ /gene=(\w+)$/; } }
    This can obviously be made less verbose, but I wanted it to be clear where the data was coming from.

    Cheers - L~R

    Added the $table->[0]{rows}[$rank]{cells} shortcut
      Thanks, Limbic Region.

      That ugly regex in my code was written back in those days when I just started learning Perl (read: some three months ago). At that point I still had to figure out what exactly a reference was. But in the meantime I also discovered the HTML module.