Hey Monks, I have a page with the following format:
It is a list of (biological) "candidates".
For each candidate, a bunch of information is given, such as the protein name (in square brackets), an R-score and a GO-score.
For each candidate, I'm interested in its R-score and protein name. The tricky part is the variable number of occurrences of the protein name.

The information can have one of the following formats.

1) R-score and protein name.
DNA (Note) PROTEIN: <A HREF = http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&dopt=GenPept&list_uids=5729915>NP_006601</A>
R-score = 0.002033; GO-score = 0.000618

So I want to fetch "NP_006601" and "0.002033".

2) R-score and more than one protein name.
DNA (Note) PROTEIN: <A HREF = http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&dopt=GenPept&list_uids=12738831>NP_075389</A> <A HREF = http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&dopt=GenPept&list_uids=12738834>NP_075390</A>
R-score = 0.006971; GO-score = 0.000458

So I want to fetch "NP_075389", "NP_075390" and "0.006971".

3) R-score and no protein name.
DNA (Note)
R-score = 0.001743; GO-score = 0.000618

So I want to fetch "0.001743".

The code I wrote fetches the R-scores, but I can't come up with code that matches all possible cases of the protein name.
use strict;
use warnings;
use LWP::UserAgent;

my $R_score;
my @protein;           # not filled yet: this is the part I'm stuck on
my @R_score_protein;

my $browser = LWP::UserAgent->new();
my $resp    = $browser->get('http://www.bork.embl-heidelberg.de/g2d/list_hits_disease.pl?U57042:Inflammatory_bowel_disease_7');
die "Request failed: ", $resp->status_line, "\n" unless $resp->is_success;
my $content_all = $resp->content();

# In the page source "R-score" is link text, hence the </A> before " = ".
while ( $content_all =~ m{R-score</A>\s=\s(\d\.\d+);}g ) {
    $R_score = $1;
    push( @R_score_protein, $R_score );
}
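One possible way to cope with the variable number of protein names, sketched on the three simplified samples above rather than on the live page: treat everything between two consecutive R-scores as one candidate, then run a second, global match in list context over that chunk to collect zero, one or many names. The sub name `parse_candidates`, the hash keys, and the idea that protein links can be recognised by `db=Protein` in their HREF are assumptions made for illustration; the real page source may need a different anchor pattern.

```perl
use strict;
use warnings;

# Simplified sample text copied from the three cases in the post.
# This is NOT the full HTML of the live page.
my $content = <<'END';
DNA (Note) PROTEIN: <A HREF = http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&dopt=GenPept&list_uids=5729915>NP_006601</A>
R-score = 0.002033; GO-score = 0.000618
DNA (Note) PROTEIN: <A HREF = http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&dopt=GenPept&list_uids=12738831>NP_075389</A> <A HREF = http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&dopt=GenPept&list_uids=12738834>NP_075390</A>
R-score = 0.006971; GO-score = 0.000458
DNA (Note)
R-score = 0.001743; GO-score = 0.000618
END

# Hypothetical helper: returns one hashref per candidate.
sub parse_candidates {
    my ($text) = @_;
    my @candidates;

    # $1 = everything since the previous R-score, $2 = this R-score.
    # The optional </A> covers pages where "R-score" is link text.
    while ( $text =~ m{(.*?)R-score(?:</A>)?\s*=\s*(\d+\.\d+)}gs ) {
        my ( $chunk, $r_score ) = ( $1, $2 );

        # A /g match in list context returns every capture,
        # or an empty list when the chunk has no protein link.
        my @proteins = $chunk =~ m{<A\s+HREF[^>]*db=Protein[^>]*>([^<]+)</A>}gi;

        push @candidates, { r_score => $r_score, proteins => \@proteins };
    }
    return @candidates;
}

for my $c ( parse_candidates($content) ) {
    my @names = @{ $c->{proteins} };
    print "$c->{r_score}: ", ( @names ? join( ', ', @names ) : '(no protein)' ), "\n";
}
```

On the sample text this prints one line per candidate: the first with NP_006601, the second with both NP_075389 and NP_075390, and the third with "(no protein)". Keeping the names in an array reference per candidate means the three cases need no special handling later on.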
Update: To improve readability, I've removed the relevant HTML snippets and placed them on my scratchpad (BioGeek's scratchpad).

In reply to regex with variable input by BioGeek