http://www.perlmonks.org?node_id=786192

newbio has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Program Input:

Following are of interest: **carboxypeptidase** protein $$inhibitor$$ ( **CI** ) , **nanopeptidase** kinase $$inhibitor$$ , **NI** , and others such as , **p(57)** and **polypeptidase** protein $$inhibitor$$ ( **PI** ).

Program Output:

1. Following are of interest: **carboxypeptidase_protein_inhibitor_(CI)** , **nanopeptidase_kinase_inhibitor_(NI)** and others such as , **p(57)** and **polypeptidase_protein_inhibitor_(PI)**.

2. Following are of interest: **carboxypeptidase** protein $$inhibitor$$ ( **CI** ) , nanopeptidase kinase inhibitor , NI , and others such as , p(57) and polypeptidase protein inhibitor ( PI ).

3. Following are of interest: carboxypeptidase protein inhibitor ( CI ) , **nanopeptidase** kinase $$inhibitor$$ , **NI** , and others such as , p(57) and polypeptidase protein inhibitor ( PI ).

4. Following are of interest: carboxypeptidase protein inhibitor ( CI ) , nanopeptidase kinase inhibitor , NI , and others such as , p(57) and **polypeptidase** protein $$inhibitor$$ ( **PI** ).

While I can achieve output 1. using the regular expression substitution as shown below, I cannot figure out how output sentences 2,3 and 4 could be achieved.

if ($line =~ /\*\*([^\*]+)\*\*\s(kinase|isoform|protein|peptide|li +gand)\s\$\$([^\$]+)\$\$\s[\(\,]\s\*\*([^\*]+)\*\*\s[\)\,]/) { $line =~ s/\*\*([^\*]+)\*\*\s(kinase|isoform|protein|peptide|l +igand)\s\$\$([^\$]+)\$\$\s[\(\,]\s\*\*([^\*]+)\*\*\s[\)\,]/**$1_$2_$3 +_($4)**/g; print WF "$line\n"; }

While output sentence 1 represents the original sentence with all substitutions using the above code (there are 3 substitutions in this example although this number can vary with the sentence).

Each of the other remaining output sentences (e.g. 2,3 and 4) are the original input sentence, except that, the original pattern is retained in the sentence at the substitution location, while the tags in the sentence (i.e. ** and $$) are removed from all other places in the sentence. The number of such output sentences thus will be equal to the number of patterns substituted using the regex above (which is 3 in this example because there are 3 pattern substituted as shown in output 1.). Is there a nice way of doing this (getting outputs 2,3 and 4)?

Appreciate your help.

Thanks very much in advance.

Replies are listed 'Best First'.
Re: regex pattern match problem
by JadeNB (Chaplain) on Aug 05, 2009 at 20:57 UTC
    Ouch, those regexes hurt! Let's try some /x goodness, using repeaters in the process, noting that ( and , don't need to be escaped in a character class (indeed, a , doesn't need an escape anywhere in a regex):
    $line =~ s/ \*{2} ( [^\*]+ ) \*{2} \s ( kinase | isoform | protein | peptide | ligand ) \s \${2} ( [^\$]+ ) \${2} \s [(,] \s \*{2} ( [^\*]+ ) \*{2} \s [),] /**$1_$2_$3_($4)**/gx
    OK, now that that can be read, let's factor out the common ground:
    sub delimited { my ( $delimiter ) = @_; my $qdelimiter = quotemeta $delimiter; return qr/ $qdelimiter{2} ( [^$qdelimiter]+ ) $qdelimiter{2} /x; } sub balanced { my ( $inside ) = @_; return qr/ [(,] $inside [),] /x; } my $stars = delimited '*'; my $dollars = delimited '$'; my $words = qr/( kinase | isoform | protein | peptide | ligand )/x; my $parens = balanced qr/ \s $stars \s /x; $line =~ s/ $stars \s $words \s $dollars \s $parens /**$1_$2_$3_($4)** +/gx;
    Note that this will match fields delimited like ( field , or , field ), which you probably don't want.

    UPDATE 1: I also stripped out the if ( MATCH ) logic, because a substitution s/OLD/NEW/ is just a no-op if OLD doesn't match.
    UPDATE 2: Changed formatting and corrected a few errors in the code.
    UPDATE 3: Again.

Re: regex pattern match problem
by dwm042 (Priest) on Aug 05, 2009 at 20:09 UTC
    I have to ask: do you have to use a regular expression, or could you just split the string on ":" and "," (maybe also "and" and "."), strip the resulting pieces of all 'junk' and then reconstitute the strings as needed using printf statements?

    David M.
Re: regex pattern match problem
by Polyglot (Chaplain) on Aug 05, 2009 at 19:57 UTC
    It looks to me like you need to either preserve the original sentence and run a global substitution on it once for the sentence #1 output, and then run again with a non-global substitution for each of the other sentences, addressing each match one at a time; or else put a function in the replacement side of the substitution and do your processing within that function.

    Blessings,

    ~Polyglot~

      Yes, Polyglot, that's what I want. Could you please elaborate on your suggestion. How do I perform the non-global substitutions for their corresponding positions? It is possible that exactly the same pattern may get repeated in a sentence, so their positions need to be taken into account, possibly using substr function. Still, I cannot figure out how that works. I tried to do something like this but it does not work.

      $string=$line="Input sentence"; if ($line =~ /\*\*([^\*]+)\*\*\s(kinase|isoform|protein|peptide|li +gand)\s\$\$([^\$]+)\$\$\s[\(\,]\s\*\*([^\*]+)\*\*\s[\)\,]/) { print "yes"; while ($line =~ /(\*\*([^\*]+)\*\*\s(kinase|isoform|protein|pe +ptide|ligand)\s\$\$([^\$]+)\$\$\s[\(\,]\s\*\*([^\*]+)\*\*\s[\)\,])/g) + { $pattern=$1; $string =~ s/(\*\*([^\*]+)\*\*\s(kinase|isoform|protein|pe +ptide|ligand)\s\$\$([^\$]+)\$\$\s[\(\,]\s\*\*([^\*]+)\*\*\s[\)\,])/pq +rstuv/; $string =~ s/\*\*//g; $string =~ s/\$\$//g; $string =~ s/p +qrstuv/$pattern/; print WF "$string\n"; $string=$line; } }
Re: regex pattern match problem
by newbio (Beadle) on Aug 06, 2009 at 15:16 UTC
    Thanks a lot dear Monks for your replies. But, still no clue on how to get output 2,3 and 4 yet? Please guide.
      Basically, you might consider something like this:
      our $line='Your chemical info here...'; #BASICALLY, MAKE THE SUB DO WHATEVER YOU NEED #THEN PUT THE SUB IN THE REPLACEMENT SIDE OF #THE REGEX. my $ProcessMultiLine = sub { #START OF SUB my $templine = $line; my $temppart = ''; my $returnval = ''; while ($templine =~s/( YourMatchingRegexHere )//) { $temppart = $1; $temppart =~ s/ YourSubsMatch / YourSubsReplace /; $returnval .= $temppart; } #END WHILE return $returnval; } #END SUB $line =~ s/ MatchTheString / $ProcessMultiLine -> () /e;

      Blessings,

      ~Polyglot~