Regular expressions

by lairel (Novice)
on Oct 26, 2015 at 18:42 UTC (#1146022=perlquestion)

lairel has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to grasp regular expressions and their various uses. I am working on a problem where I am given a string of letters and have to use a regular expression to find the sequences that start with ATG and end with TAG, TAA, or TGA. I am having trouble figuring out a regular expression that would search for each of these endings in a single expression. Here is what I have so far:

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;

#insert sequence
my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';

#find codons
while ($seq =~ m/ATG(.*)(TAG|TAA|TGA)/g){
    #print codons
    print $1, "\n";
}

but I am not getting the correct output, instead getting that the $1 is unspecified. Any suggestions? I would really like to understand how this sort of regular expression works. Thank you!

Replies are listed 'Best First'.
Re: Regular expressions
by stevieb (Canon) on Oct 26, 2015 at 18:51 UTC

    You're pretty close. I think the problem is that you're being 'greedy' in the regex. I've added a ? after the .* to make it non-greedy. The ?: tells the inner parens to not capture. The outer parens will do this for us.

    This example captures everything including the delimiters:
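A minimal sketch consistent with that description, not necessarily the exact code that was posted. The sub name `find_with_delimiters` is made up for illustration. The outer parentheses capture the whole stretch, `.*?` stops at the first stop codon it can, and `(?:...)` groups the three endings without adding a capture of its own.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every ATG...stop stretch, delimiters included.
sub find_with_delimiters {
    my ($seq) = @_;
    # In list context, m//g returns the contents of the capture group
    # for every match found in the string.
    return $seq =~ m/(ATG.*?(?:TAG|TAA|TGA))/g;
}

my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
print "$_\n" for find_with_delimiters($seq);
# prints:
# ATGGTTTCTCCCATCTCTCCATCGGCATAA
# ATGATCTAA
```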


    This one will capture only inside the delimiters:
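Again a sketch under the same assumptions rather than the original post's code; `find_between_delimiters` is a hypothetical name. Moving the capturing parentheses so they wrap only `.*?` leaves the ATG and the stop codon out of the result.

```perl
use strict;
use warnings;

# Hypothetical helper: returns only the text between ATG and the stop codon.
sub find_between_delimiters {
    my ($seq) = @_;
    # Only (.*?) captures; the stop-codon group is non-capturing.
    return $seq =~ m/ATG(.*?)(?:TAG|TAA|TGA)/g;
}

my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
print "$_\n" for find_between_delimiters($seq);
# prints:
# GTTTCTCCCATCTCTCCATCGGCA
# ATC
```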


      this may be a stupid question, but what is the function of the :

        As I said, the ?: makes it so the group within the () is not captured. Observe...

        Without ?::

        perl -E '"123" =~ /(2|3)/; say $1'
        2

        With ?::

        perl -E '"123" =~ /(?:2|3)/; say $1'

        Note how using the ?: doesn't put anything into the special numbered $1 variable. See perlretut's Non-Capturing Groupings

Re: Regular expressions
by kennethk (Abbot) on Oct 26, 2015 at 19:42 UTC
    First, I cannot reproduce the problem you describe. You say "that the $1 is unspecified", but when I run your posted code, I get:
    which, while it is not correct, is not unspecified. Am I misunderstanding your statement, or are you seeing something different from your code? Make sure that your examples match up to the issues you are encountering.
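For reference, a sketch of what the greedy pattern from the question should do on the posted sequence. The sub name `greedy_captures` is made up for illustration; the pattern itself is copied from the question. Because `.*` takes as much as it can and backs off only as far as needed, the single match runs from the first ATG to the last stop codon in the string.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern exactly as posted in the question.
sub greedy_captures {
    my ($seq) = @_;
    my @found;
    while ($seq =~ m/ATG(.*)(TAG|TAA|TGA)/g) {
        push @found, $1;    # $1 is what the question's loop prints
    }
    return @found;
}

my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
print "$_\n" for greedy_captures($seq);
# prints one line:
# GTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATC
```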

    If I run stevieb's solution, I get

    which would seem to meet your spec. The bigger question is what happens for nested cases? What is your expected output for
    my $seq = 'ATGATGTGATGA';
    Also, I note a reference to codons, which implies that your tests should be considering a stride of 3 rather than an arbitrary position. Does this matter for your case?
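To make the nested case concrete, here is a hedged sketch of two possible readings; which one is wanted is for the original poster to decide. The helper names are hypothetical. Both versions step through the sequence three letters at a time after the ATG. The second wraps the pattern in a lookahead, a technique not used elsewhere in this thread, so that a sequence starting inside an earlier match is reported too.

```perl
use strict;
use warnings;

my $stop = qr/(?:TAG|TAA|TGA)/;

# Hypothetical helper: minimal in-frame sequences; each match starts
# after the previous one ends, so nested starts are skipped.
sub orfs_non_overlapping {
    my ($seq) = @_;
    return $seq =~ m/(ATG(?:[ACGT]{3})*?$stop)/g;
}

# Hypothetical helper: the lookahead (?=...) consumes no characters, so the
# search resumes one position later and can find overlapping sequences.
sub orfs_overlapping {
    my ($seq) = @_;
    return $seq =~ m/(?=(ATG(?:[ACGT]{3})*?$stop))/g;
}

my $nested = 'ATGATGTGATGA';
print join(' ', orfs_non_overlapping($nested)), "\n";   # ATGATGTGA
print join(' ', orfs_overlapping($nested)), "\n";       # ATGATGTGA ATGTGA
```

The final ATG in this string (followed only by a single A) is never reported by either version, because no complete stop codon follows it.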

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      which, while it is not correct, is not unspecified.
      Well, I understand your point, but is it really incorrect? After all, this sequence is preceded by ATG and followed by TAA, so it is in a certain way correct. But it is clearly not the smallest sequence matching these criteria in the input string.

      This is to say that, while:

      DB<2> print "$1\n" while ($seq =~ m/(ATG(?:.*?)(TAG|TAA|TGA))/g);
      ATGGTTTCTCCCATCTCTCCATCGGCATAA
      ATGATCTAA
      seems to probably give a better answer, it is not completely clear whether the
      is a valid sequence or not.
      Also, I note a reference to codons, which implies that your tests should be considering a stride of 3 rather than an arbitrary position.

      This is an excellent point. For the benefit of the OP, here is one way to ensure that only codon-sequences are captured:

      #! perl
      use strict;
      use warnings;

      my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';

      # Adapted from the regex by stevieb
      my $re = qr{ (                       # capture each sequence:
                     ATG                   # - which begins with the codon ATG
                     (?: [ACGT]{3} )*?     # - followed by the smallest number of codons
                     (?: TAG | TAA | TGA ) # - and ending with the codon TAG, TAA, or TGA
                   )
                 }x;

      print "$1\n" while $seq =~ /$re/g;

      (This assumes that only minimal sequences are wanted — an assumption which should be clarified, as Laurent_R has pointed out, above.)

      Hope that helps,

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        I would have organized the code slightly differently, factoring each of the pattern elements into a separate qr// regex object and combining them together (inside a capture group) in the final m// match:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $codon = qr{ [ACGT]{3} }xms;
         my $start = qr{ ATG }xms;
         my $end   = qr{ TAG | TAA | TGA }xms;
         ;;
         my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
         ;;
         print qq{'$1'} while $seq =~ m{ ($start $codon*? $end) }xmsg;
        "
        'ATGGTTTCTCCCATCTCTCCATCGGCATAA'
        'ATGATCTAA'

        Separate qr// definitions ease maintenance and, if variable names be wisely chosen, are self-commenting. If possible, I only use capture groups in the final m// match due to the confusion that trying to count nested capture groups can produce.
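A small sketch of the counting issue, using the pattern Laurent_R showed earlier in the thread. Capture groups are numbered by the position of their opening parenthesis, and `(?:...)` groups are skipped in the count. The sub names are hypothetical, and the named-capture variant (`(?<name>...)` with the `%+` hash, available from Perl 5.10) is an alternative suggested here, not something proposed in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: numbered captures, counted by opening parenthesis.
sub numbered_captures {
    my ($seq) = @_;
    # $1 = outer group (whole sequence), $2 = inner group (stop codon);
    # the (?:.*?) group does not get a number.
    return $seq =~ m/(ATG(?:.*?)(TAG|TAA|TGA))/ ? ($1, $2) : ();
}

# Hypothetical helper: the same match with named captures, so nothing
# has to be counted.
sub named_captures {
    my ($seq) = @_;
    return $seq =~ m/(?<sequence>ATG.*?(?<stop>TAG|TAA|TGA))/
        ? ($+{sequence}, $+{stop})
        : ();
}

my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
my ($whole, $stop) = numbered_captures($seq);
print "sequence: $whole, stop codon: $stop\n";
# prints: sequence: ATGGTTTCTCCCATCTCTCCATCGGCATAA, stop codon: TAA
```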

        Give a man a fish:  <%-{-{-{-<

Re: Regular expressions
by Discipulus (Abbot) on Oct 27, 2015 at 07:58 UTC
    You got good advice and replies..

    only my little tip: since you are playing with regular expressions, you definitely need a playground; see this post to have 3 useful tools at your disposal.

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Regular expressions
by ww (Archbishop) on Oct 28, 2015 at 18:13 UTC

Node Type: perlquestion [id://1146022]
Front-paged by Corion