PerlMonks  

Simple regex question. Grouping with a negative lookahead assertion.

by Anonymous Monk
on Jul 14, 2013 at 01:21 UTC (#1044183)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi.

In my code:

    #!/usr/bin/perl -w
    use strict;

    my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';

    if ($dna =~ /atg([acgt]+)(?!(taa|tag|tga))/xms) {
        print $1;
    }


I am trying to capture the very short string after the 'atg' and before (not including) either 'taa', 'tag', or 'tga'. I think I'm close but my regex is not working as expected.

Thank you monks and monkettes.

Replies are listed 'Best First'.
Re: Simple regex question. Grouping with a negative lookahead assertion.
by BrowserUk (Pope) on Jul 14, 2013 at 01:39 UTC

    Like this?

    [0] Perl> $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';;
    [0] Perl> print $1 while $dna =~ m[atg(.+?)(?=taa|tag|tga)]g;;
    gga
    gcgccccggc

    If so, the difference is the use of the non-greedy match quantifier +?
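
    A minimal sketch of that difference, run against the string from the question. The array names are made up for illustration, and list-context matching with /g is used here in place of the while loop above.

```perl
use strict;
use warnings;

my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';

# Greedy .+ runs to the end of the string and backtracks only as far
# as the LAST stop codon, so a single long capture comes back.
my @greedy = $dna =~ /atg(.+)(?=taa|tag|tga)/g;

# Non-greedy .+? stops at the FIRST stop codon after each 'atg'.
my @lazy = $dna =~ /atg(.+?)(?=taa|tag|tga)/g;

print "greedy: @greedy\n";
print "lazy:   @lazy\n";
```

    The greedy form yields one capture that swallows the first stop codon; the non-greedy form yields the two short sub-sequences shown above.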


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      That's very close. I was also trying to prevent any characters that were not in the following character class, [acgt], from being included in the match.

      Thanks for the quick response.
        I was also trying to prevent any characters that were not in the following character class, [acgt], from being included in the match.

        Is that a possibility? If so, then substitute that for . in my regex. (S'not rocket science.)

        But, if it is a possibility, then you could (should) have included a non-acgt character in your example.

        And if the example you provided is realistic, then using [acgt] is redundant, because your example consists entirely of those characters.
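
        To illustrate when the substitution matters, here is a small sketch. The input string is hypothetical: it contains an 'n', which does not occur in the sample from the question.

```perl
use strict;
use warnings;

# Hypothetical input: the 'n' is an assumption added for illustration;
# the thread's sample contains only a, c, g and t.
my $dna = 'atgggntaaatgcctga';

my @with_dot   = $dna =~ /atg(.+?)(?=taa|tag|tga)/g;
my @with_class = $dna =~ /atg([acgt]+?)(?=taa|tag|tga)/g;

print "dot:   @with_dot\n";     # the 'n' ends up inside the first capture
print "class: @with_class\n";   # the candidate containing 'n' is rejected
```

        With the character class, the first 'atg' no longer produces a match at all, because the run of [acgt] is interrupted before any stop codon is reached.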


        I just modified and tested the code.

        The minor modification will work. I should have seen the necessary addition of the non-greedy mode quantifier.

        Thanks again.
Re: Simple regex question. Grouping with a negative lookahead assertion.
by kcott (Chancellor) on Jul 14, 2013 at 07:18 UTC

    One of your main problems here is deciding that you needed a look-ahead assertion: you don't. (See Look-Around Assertions in perlre - Extended Patterns for details.)

    It's useful to show actual and expected output. Here's what I get:

    $ perl -Mstrict -Mwarnings -E '
        my $dna = q{atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt};
        if ($dna =~ /atg([acgt]+)(?!(taa|tag|tga))/xms) {
            say $1;
        }
    '
    ggataaaaatataggctataaatggcgccccggctaattttt

    So, that skips everything until 'atg' is found; after that, as many as possible of [acgt] are captured as long as your rule (must not be followed by taa|tag|tga) is adhered to. The end of the $dna string is not "followed by taa|tag|tga" so the successful match ends there.
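
    One way to confirm this is to look at where the match ends. The sketch below uses the built-in @+ array; the variable name $match_end is only for illustration.

```perl
use strict;
use warnings;

my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';

# $+[0] holds the offset at which the whole match ended. The lookahead
# is zero-width, so it adds nothing to that offset.
my $match_end = $dna =~ /atg([acgt]+)(?!(taa|tag|tga))/xms ? $+[0] : undef;

printf "match ends at offset %d; string length is %d\n",
    $match_end, length $dna;
```

    The two numbers are equal: the greedy class consumed everything up to the end of the string, and the negative look-ahead was trivially satisfied there.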

    What you really want to do is stop capturing when taa|tag|tga is found. That would be:

    $ perl -Mstrict -Mwarnings -E '
        my $dna = q{atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt};
        if ($dna =~ /atg([acgt]+)(?:taa|tag|tga)/xms) {
            say $1;
        }
    '
    ggataaaaatataggctataaatggcgccccggc

    So now, as many as possible of [acgt] are captured until taa|tag|tga is found.

    Furthermore, it looks like you want "as few as possible of [acgt]" instead of "as many as possible of [acgt]":

    $ perl -Mstrict -Mwarnings -E '
        my $dna = q{atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt};
        if ($dna =~ /atg([acgt]+?)(?:taa|tag|tga)/xms) {
            say $1;
        }
    '
    gga

    You can clean that up by replacing [acgt] with . (you only have those four letters in $dna and, indeed, in DNA) and removing the three modifiers xms which you make no use of.

    $ perl -Mstrict -Mwarnings -E '
        my $dna = q{atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt};
        if ($dna =~ /atg(.+?)(?:taa|tag|tga)/) {
            say $1;
        }
    '
    gga

    I note that the modifiers xms are written in the same (alphabetically) unordered way as they appear throughout Perl Best Practices (PBP). So, either you've just copied those from somewhere else and don't know what they mean (see perlre - Modifiers) or you're required to follow PBP. If the latter, you should use warnings (see also -w in perlrun) and the regular expression would be better as:

    /atg (.+?) (?>taa|tag|tga)/msx

    (?>pattern) is also explained in perlre - Extended Patterns.

    Finally, in your real code, unless you're ensuring that $dna is always lowercase (e.g. by using lc), you should also add the i modifier (also described in perlre - Modifiers).
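
    The two options can be sketched side by side. The mixed-case input below is invented for illustration and does not come from the question.

```perl
use strict;
use warnings;

# Hypothetical mixed-case input, invented for illustration.
my $raw = 'ccATGgGaTAAcc';

# Option 1: normalise once with lc, then match case-sensitively.
my ($from_lc) = lc($raw) =~ /atg(.+?)(?:taa|tag|tga)/;

# Option 2: leave the data alone and add the i modifier.
my ($from_i) = $raw =~ /atg(.+?)(?:taa|tag|tga)/i;

print "lc: $from_lc\n";   # capture comes from the lowercased copy
print "/i: $from_i\n";    # capture keeps the original casing
```

    Note that the two captures differ in case, which may matter if the result is compared against other strings later.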

    -- Ken

      In general, I agree with the points in your reply, but a coupla quibbles...

      ... the modifiers xms ... would be better as:
      /atg (.+?) (?>taa|tag|tga)/msx
      [emphasis added]

      Why better? The modifiers are given in alphabetic order in a  Regexp object stringization, but what is the advantage of keeping that order? I thought TheDamian used the  //xms ordering throughout PBP simply because it happened to be the order in which those modifiers were introduced and discussed in the regex section of the book, not because of any inherent advantage. Is their order not irrelevant to compilation/execution?

      Also, what is the advantage of using atomic grouping for the  (?>taa|tag|tga) stop codon (if that's the right terminology) sub-pattern? My understanding is that the primary (maybe the only?) purpose of atomic grouping is to defeat backtracking in situations in which the programmer knows backtracking will impair performance. In the example regex, once the stop codon pattern matches, the overall match succeeds; there is never any backtracking to defeat. (This is already discussed in part here, but I still don't see any advantage.)
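
      A small sketch of that point, assuming nothing beyond core Perl. The first pair of matches uses a toy string (not from the thread) where atomic grouping does change the outcome; the second pair uses the thread's pattern, where it does not.

```perl
use strict;
use warnings;

# Where atomic grouping changes the outcome: (?>a+) keeps every 'a' it
# consumed, so the literal 'ab' that follows can never match.
my $plain_matches  = 'aaab' =~ /a+ab/     ? 1 : 0;
my $atomic_matches = 'aaab' =~ /(?>a+)ab/ ? 1 : 0;
print "a+ab: $plain_matches, (?>a+)ab: $atomic_matches\n";

# Where it changes nothing: the stop-codon alternation is the last thing
# in the pattern, so once it matches the whole regex has succeeded.
my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';
my ($plain)  = $dna =~ /atg(.+?)(?:taa|tag|tga)/;
my ($atomic) = $dna =~ /atg(.+?)(?>taa|tag|tga)/;
print "(?:...): $plain, (?>...): $atomic\n";
```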

      ... in your real code, unless you're ensuring that $dna is always lowercase (e.g. by using lc), you should also add the i modifier ...

      I would emphasize that it's usually very important to ensure lower- (or canonical-) casing for long string matches because the  //i modifier can impose a significant performance hit. In the benchmarks I did here, just adding  //i to my equivalent of the
          /atg(.+?)(?:taa|tag|tga)/
      regex incurred a 30% - 35% hit.
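
      For anyone who wants to measure this themselves, a rough sketch using the core Benchmark module follows. This is not the benchmark referred to above: the test data (the thread's sample repeated an arbitrary number of times) and the sub names are assumptions, and the size of the difference will vary with Perl version and input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: the thread's sample repeated to get a longer
# string. The repeat count is arbitrary.
my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt' x 2_000;

# Count the matches with and without the i modifier.
sub count_plain  { my $n = () = $dna =~ /atg(.+?)(?:taa|tag|tga)/g;  return $n }
sub count_nocase { my $n = () = $dna =~ /atg(.+?)(?:taa|tag|tga)/gi; return $n }

# A negative count means: run each sub for at least that many CPU seconds.
cmpthese(-1, { plain => \&count_plain, nocase => \&count_nocase });
```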

        "... the modifiers xms ... would be better as:"

        Well, I've seen some things taken out of context in my time but I think this one takes the biscuit. I'm not annoyed; I actually got a bit of a chuckle. I was, however, somewhat surprised.

        Just so we're clear, let me highlight the seven words you pulled out of three sentences in order to get something to quibble about:

        "I note that the modifiers xms are written in the same (alphabetically) unordered way as they appear throughout Perl Best Practices (PBP). So, either you've just copied those from somewhere else and don't know what they mean (see perlre - Modifiers) or you're required to follow PBP. If the latter, you should use warnings (see also -w in perlrun) and the regular expression would be better as:

        My regex following "You can clean that up by ..." was:

        /atg(.+?)(?:taa|tag|tga)/

        That was my solution. I was happy with it. I'm still happy with it.

        I then went on to say that if the OP was "required to follow PBP" then certain other things should be done. These included using warnings and making some changes to my regex. I wasn't advocating blind adherence to PBP and I don't believe anything in my post suggested that.

        As far as the order of the modifiers goes: write them any way you want. PBP typically has xms (and, yes, that's the order in which they are presented in the book); I prefer to write them alphabetically (that's just me); to the best of my knowledge, the regex engine doesn't care what order you use.
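
        A quick way to check this is to compile the same pattern with both orderings and compare the stringified Regexp objects. The variable names below are illustrative only.

```perl
use strict;
use warnings;

my $pbp_order   = qr/atg (.+?) (?:taa|tag|tga)/xms;
my $alpha_order = qr/atg (.+?) (?:taa|tag|tga)/msx;

# Stringifying a Regexp object shows the flags in the engine's own fixed
# order, whatever order they were written in.
print "$pbp_order\n$alpha_order\n";
print "identical\n" if "$pbp_order" eq "$alpha_order";
```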

        Regarding the (?>pattern) construct, I claimed no advantage to using this. It's just another PBP guideline: "... rewrite any instance of: X | Y as: (?> X | Y )" [truncated extract from page 271].

        Finally, you make a good point about "lower- (or canonical-) casing". I concur.

        -- Ken

Re: Simple regex question. Grouping with a negative lookahead assertion.
by AnomalousMonk (Chancellor) on Jul 14, 2013 at 05:01 UTC

    Just one question: In the sequence  'atgaaaaa' (which is not terminated by any of (taa|tag|tga)), what should be matched? From the discussion in the thread so far, I assume the answer is 'nothing'.
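
    That assumption can be checked directly. A minimal sketch, using the look-ahead form of the pattern from earlier in the thread:

```perl
use strict;
use warnings;

# No taa, tag or tga follows the captured run, so nothing matches.
my @found = 'atgaaaaa' =~ /atg([acgt]+?)(?=taa|tag|tga)/g;

print scalar(@found), " match(es)\n";   # prints "0 match(es)"
```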

    With that assumption in hand, here's a small variation on BrowserUk's approach, which is easily adapted to capture all kinds of info about each match. This needs Perl version 5.10+ for  ${^MATCH} and  \K and the  //p regex modifier. If only the matching sub-sequences are needed, it can capture directly to an array. Because it does not use capture groups, it may be slightly faster, but I have not Benchmark-ed this.

    >perl -wMstrict -le
    "my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';
     ;;
     my @sub_seqs;
     push @sub_seqs, [ ${^MATCH}, $-[0] ]
         while $dna =~ m{ atg \K [acgt]+? (?= taa | tag | tga) }xmspg;
     ;;
     printf qq{%d sub-sequence(s) \n}, scalar @sub_seqs;
     ;;
     print $dna if @sub_seqs;
     for my $ar_sub_seq (@sub_seqs) {
         my $cursor = ('-' x $ar_sub_seq->[1]) . ('^' x length $ar_sub_seq->[0]);
         print $cursor;
     }
     ;;
     my @ss = $dna =~ m{ atg \K [acgt]+? (?= taa | tag | tga) }xmspg;
     printf qq{'$_' } for @ss;
    "
    2 sub-sequence(s)
    atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt
    -------------^^^
    -------------------------------------^^^^^^^^^^
    'gga' 'gcgccccggc'

      Sorry pal. Most of your posts -- especially those regarding regex -- get an upvote from me, but this one got --. It's a crock.


        Sorry pal. Most of your posts -- especially those regarding regex -- get an upvote from me, but this one got --. It's a crock.

        Apparently it was so bad, you tried to -- it three times!

        I was curious about the locus of crockitudinousness and decided to do some benchmarking, usually at the root of these squabbles. (Update: Benchmarked variations include some of those used by kcott here.) I must admit I was shocked, shocked by the results. There were no big surprises until I looked at the effect of the  //p regex modifier. Simply adding this modifier to
            m{ atg ([acgt]+?) (?= taa|tag|tga) }xmsg
        in the  push @ra, $1 variation ($push_cg below, which otherwise performs roughly comparably to the other variations) slows its performance by orders of magnitude, so much so that I didn't have the patience to run the benchmark to completion.

        Am I doing this right? (Update: I.e., is the effect of the use of  //p as in the  $push_KM sub below, which I don't even have the patience to benchmark, really so egregious?) Is this all down to the  //p modifier? And if so, have the proper authorities been notified? If you've touched on this in other threads, I have not been following these discussions as carefully as I ought. Anyway, here's my benchmark code. As always, I would be interested in any comments you might have.

Re: Simple regex question. Grouping with a negative lookahead assertion.
by Anonymous Monk on Jul 14, 2013 at 01:32 UTC

    I am trying to capture the very short string after the 'atg' and before (not including) either 'taa', 'tag', or 'tga'.

    This doesn't explain adequately what you're trying to match,

    but the program/regex you posted does actually match, so what is the problem that you're trying to solve?

    see how it runs with use re 'debug';
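
    A minimal sketch of how that might look. The enclosing block is optional and is used here only to keep the (very verbose) trace limited to one regex; the trace itself is written to STDERR.

```perl
use strict;
use warnings;

my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';
my $captured;

{
    # Lexically scoped: only regexes compiled inside this block are traced.
    use re 'debug';
    ($captured) = $dna =~ /atg([acgt]+)(?!(taa|tag|tga))/xms;
}

print "captured: $captured\n";
```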

      I am trying to capture the very short string after the 'atg' and before (not including) either 'taa', 'tag', or 'tga'.

      The very short string (three nucleotides) is the one after the first 'atg' and before any of the three stop codons (in DNA form, i.e., before transcription has occurred)... in other words 'gga', since the 'taa' which immediately follows should (ideally) prevent further matching.

      For example:

      $dna = q/attatcgatgaaattagggctaatctcgcggggcctat/;
                      ^-^    ^-^
                      match  match and exit


      The characters (nucleotides) between the markers (and only these) should be captured and accessible in $1.
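
      For reference, a sketch of that requirement applied to this second example, using the non-greedy pattern suggested elsewhere in the thread. The variable name $captured is illustrative.

```perl
use strict;
use warnings;

my $dna = q/attatcgatgaaattagggctaatctcgcggggcctat/;

# Capture as few of [acgt] as possible between the first 'atg' and the
# first stop codon that follows it.
my ($captured) = $dna =~ /atg([acgt]+?)(?:taa|tag|tga)/;

print "$captured\n";   # prints "aaat"
```

      For this particular string the capture is four nucleotides long, because the first stop codon after 'atg' is the 'tag' that begins four characters later.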
