http://www.perlmonks.org?node_id=987226

rnaeye has asked for the wisdom of the Perl Monks concerning the following question:

Hi! Monks,
I am trying to select lines that contain a DNA string of a certain length. The following code prints any DNA string that is 8 nucleotides or longer. However, I want to print only DNA strings that are exactly 8 nucleotides long, such as "ATGATGAC". I thought {8} would match exactly 8 characters, but it looks like I am wrong! I also tried ATGC{8,8}; that did not work either.
In addition, in a separate program, I want to select DNA strings that are between 8 and 21 nucleotides long. Can you please give me any suggestions?
Thank you.
PS. I was able to solve this problem using the "length" function without any regex, but I would like to learn the regex solution to this problem.

#!/usr/bin/perl
use warnings;
use strict;

while (<DATA>){
    my $string = $_;
    if ($string =~ /[ATGC]{8}/) {
        print "$string";
    }
}
__DATA__
@3009W:27:32
GCTCT
+
%.8:9
@3009W:27:40
TTGGG
+
0(*2+
@3009W:31:26
AGCCT
+
5<=46
@3009W:31:35
TCAGAAAACTG
+
0.5*.--%-0-
@3009W:32:34
GGGCCTAACCTGGGAGCCCCT
+
A@.:158+,--*-%-**--%-
@3009W:34:32
CCATCATCTGGGG
+
:-:>>;;55755&
@3009W:36:21
GACTT
+
(8.7(
@3009W:40:24
ATGATCC
+
44.0,.%
@3009W:42:22
GCTTCCAGGGTCAGTTTGGGAAAC
+
:@>4;4888)1//**-%+5+25,.
@3009W:47:23
GAGCATCGA
+
%*1.0...-
@3009W:49:23
GAGTTCCATCGAAATGTACAAGCTTTACGTTTAAAAC
+
/3....0304036-22.,--(*.09*00,11),00(.
@3009W:14:90
AGCAA
+
82528
@3009W:17:84
GAAACACAC
+
05?4=:<:0
@3009W:17:95
TTTTTCTTT
+
;<<<-07<1
@3009W:19:89
CCTCTACC
+
?:>>:;83
@3009W:19:90
AAGAA
+
:4<;2
@3009W:20:74
GGTTCC
+
2&-.2.
@3009W:22:94
CATTTGGAA
+
AAAB9>8>:
@3009W:23:79
CTTACAA
+
@@9@@@@
@3009W:23:93
TCTTTTTC
+
@@@AAA/A
@3009W:24:80
GTGAGC
+
<AAA@@
@3009W:25:79
AATAT
+
?8=.0
@3009W:26:89
AGGCA
+
BB>BC
@3009W:26:99
ATCCATAT
+
/88(3979
@3009W:27:83
AGGCA
+
AA>@@

Re: Why doesn't quantifier work with character classes?
by BrowserUk (Patriarch) on Aug 13, 2012 at 22:21 UTC
    I thought {8} will match exactly 8 characters, but looks like I am wrong!

    You aren't wrong: it does match exactly 8 characters. But if those 8 characters are at the start of a line containing more than 8 characters, the regex still succeeds by matching exactly the first 8.

    You need to anchor your regex, i.e. /^[ATGC]{8}$/. Now it will only match lines that consist of exactly 8 of those characters.
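To make the difference concrete, here is a minimal sketch comparing the two patterns on three of the sequences mentioned in this thread. The helper names has_8mer and is_exact_8mer are made up for illustration. The lines keep their trailing newline, as they would when read from <DATA>; without the /m modifier, $ also matches just before a final newline, so no chomp is needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two patterns under discussion.
sub has_8mer      { my ($line) = @_; return $line =~ /[ATGC]{8}/   ? 1 : 0 }
sub is_exact_8mer { my ($line) = @_; return $line =~ /^[ATGC]{8}$/ ? 1 : 0 }

# Lines as read from a filehandle, trailing newline still attached.
for my $line ("ATGATGAC\n", "TCAGAAAACTG\n", "GCTCT\n") {
    (my $shown = $line) =~ s/\n\z//;    # strip the newline for display only
    printf "%-12s unanchored=%d anchored=%d\n",
        $shown, has_8mer($line), is_exact_8mer($line);
}
```

The 11-nucleotide line passes the unanchored test but fails the anchored one, the 8-nucleotide line passes both, and the 5-nucleotide line fails both.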


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Please correct me if I am wrong (or if I misunderstood your statement), but there is nothing in OP's regex that restricts a match to the beginning of a line. Hence, the eight consecutive characters may appear anywhere on a line. Right?


      What can be asserted without proof can be dismissed without proof. - Christopher Hitchens, 1949-2011
        Hence, the eight consecutive characters may appear anywhere on a line. Right?

        Yes, but the lines that match the character class in question contain only those characters, and a regex always matches as early as possible, so in these cases the match starts at the beginning of the line.
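A small sketch of the "as early as possible" behaviour, using the built-in @- array to report where the match began. The helper name first_8mer is invented for this example, and the second input string (with leading N characters) is hypothetical, not taken from the posted data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (start offset, matched text) for the
# leftmost run of 8 bases, or an empty list if there is none.
sub first_8mer {
    my ($line) = @_;
    return unless $line =~ /([ATGC]{8})/;
    return ($-[0], $1);    # $-[0] is the offset where the whole match starts
}

# A pure-base line from the posted data: the match starts at offset 0.
my ($offset, $hit) = first_8mer("TCAGAAAACTG\n");
print "offset=$offset matched=$hit\n";    # offset=0 matched=TCAGAAAA

# Hypothetical line with two non-base characters in front: the match
# starts later, which shows the unanchored pattern is not tied to the start.
($offset, $hit) = first_8mer("NNATGATGACGT\n");
print "offset=$offset matched=$hit\n";    # offset=2 matched=ATGATGAC
```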



        Correct!
        BrowserUk's and thezip's suggestions solved the problem. Thanks.

Re: Why doesn't quantifier work with character classes?
by thezip (Vicar) on Aug 13, 2012 at 22:18 UTC

    Try:

    ... if ($string =~ /^[ATGC]{8}$/) { ...

    Note the caret and dollar-sign. This should match sequences that are exactly eight characters long.
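The second part of the question (sequences between 8 and 21 nucleotides long) is not answered explicitly in this thread, but the same anchoring should carry over with a {min,max} quantifier. A minimal sketch, assuming the same one-sequence-per-line input; the helper name is_8_to_21mer is made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the line consists of 8 to 21 bases and
# nothing else (apart from an optional trailing newline).
sub is_8_to_21mer {
    my ($line) = @_;
    return $line =~ /^[ATGC]{8,21}$/ ? 1 : 0;
}

# Sequences of length 7, 8, 21 and 24 taken from the posted data.
for my $seq (qw(ATGATCC CCTCTACC GGGCCTAACCTGGGAGCCCCT GCTTCCAGGGTCAGTTTGGGAAAC)) {
    printf "%2d nt  %s  %s\n",
        length $seq, is_8_to_21mer("$seq\n") ? 'keep' : 'skip', $seq;
}
```

Only the 8- and 21-nucleotide sequences are kept. With the anchors in place, {8,8} behaves the same as {8}.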

