PerlMonks  

self limiting regex help

by spq (Friar)
on May 22, 2002 at 15:12 UTC (#168461=perlquestion)
spq has asked for the wisdom of the Perl Monks concerning the following question:

I have a database table containing regular expressions that are run against strings (sequences of characters to be used in DNA oligo synthesis) as part of a QC process. So far, this has worked great. But now I have a condition that I'm having trouble writing the regex for.

The string is required to contain only alpha characters. Case is not important, and mixed case is allowed, so the i modifier is used for all match expressions.

So here's the outline of the new condition. Any number of A, T, C and G are always allowed. Some orders may contain symbols representing degenerate positions, however. For example, R may be used to represent a position that can be either A or G. The full list of possible alternate symbols is R, Y, M, K, S, W, H, B, V, D, and N. Although any of them may be used, no more than two different alternate codes can appear in a given string (a mechanical limitation of the synthesis machine).

So, the challenge is to have a single regular expression that will match a sequence containing any number of A, T, C, G and any number of occurrences of no more than two different characters from the above alternate codes.

Thanks in advance for whatever wisdom and guidance you can bestow!

Re: self limiting regex help
by ferrency (Deacon) on May 22, 2002 at 15:23 UTC
    If your string is allowed to be empty, try this:
    print "match" if $string =~ /[atcg]*([RYMKSWHBVDN][atcg]*){0,2}/i;
    If not, try this:
    print "match" if $string and $string =~ /[atcg]*([RYMKSWHBVDN][atcg]*){0,2}/i;
    (Warning, neither was tested)

    Update: Sorry about that. You're both right, I misunderstood the initial request. The regex above matches up to 2 occurrences of any of the alternate codes, not any number of occurrences of up to 2 of the alternate codes.

    You could do what you want with an extremely long regex which enumerates every combination of 2 alternate codes. That's a really bad answer, though: it would be much shorter and more straightforward to do it with regexes supplemented by other perl code.
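A minimal sketch of the "regex supplemented by other Perl code" idea mentioned above, not the poster's own code: a global match collects every alternate code into a hash, and the number of distinct keys is compared against the limit. The helper name `within_code_limit` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $seq uses at most two different
# alternate (degenerate) codes, ignoring case.
sub within_code_limit {
    my ($seq) = @_;
    my %seen;
    # Record each alternate code found, upper-cased so r and R count once.
    $seen{ uc $1 }++ while $seq =~ /([RYMKSWHBVDN])/gi;
    return keys(%seen) <= 2;
}

for my $seq (qw(ATCGGTATATATRGTCGAYGCRGTCAGA ATCGGTATATATRGTCGAYGCNGTCAGA)) {
    printf "%s %s\n", $seq, within_code_limit($seq) ? 'passes' : 'fails';
}
```

This only checks the two-code limit; it assumes a separate check already rejects characters outside the allowed alphabet.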

    Sorry for the wrong answer :)

    Alan

      Hmm, that looks like it would match fine. But I don't see how it would limit a string to containing any number of occurrences of only two of the alternate codes?

      In case I wasn't clear in my first posting, the regex should match on a string that is within the QC criteria, but fail if not. So:

      ATCGGTATATATRGTCGAYGCRGTCAGA

      Would be matched, but:

      ATCGGTATATATRGTCGAYGCNGTCAGA

      Wouldn't, because the N near the end introduces a third ambiguity code.

      That doesn't solve the any number of no more than two different characters part. Both solutions posted so far will match up to 2 non-atcg characters, but I read the spec as allowing for, say 50 Rs and 100 Ws to be mixed in, but not one each of Y, H, and N.

      I could be wrong, but I have a feeling that this one can't be solved by a single regex. (Or at least not one written by a mere mortal - there are some regex deities floating around here who might prove me wrong...)
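To make the gap described above concrete, here is a small sketch that runs the first regex posted in the thread against two cases from the discussion. The anchored variant is an assumption added here for comparison; it was not posted by anyone.

```perl
use strict;
use warnings;

# The first regex from the thread, plus an anchored copy
# (the anchored form is an assumption added for comparison).
my $unanchored = qr/[atcg]*([RYMKSWHBVDN][atcg]*){0,2}/i;
my $anchored   = qr/^[atcg]*([RYMKSWHBVDN][atcg]*){0,2}$/i;

my @cases = (
    [ 'ATCGGTATATATRGTCGAYGCNGTCAGA', 'three codes, should fail QC' ],
    [ 'A' . ('R' x 50) . ('W' x 100), 'two codes, should pass QC'   ],
);

for my $case (@cases) {
    my ($seq, $note) = @$case;
    printf "%-28s unanchored=%s anchored=%s\n", $note,
        ($seq =~ $unanchored ? 'match' : 'no match'),
        ($seq =~ $anchored   ? 'match' : 'no match');
}
```

Without anchors the pattern can match an empty substring, so it succeeds on any input. With anchors it counts occurrences rather than distinct codes, so the 50 Rs and 100 Ws case is rejected even though it uses only two codes.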

Re: self limiting regex help
by vladb (Vicar) on May 22, 2002 at 15:26 UTC
    To limit the number of 'special' characters you may have in a matching text, you could use this:
    /[RYMKSWHBVDN]{0,2}/i
    However, I'm not quite sure how to integrate this piece so that it also satisfies this requirement:
    any number of A,T,C,G ...
    
    Could you try something along these lines:
    /[ATCG][RYMKSWHBVDN]{0,2}/i
    Oww, but then, they could be mixed right?

    UPDATE: Oh well, I believe the solution offered by ferrency is somewhat closer to what you need (Note: I didn't notice his solution until I had actually submitted my alternative ;/)

    _____________________
    $"=q;grep;;$,=q"grep";for(`find . -name ".saves*~"`){s;$/;;;/(.*-(\d+) +-.*)$/;$_=&#91"ps -e -o pid | "," $2 | "," -v "," "]`@$_`?{print" ++ $1"}:{print"- $1"}&&`rm $1`;print"\n";}
      Thanks.

      Although they can be mixed, I suppose stating that there can be any number of ATCGs may mislead. Now that I read your post, I think I may have been getting hung up on that myself. Other QC regexes applied ensure that the string contains only the allowable characters as a class. My current methodology is to get all relevant QC expressions from the database and try each expression in turn against the string. Currently there is no ordering, and I don't think that should matter if all of them have to pass. But I could add it.

      So the real factor is only whether or not more than 2 different codes from the allowable class of alternate codes occur in the string.
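The "try each expression in turn" workflow described above might look roughly like the sketch below. The array stands in for rows fetched from the QC table, and both patterns as well as the helper name `passes_qc` are hypothetical; the second pattern is the backreference regex posted elsewhere in this thread.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the expressions stored in the database.
my @qc_patterns = (
    qr/^[ACGTRYMKSWHBVDN]+$/i,    # only allowable characters
    qr/^[ACGT]*([RYMKSWHBVDN])?(?:[ACGT]|\1)*([RYMKSWHBVDN])?(?:[ACGT]|\1|\2)*$/i,
);

# A sequence passes only if every pattern matches, so the order
# of the patterns does not change the result.
sub passes_qc {
    my ($seq, @patterns) = @_;
    for my $re (@patterns) {
        return 0 unless $seq =~ $re;
    }
    return 1;
}

for my $seq (qw(ATCGGTATATATRGTCGAYGCRGTCAGA ATCGGTATATATRGTCGAYGCNGTCAGA)) {
    printf "%s %s\n", $seq, passes_qc($seq, @qc_patterns) ? 'passes' : 'fails';
}
```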

Re: self limiting regex help
by danger (Priest) on May 22, 2002 at 15:53 UTC

    If I understand your conditions correctly, this should do:

    print if /^ [ACGT]* ([RYMKSWHBVDN])? (?:[ACGT]|\1)*([RYMKSWHBVDN])? (?:[ACGT]|\1|\2)*$ /ix;

      Eureka! I was just attempting to do something similar, but had discovered that \n (a backreference such as \1) doesn't work in a character class.

      Thank you (and everyone who responded) very much for your help!
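A short sketch of the character-class pitfall mentioned above, using a made-up two-character input. Inside brackets Perl does not treat `\1` as a backreference (to my understanding it is read as an octal character escape), which is why the working solution uses alternation instead.

```perl
use strict;
use warnings;

my $seq = 'RR';

# Inside brackets, \1 is not a backreference, so this does not
# mean "the character captured by group 1".
my $class_form = $seq =~ /^([RY])[\1]$/ ? 'match' : 'no match';

# Alternation keeps the backreference meaning of \1.
my $alt_form = $seq =~ /^([RY])(?:[ACGT]|\1)$/ ? 'match' : 'no match';

print "character class: $class_form\n";
print "alternation:     $alt_form\n";
```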

Re: self limiting regex help
by Molt (Chaplain) on May 22, 2002 at 16:03 UTC

    Okay, having read a fair bit of 'Mastering Regular Expressions' today I'm going to attack this one.

    Not going to fully comment the regexp, but essentially it does the following.. matches any number of ATCG's, then a single exception character which it stores in \1, then any number of ATCG's or \1s, then another single different exception character which it stores in \2, then any number of ATCG's, \1s, or \2s. This pattern is anchored to each end of the string too.

    Sorry the regexp isn't nicely laid out, but it should work.

    If this doesn't do quite what you want let me have some more test data and I'll fix it. Nice puzzle!

    #!/usr/bin/perl -w
    use strict;

    my @tests = (
        'ATCG',
        'ATCGGTATATATRGTCGAYGCRGTCAGA',
        'ATCGGTATATATRGTCGAYGCNGTCAGA',
    );

    foreach (@tests) {
        if (/^[ATCG]*([RYMKSWHBVDN])?(?:[ATCG]|\1)*([RYMKSWHBVDN])?(?:[ATCG]|\1|\2)*$/i) {
            print "$_ matches\n";
        }
        else {
            print "$_ does not match\n";
        }
    }

    Update: Slight change, original version demanded two exceptional codes. Oops.

    Update 2: Yes, this is the same as Danger's code above. Ah well, two people coming up with the same solution at least inspires confidence.
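For readers who want the pattern laid out, here is the same regex written with the /x modifier and one comment per step of the description above. The variable name `$qc` and the extra test strings are illustrative additions.

```perl
use strict;
use warnings;

my $qc = qr/
    ^
    [ACGT]*                  # any number of plain bases
    ([RYMKSWHBVDN])?         # optional first alternate code, captured as \1
    (?: [ACGT] | \1 )*       # plain bases or repeats of the first code
    ([RYMKSWHBVDN])?         # optional second alternate code, captured as \2
    (?: [ACGT] | \1 | \2 )*  # plain bases or repeats of either code
    $
/ix;

my @tests = (
    'ATCGGTATATATRGTCGAYGCRGTCAGA',     # two codes (R, Y)
    'ATCGGTATATATRGTCGAYGCNGTCAGA',     # three codes (R, Y, N)
    'A' . ('R' x 50) . ('W' x 100),     # many repeats of two codes
);

for my $seq (@tests) {
    print $seq =~ $qc ? "match:    $seq\n" : "no match: $seq\n";
}
```

Note that `$` also allows a trailing newline; `\z` would be the stricter choice if input lines are not chomped.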
