PerlMonks  

generate regular expression

by khoueiry (Initiate)
on Mar 30, 2006 at 16:21 UTC ( [id://540200]=perlquestion )

khoueiry has asked for the wisdom of the Perl Monks concerning the following question:

I have a general question about generating regular expressions.
I have a relatively long sequence over the four-letter alphabet ATGC.


e.g:
ATCACTGGTTCCTGGACACTACCCTAAACCTTTGAGGA
AATAACCGCTTTGTTGTTGCGATCGCCTAATAAATATC
AGCGTCTTCGTATGATAAACCAATGCGGAAGTACAAAA
TAAAGAGACTGTATTATGTTACT...

I want to generate a regular expression from a user's query. For example, I want to search my sequence for:

"2 CAC and 2 TTT"

"1 A|T CAC" #at least one CAC should be preceded by A or T

"2 C|A TTT T|G" #at least two TTT should be preceded by C or A and followed by T or G

Finally, I want all of these matched within a window of 50 letters, without taking overlaps into account.

Actually, I need to know whether these kinds of searches are feasible with regexes.
Is there any CPAN module that can help me do that?

Since I have to run a new search for every user query, I would like to generate a regex that satisfies the query and search my sequences with it.

Thanks for any help

Replies are listed 'Best First'.
Re: generate regular expression
by QM (Parson) on Mar 30, 2006 at 17:03 UTC
    I have no idea what you mean by "2 CAC and 2 TTT". Is this in one string? Side by side? Intermixed?

    Unless you come up with a language that maps one-to-one to Perl regex, your users are going to be surprised from time to time by your implementation.

    Wouldn't it be better to have them use a subset of Perl regex directly?

    You could define what this subset is, and let Perl do all of the heavy lifting.
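One way to read the "subset of Perl regex" suggestion is sketched below. The helper name compile_user_pattern and the exact set of allowed characters are assumptions made for illustration, not something specified in this thread. The idea is to check the user's pattern against a whitelist first and then let Perl compile it, trapping malformed patterns.

```perl
use strict;
use warnings;

# Hypothetical helper: accept only a small subset of regex syntax
# (bases, character classes, alternation, grouping, simple quantifiers).
# Returns a compiled regex, or undef if the pattern is not acceptable.
sub compile_user_pattern {
    my ($pattern) = @_;
    return undef unless defined $pattern && length $pattern;

    # Reject anything outside the characters allowed by the subset.
    return undef unless $pattern =~ /\A[ACGT\[\]()|.*+?{}0-9,]+\z/;

    # The pattern may still be malformed (e.g. "[AT"), so trap compile errors.
    my $re = eval { qr/$pattern/ };
    return $re;
}

my $seq = 'ATCACTGGTTCCTGGACACTACCCTAAACCTTTGAGGA';
for my $query ('[AT]CAC', 'CAC.*CAC', '(?{ system "ls" })') {
    my $re = compile_user_pattern($query);
    my $status = !defined $re ? 'rejected'
               : $seq =~ $re  ? 'match'
               :                'no match';
    print "$query: $status\n";
}
```

Patterns that contain anything beyond the whitelisted characters never reach the regex engine, and the rest are compiled by Perl itself, so the user-facing language stays a true subset of Perl regex.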

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      QM,

      Sorry for the lack of precision about the distribution of the motifs.
      The patterns should be intermixed, and the whole sequence should be treated as one string.
      If you can point me to CPAN modules that cover some of my needs, that will be sufficient, I think.

      Thanks a lot

      Pierre
        I think you missed the point.

        You're asking for a CPAN module to convert "English to Perl Regex". The reason Perl isn't written completely in English is because English is often ambiguous. Just examine this thread -- if we have to ask for clarification, then it's not useful as a regex specification.

        You would be better served by dumping the idea of a module to solve your problem and training your users instead. If they can't specify what they want clearly, having it in English or another natural language isn't going to help.

        For example, depending on what you mean by "2 CAC and 2 TTT", this might DWYW:

        ((() = m/(CAC)/g) == 2) and ((()=m/(TTT)/g) == 2)
        But even with the context-sensitive regex engine, it is awkward (and error-prone) to specify this only in a single regex.
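The counting idiom in that one-liner can be wrapped in a small function. This is only a sketch: count_motif is a hypothetical name, and testing for "at least" rather than "exactly" the requested number is an assumption about what the query means.

```perl
use strict;
use warnings;

# Count non-overlapping occurrences of a literal motif. Assigning the
# match list to () and then to a scalar yields the number of matches.
sub count_motif {
    my ($seq, $motif) = @_;
    my $count = () = $seq =~ /\Q$motif\E/g;
    return $count;
}

my $seq = 'ATCACTGGTTCCTGGACACTACCCTAAACCTTTGAGGA';

# Assumed reading of "2 CAC and 1 TTT": at least that many of each.
my $ok = count_motif($seq, 'CAC') >= 2 && count_motif($seq, 'TTT') >= 1;
print $ok ? "criteria met\n" : "criteria not met\n";
```

Because each motif is counted in its own pass, the order in which the motifs appear in the sequence does not matter, which is hard to express in a single regex.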

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

Re: generate regular expression
by wazzuteke (Hermit) on Mar 30, 2006 at 19:14 UTC
    Another solution might be simply to generate the regular expression from the user input. I'm assuming a simple command-line script or something similar, since I didn't see the source of the 'users query'; it would be easy to port to a web app or the like. You could do something along the lines of:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $regex_cache = {};

    die "Usage $0 [CAC <NUM> TTT <NUM>]\n" if ( @ARGV % 2 );
    my %input = @ARGV;    # Where the input would be : CAC 2 TTT 2

    # Read every DATA line and strip whitespace so the sequence is one string
    my $data_set = do { local $/; <DATA> };
    $data_set =~ s/\s+//g;

    for my $in ( keys %input ) {
        # Builds e.g. "(CAC){2}", which matches adjacent copies such as CACCAC
        my $reg_key = "($in)\{$input{$in}\}";
        my $reg = $regex_cache->{$in} ||= qr/($reg_key)/s;
        print "$1\n" if ( $data_set =~ $reg );
    }

    __DATA__
    ATCACCACTTCCTGGACACTACCCTAAACCTTTGAGGA
    AATAACCGCTTTGTTGTTGCGATCGCCTAATAAATATC
    AGCGTCTTCGTATGATAAACCAATGCGGAAGTACAAAA

    Now, like the other commenters in this thread, I'm not really sure what kind of ordering you are looking for in the set: whether 2 CAC means 'CACCAC' or 'CAC\w*CAC', etc. Since I may have missed this, the construction of $reg_key can be changed to something else. Either way, the script can still parse the file based on some sort of input, which is what I believe you were generally looking for.

    Sorry if I'm way off base here and my input doesn't help, although I certainly hope it does. Good luck!

    ---hA||ta----
    print$_ for(map{chr($_)}split(/\s+/,join(/\B?:\w+[^\s+]/,<DATA>))); __DATA__ 67 111 100 101 32 80 101 114 108
      Thanks. By 2 CAC I meant separated motifs (CAC\w*CAC). I will test that.
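Under that reading, the pattern for N separated copies of a motif can be built by joining N copies with a gap. The following is a minimal sketch; separated_motif_regex is a hypothetical helper, and the non-greedy gap \w*? is a choice made here (for a yes/no test it accepts the same strings as \w*).

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex that requires $count copies of
# $motif separated by any number of letters, e.g. (2, 'CAC') gives
# the equivalent of /CAC\w*?CAC/.
sub separated_motif_regex {
    my ($count, $motif) = @_;
    my @copies  = (quotemeta($motif)) x $count;
    my $pattern = join '\w*?', @copies;
    return qr/$pattern/;
}

my $seq = 'ATCACTGGTTCCTGGACACTACCCTAAACCTTTGAGGA';
my $re  = separated_motif_regex(2, 'CAC');
print "two separated CAC motifs found\n" if $seq =~ $re;
```

Swapping this in for the "(CAC){2}" construction changes the meaning from adjacent copies to copies anywhere in the same string, in order.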

Re: generate regular expression
by doc_faustroll (Scribe) on Mar 30, 2006 at 16:31 UTC
    This looks suspiciously like bioperl territory, and you are looking at something of an app here. Download bioperl. Read the docs. Think about the problem space.
      Thanks a lot,
      I already know bioperl and its packages. There is no package in bioperl to treat that. I'm posting in perlmonks to "seek perl wisdom" on that issue.
Re: generate regular expression
by injunjoel (Priest) on Mar 30, 2006 at 18:53 UTC
    Greetings,
    In your example you have two different situations:
    The "and" situation "2 CAC and 2 TTT", in which case it's really two separate searches.
    And the "single" search situation "1 A|T CAC", "2 C|A TTT T|G".

    If your search criteria is submitted as you have specified you could split the query on "and" or "or" to handle your first situation. Once split, capture out the count criteria and test with the scalar return value from your array of matches using the
    (@array) = $string =~ /$pattern/g
    idiom.

    Untested Idea!
    my $base_data = 'ATCACTGGTTCCTGGACACTACCCTAAACCTTTGAGGA '
                  . 'AATAACCGCTTTGTTGTTGCGATCGCCTAATAAATATC '
                  . 'AGCGTCTTCGTATGATAAACCAATGCGGAAGTACAAAA '
                  . 'TAAAGAGACTGTATTATGTTACT';

    #the user submitted search pattern
    my $search_submitted = '2 CAC and 2 TTT';

    #split it into chunks if applicable.
    my @search_chunks = split /and|or/, $search_submitted;

    #for each distinct pattern
    foreach my $chunk (@search_chunks){
        #get the count we are looking for and the pattern we want to use
        my ($count, $search_string) = $chunk =~ /\s?(\d+)\s?([ATGCU\s\|]+)/;

        #replace the |'s with character classes.
        $search_string =~ s/([ATGCU])\|([ATGCU])/[$1$2]/g;

        #replace all spaces
        $search_string =~ s/\s+//g;

        #run the match and see how many we get.
        my (@search_count) = $base_data =~ /$search_string/g;

        #check our results.
        if(scalar @search_count >= $count){
            print "Found it!\n";
        }else{
            print "Nope...".scalar @search_count."\n";
        }
    }
    Is that sort of what you were thinking of?


    -InjunJoel
    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo
      Thanks a lot.
      Actually it is very close to what I was thinking of. I found that I have to split the search into different steps instead of making only one complicated query.

      Pierre
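The step-by-step approach could also cover the 50-letter window from the original question. The sketch below is not from the thread: windows_matching is a hypothetical helper, and it assumes that "window of 50 letters without taking overlaps into account" means consecutive, non-overlapping 50-letter chunks of the sequence. Each criterion is a pair of a minimum count and a compiled regex.

```perl
use strict;
use warnings;

# Hypothetical helper: return the start offsets of all non-overlapping
# windows of $width letters in which every criterion is satisfied.
# Each criterion is an array ref: [ minimum_count, qr/pattern/ ].
sub windows_matching {
    my ($seq, $width, @criteria) = @_;
    $seq =~ s/\s+//g;    # treat the whole sequence as one string

    my @hits;
    for (my $start = 0; $start < length($seq); $start += $width) {
        my $window = substr($seq, $start, $width);
        my $all_ok = 1;
        for my $criterion (@criteria) {
            my ($min, $re) = @$criterion;
            my $found = () = $window =~ /$re/g;    # count matches in window
            if ($found < $min) { $all_ok = 0; last; }
        }
        push @hits, $start if $all_ok;
    }
    return @hits;
}

my $seq = join '', qw(
    ATCACTGGTTCCTGGACACTACCCTAAACCTTTGAGGA
    AATAACCGCTTTGTTGTTGCGATCGCCTAATAAATATC
    AGCGTCTTCGTATGATAAACCAATGCGGAAGTACAAAA
);
my @starts = windows_matching($seq, 50, [2, qr/CAC/], [2, qr/TTT/]);
print @starts ? "matching windows start at: @starts\n"
              : "no window satisfies the query\n";
```

A motif that straddles a window boundary is not counted in either window under this assumption; a sliding window (advancing by one letter instead of by the full width) would catch those at a higher cost.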
Re: generate regular expression
by swampyankee (Parson) on Mar 30, 2006 at 18:34 UTC

    I think I understand what you're looking for: you want a function (or module) that will take natural language queries ("2 CAC and 2 TTT") and convert them into Perl code, which may or may not include regex.

    Now, my regex skills are below par, but I believe your second and third cases can be managed with single regexes. Your first case may require two.
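As a rough illustration of the single-regex claim for the second and third cases: the flanking letters can be written as character classes, here placed in lookarounds so that they are checked but not consumed. This is a sketch only; count_flanked is a hypothetical helper, not something proposed in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: count occurrences of $motif that are preceded by
# one of the letters in $before and followed by one of the letters in
# $after. An empty $before or $after means "no constraint on that side".
sub count_flanked {
    my ($seq, $before, $motif, $after) = @_;
    my $pattern = quotemeta $motif;
    $pattern = "(?<=[${before}])" . $pattern if length $before;
    $pattern = $pattern . "(?=[${after}])"   if length $after;
    my $count = () = $seq =~ /$pattern/g;
    return $count;
}

my $seq = 'ATCACTGGTTCCTGGACACTACCCTAAACCTTTGAGGA';

# "1 A|T CAC": at least one CAC preceded by A or T
print "case 2 satisfied\n" if count_flanked($seq, 'AT', 'CAC', '') >= 1;

# "2 C|A TTT T|G": at least two TTT preceded by C or A and followed by T or G
print "case 3 satisfied\n" if count_flanked($seq, 'CA', 'TTT', 'TG') >= 2;
```

The first case ("2 CAC and 2 TTT") still needs one count per motif, since the two motifs may appear in any order.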

    emc

    "Being forced to write comments actually improves code, because it is easier to fix a crock than to explain it. "
    —G. Steele
