Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

An imperfect pattern matcher writer

by scain (Curate)
on Oct 02, 2001 at 17:46 UTC ( [id://116114]=CUFP: print w/replies, xml ) Need Help??

In bioinformatics, it is often important to determine if a particular sequence is in part of the DNA just sequenced. However, since DNA sequencers are imperfect devices, exact matches to the sequence being searched for is rare. Tools for similar sorts of problems are available, noteably BLAST, but it can be over kill to run the BLAST, parse the results, and then make a desision based on the results. So I wrode a script that will take a sequence being searched for (typically 20 to 30 bases long) and write a new perl script that contains a series of regexes that will allow exactly one mismatch, deletion or insertion.

Here is some example input sequences:

>test1 ATGCATGCATGCATGC >test2 ATGCNATGCNATGCNATGCN
which produces this output:
#!/usr/bin/perl -w $seq = $ARGV[0]; if ( # test1 $seq =~ /\w{0,2}TGCATGCATGCATGC/ or $seq =~ /A\w{0,2}GCATGCATGCATGC/ or $seq =~ /AT\w{0,2}CATGCATGCATGC/ or $seq =~ /ATG\w{0,2}ATGCATGCATGC/ or $seq =~ /ATGC\w{0,2}TGCATGCATGC/ or $seq =~ /ATGCA\w{0,2}GCATGCATGC/ or $seq =~ /ATGCAT\w{0,2}CATGCATGC/ or $seq =~ /ATGCATG\w{0,2}ATGCATGC/ or $seq =~ /ATGCATGC\w{0,2}TGCATGC/ or $seq =~ /ATGCATGCA\w{0,2}GCATGC/ or $seq =~ /ATGCATGCAT\w{0,2}CATGC/ or $seq =~ /ATGCATGCATG\w{0,2}ATGC/ or $seq =~ /ATGCATGCATGC\w{0,2}TGC/ or $seq =~ /ATGCATGCATGCA\w{0,2}GC/ or $seq =~ /ATGCATGCATGCAT\w{0,2}C/ or $seq =~ /ATGCATGCATGCATG\w{0,2}/ or $seq =~ /ATGCATGCATGCATGC\w{0,2}/ or $seq =~ /ZZZ/) { $match=$&; $name = "test1"; $matchlength=length($match); $firstpos=index($seq,$match); print "$name, $match, $matchlength, $firstpos \n"; print "Got it!\n"; } elsif ( # test2 $seq =~ /\w{0,2}TGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /A\w{0,2}GC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /AT\w{0,2}C[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATG\w{0,2}[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC\w{0,2}ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]\w{0,2}TGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]A\w{0,2}GC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]AT\w{0,2}C[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATG\w{0,2}[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC\w{0,2}ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]\w{0,2}TGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]A\w{0,2}GC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]AT\w{0,2}C[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATG\w{0,2}[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC\w{0,2}ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]\w{0,2}TGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]A\w{0,2}GC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]AT\w{0,2}C[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATG\w{0,2}[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC\w{0,2}/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]\w{0,2}/ or $seq =~ /ZZZ/) { $match=$&; $name = "test2"; $matchlength=length($match); $firstpos=index($seq,$match); print "$name, $match, $matchlength, $firstpos \n"; print "Got it!\n"; } elsif ( "false"){ }
and finally, the code that writes the code:
#!/usr/bin/perl -w use strict; my $FILE_IN = $ARGV[0]; my $FILE_OUT = "patternmatcher.pl"; open FILE_IN, $FILE_IN; open FILE_OUT, ">$FILE_OUT"; print FILE_OUT "\#!/usr/bin/perl -w\n"; print FILE_OUT "\$seq = \$ARGV[0]\;\n"; print FILE_OUT "if (\n"; my $wild = "\\w\{0,2\}"; my $onN = "X"; my $onN2 = "\[ATGCN\]"; my $name; while (<FILE_IN>) { if ( $_ =~ /^\>(.+)$/ ) { $name = $1; } elsif ( $_ =~ /^(\w+)$/ ) { my $seq = $1; my $len = length $seq; print FILE_OUT "\# $name\n"; for my $i ( 0 .. $len ) { #go through $seq base by base, repla +cing one character with the wild card my $newseq = $seq; substr( $newseq, $i, 1 ) = $wild; $newseq =~ s/N/$onN/go; # these to lines are to allow Ns +to match any character $newseq =~ s/X/$onN2/go; # print FILE_OUT " \$seq =~ \/$newseq\/ or \n"; } print FILE_OUT " \$seq =~ \/ZZZ\/) \{\n"; #impossible match t +o close the if print FILE_OUT " \$match=\$\&\;\n"; print FILE_OUT " \$name = \"$name\"\;\n"; print FILE_OUT " \$matchlength=length\(\$match\)\;\n"; print FILE_OUT " \$firstpos=index\(\$seq,\$match)\;\n"; print FILE_OUT " print \"\$name, \$match, \$matchlength, \$firstpos \\n\ +"\;\n"; print FILE_OUT " print \"Got it!\\n\"\;\n"; print FILE_OUT "\} elsif \(\n"; } } print FILE_OUT "\"false\"\)\{\n"; # false value to end the series of e +lsifs print FILE_OUT "\}\n"; close FILE_IN; close FILE_OUT; system( "chmod", "u+x", $FILE_OUT );

Replies are listed 'Best First'.
Re: An imperfect pattern matcher writer
by pjf (Curate) on Oct 02, 2001 at 17:58 UTC
    Nice. I can think of a quick'n'easy speed improvement as well. Given how many times you're (potentially) matching on $seq, a study($seq) before the matches is almost certainly going to hurry things up.

    Study examines a string in preparation of doing many pattern matches, and this seems like just the sort of thing it was built for.

    Cheers,
    Paul

      pjf,

      Thanks; I was unaware of study until now. I will reimplement this with study and do some benchmarking.

      Scott

Re: An imperfect pattern matcher writer
by tommyw (Hermit) on Oct 02, 2001 at 18:23 UTC
      First let say, that String::Approx is implemented in C. Has anyone tested how String::Approx stacks up against agrep? I know they are both based on the same algorithm, the Manber-Wu k-differences algorithm shift-add (what a mouthful!). But Jarkko Hietaniemi has written the code without looking at agrep and is not bound by the agrep license (GPL). agrep is used by the glimpse search engine.

      -- stefp

      tommyw,

      Not a bad idea; I wasn't familar with String::Approx before now either. I am torn now as to whether using it would increase or decrease the level of complexity of the resulting code. Also, since it is not uncommon to be testing thousands of $seqs at a time, I would need to consider execution speed.

      I'll think about implementing it this way and let you know.

      Thanks,
      Scott

YALH Re: An imperfect pattern matcher writer
by jeroenes (Priest) on Oct 02, 2001 at 19:49 UTC
    Yet Another Look-Here:

    Have you checked out Clustalw? In general, http://bio.perl.org might contain lotta handy stuff for you. Including nice BLAST interfaces etc. etc. You can also do a 'bio'-search on cpan (which is your friend, naturally).

    Cheers,

    Jeroen

      jeroen,

      Yes, I should have mentioned bioperl in the original text. I am familar with it; we use it for some things, noteably BLAST parsing (in fact, we provided an improved version of BPlite to the author). I am also familar with clustal, but did not feel quite like the right tool for the job. Thanks for the YALH.

      Scott

Re: An imperfect pattern matcher writer
by Fletch (Bishop) on Oct 03, 2001 at 03:01 UTC

    You might consider replacing the multiple print statements with a heredoc or two. It'll cut down on the backwhacking you've got to do. Or use qq{} rather than ""'s for strings that need to be interpolated which include double quotes.

    ... print FILE_OUT <<'EOT'; $seq =~ /ZZZ/) { $match = $&; EOT print FILE_OUT qq{ \$name = "$name";\n}; print FILE_OUT <<'EOT'; $matchlength = length( $match ); $firstpos = index( $seq, $match ); print "$name, $match, $matchlength, $firstpos\n"; print "Got it!\n"; } elsif { EOT ...

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://116114]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (2)
As of 2024-03-19 07:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found