An imperfect pattern matcher writer

In bioinformatics, it is often important to determine if a particular sequence is in part of the DNA just sequenced. However, since DNA sequencers are imperfect devices, exact matches to the sequence being searched for is rare. Tools for similar sorts of problems are available, noteably BLAST, but it can be over kill to run the BLAST, parse the results, and then make a desision based on the results. So I wrode a script that will take a sequence being searched for (typically 20 to 30 bases long) and write a new perl script that contains a series of regexes that will allow exactly one mismatch, deletion or insertion.

Here is some example input sequences:

>test1
ATGCATGCATGCATGC
>test2
ATGCNATGCNATGCNATGCN
[download]

which produces this output:

#!/usr/bin/perl -w
$seq = $ARGV[0];
if (
# test1
  $seq =~ /\w{0,2}TGCATGCATGCATGC/ or
  $seq =~ /A\w{0,2}GCATGCATGCATGC/ or
  $seq =~ /AT\w{0,2}CATGCATGCATGC/ or
  $seq =~ /ATG\w{0,2}ATGCATGCATGC/ or
  $seq =~ /ATGC\w{0,2}TGCATGCATGC/ or
  $seq =~ /ATGCA\w{0,2}GCATGCATGC/ or
  $seq =~ /ATGCAT\w{0,2}CATGCATGC/ or
  $seq =~ /ATGCATG\w{0,2}ATGCATGC/ or
  $seq =~ /ATGCATGC\w{0,2}TGCATGC/ or
  $seq =~ /ATGCATGCA\w{0,2}GCATGC/ or
  $seq =~ /ATGCATGCAT\w{0,2}CATGC/ or
  $seq =~ /ATGCATGCATG\w{0,2}ATGC/ or
  $seq =~ /ATGCATGCATGC\w{0,2}TGC/ or
  $seq =~ /ATGCATGCATGCA\w{0,2}GC/ or
  $seq =~ /ATGCATGCATGCAT\w{0,2}C/ or
  $seq =~ /ATGCATGCATGCATG\w{0,2}/ or
  $seq =~ /ATGCATGCATGCATGC\w{0,2}/ or
  $seq =~ /ZZZ/) {
    $match=$&;
    $name = "test1";
    $matchlength=length($match);
    $firstpos=index($seq,$match);
    print "$name, $match, $matchlength, $firstpos \n";
    print "Got it!\n";
} elsif (
# test2
  $seq =~ /\w{0,2}TGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or
  $seq =~ /A\w{0,2}GC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or
  $seq =~ /AT\w{0,2}C[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or
  $seq =~ /ATG\w{0,2}[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or
  $seq =~ /ATGC\w{0,2}ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]\w{0,2}TGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]A\w{0,2}GC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]AT\w{0,2}C[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]ATG\w{0,2}[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]ATGC\w{0,2}ATGC[ATGCN]ATGC[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]ATGC[ATGCN]\w{0,2}TGC[ATGCN]ATGC[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]ATGC[ATGCN]A\w{0,2}GC[ATGCN]ATGC[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]ATGC[ATGCN]AT\w{0,2}C[ATGCN]ATGC[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATG\w{0,2}[ATGCN]ATGC[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC\w{0,2}ATGC[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]\w{0,2}TGC[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]A\w{0,2}GC[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]AT\w{0,2}C[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATG\w{0,2}[ATGCN]/ or
  $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC\w{0,2}/ or
  $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]\w{0,2}/ or
  $seq =~ /ZZZ/) {
    $match=$&;
    $name = "test2";
    $matchlength=length($match);
    $firstpos=index($seq,$match);
    print "$name, $match, $matchlength, $firstpos \n";
    print "Got it!\n";
} elsif (
"false"){
}
[download]

and finally, the code that writes the code:

#!/usr/bin/perl -w
use strict;

my $FILE_IN  = $ARGV[0];
my $FILE_OUT = "patternmatcher.pl";
open FILE_IN,  $FILE_IN;
open FILE_OUT, ">$FILE_OUT";

print FILE_OUT "\#!/usr/bin/perl -w\n";
print FILE_OUT "\$seq = \$ARGV[0]\;\n";
print FILE_OUT "if (\n";

my $wild = "\\w\{0,2\}";
my $onN  = "X";
my $onN2 = "\[ATGCN\]";
my $name;
while (<FILE_IN>) {
    if ( $_ =~ /^\>(.+)$/ ) {
        $name = $1;
    }
    elsif ( $_ =~ /^(\w+)$/ ) {
        my $seq = $1;
        my $len = length $seq;
        print FILE_OUT "\# $name\n";

        for my $i ( 0 .. $len ) { #go through $seq base by base, repla
+cing one character with the wild card
            my $newseq = $seq;
            substr( $newseq, $i, 1 ) = $wild;
            $newseq =~ s/N/$onN/go;  # these to lines are to allow Ns 
+to match any character
            $newseq =~ s/X/$onN2/go; #
            print FILE_OUT "  \$seq =~ \/$newseq\/ or \n";
        }
        print FILE_OUT "  \$seq =~ \/ZZZ\/) \{\n"; #impossible match t
+o close the if
        print FILE_OUT "    \$match=\$\&\;\n";
        print FILE_OUT "    \$name = \"$name\"\;\n";
        print FILE_OUT "    \$matchlength=length\(\$match\)\;\n";
        print FILE_OUT "    \$firstpos=index\(\$seq,\$match)\;\n";
        print FILE_OUT
          "    print \"\$name, \$match, \$matchlength, \$firstpos \\n\
+"\;\n";
        print FILE_OUT "    print \"Got it!\\n\"\;\n";
        print FILE_OUT "\} elsif \(\n";

    }
}
print FILE_OUT "\"false\"\)\{\n"; # false value to end the series of e
+lsifs

print FILE_OUT "\}\n";
close FILE_IN;
close FILE_OUT;

system( "chmod", "u+x", $FILE_OUT );
[download]

Comment on An imperfect pattern matcher writer Select or Download Code

Replies are listed 'Best First'.
Re: An imperfect pattern matcher writer by pjf (Curate) on Oct 02, 2001 at 17:58 UTC
Nice. I can think of a quick'n'easy speed improvement as well. Given how many times you're (potentially) matching on $seq, a study($seq) before the matches is almost certainly going to hurry things up. Study examines a string in preparation of doing many pattern matches, and this seems like just the sort of thing it was built for. Cheers, Paul	[reply]
Re: Re: An imperfect pattern matcher writer by scain (Curate) on Oct 02, 2001 at 18:07 UTC
pjf, Thanks; I was unaware of study until now. I will reimplement this with study and do some benchmarking. Scott	[reply]
Re: An imperfect pattern matcher writer by tommyw (Hermit) on Oct 02, 2001 at 18:23 UTC
Have you considered String::Approx instead?	[reply]
Re: Re: An imperfect pattern matcher writer by stefp (Vicar) on Oct 03, 2001 at 02:59 UTC
First let say, that String::Approx is implemented in C. Has anyone tested how String::Approx stacks up against agrep? I know they are both based on the same algorithm, the Manber-Wu k-differences algorithm shift-add (what a mouthful!). But Jarkko Hietaniemi has written the code without looking at agrep and is not bound by the agrep license (GPL). agrep is used by the glimpse search engine. -- stefp	[reply]
Re: Re: An imperfect pattern matcher writer by scain (Curate) on Oct 02, 2001 at 18:35 UTC
tommyw, Not a bad idea; I wasn't familar with String::Approx before now either. I am torn now as to whether using it would increase or decrease the level of complexity of the resulting code. Also, since it is not uncommon to be testing thousands of $seqs at a time, I would need to consider execution speed. I'll think about implementing it this way and let you know. Thanks, Scott	[reply]
YALH Re: An imperfect pattern matcher writer by jeroenes (Priest) on Oct 02, 2001 at 19:49 UTC
Yet Another Look-Here: Have you checked out Clustalw? In general, http://bio.perl.org might contain lotta handy stuff for you. Including nice BLAST interfaces etc. etc. You can also do a 'bio'-search on cpan (which is your friend, naturally). Cheers, Jeroen	[reply]
Re: YALH Re: An imperfect pattern matcher writer by scain (Curate) on Oct 02, 2001 at 20:04 UTC
jeroen, Yes, I should have mentioned bioperl in the original text. I am familar with it; we use it for some things, noteably BLAST parsing (in fact, we provided an improved version of BPlite to the author). I am also familar with clustal, but did not feel quite like the right tool for the job. Thanks for the YALH. Scott	[reply]
Re: An imperfect pattern matcher writer by Fletch (Bishop) on Oct 03, 2001 at 03:01 UTC
You might consider replacing the multiple print statements with a heredoc or two. It'll cut down on the backwhacking you've got to do. Or use qq{} rather than ""'s for strings that need to be interpolated which include double quotes. `... print FILE_OUT <<'EOT'; $seq =~ /ZZZ/) { $match = $&; EOT print FILE_OUT qq{ \$name = "$name";\n}; print FILE_OUT <<'EOT'; $matchlength = length( $match ); $firstpos = index( $seq, $match ); print "$name, $match, $matchlength, $firstpos\n"; print "Got it!\n"; } elsif { EOT ...` [download]	[reply] [d/l]

Back to Cool Uses for Perl