In bioinformatics, it is often important to determine if a
particular sequence is in part of the DNA just sequenced. However,
since DNA sequencers are imperfect devices, exact matches to
the sequence being searched for is rare. Tools for similar
sorts of problems are available, noteably BLAST,
but it can be over kill to run the BLAST, parse the results, and then
make a desision based on the results. So I wrode a script that
will take a sequence being searched for (typically 20 to 30
bases long) and write a new perl script that contains a series
of regexes that will allow exactly one mismatch, deletion or
insertion.
Here is some example input sequences:
which produces this output:>test1 ATGCATGCATGCATGC >test2 ATGCNATGCNATGCNATGCN
and finally, the code that writes the code:#!/usr/bin/perl -w $seq = $ARGV[0]; if ( # test1 $seq =~ /\w{0,2}TGCATGCATGCATGC/ or $seq =~ /A\w{0,2}GCATGCATGCATGC/ or $seq =~ /AT\w{0,2}CATGCATGCATGC/ or $seq =~ /ATG\w{0,2}ATGCATGCATGC/ or $seq =~ /ATGC\w{0,2}TGCATGCATGC/ or $seq =~ /ATGCA\w{0,2}GCATGCATGC/ or $seq =~ /ATGCAT\w{0,2}CATGCATGC/ or $seq =~ /ATGCATG\w{0,2}ATGCATGC/ or $seq =~ /ATGCATGC\w{0,2}TGCATGC/ or $seq =~ /ATGCATGCA\w{0,2}GCATGC/ or $seq =~ /ATGCATGCAT\w{0,2}CATGC/ or $seq =~ /ATGCATGCATG\w{0,2}ATGC/ or $seq =~ /ATGCATGCATGC\w{0,2}TGC/ or $seq =~ /ATGCATGCATGCA\w{0,2}GC/ or $seq =~ /ATGCATGCATGCAT\w{0,2}C/ or $seq =~ /ATGCATGCATGCATG\w{0,2}/ or $seq =~ /ATGCATGCATGCATGC\w{0,2}/ or $seq =~ /ZZZ/) { $match=$&; $name = "test1"; $matchlength=length($match); $firstpos=index($seq,$match); print "$name, $match, $matchlength, $firstpos \n"; print "Got it!\n"; } elsif ( # test2 $seq =~ /\w{0,2}TGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /A\w{0,2}GC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /AT\w{0,2}C[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATG\w{0,2}[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC\w{0,2}ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]\w{0,2}TGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]A\w{0,2}GC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]AT\w{0,2}C[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATG\w{0,2}[ATGCN]ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC\w{0,2}ATGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]\w{0,2}TGC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]A\w{0,2}GC[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]AT\w{0,2}C[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATG\w{0,2}[ATGCN]ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC\w{0,2}ATGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]\w{0,2}TGC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]A\w{0,2}GC[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]AT\w{0,2}C[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATG\w{0,2}[ATGCN]/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC\w{0,2}/ or $seq =~ /ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]ATGC[ATGCN]\w{0,2}/ or $seq =~ /ZZZ/) { $match=$&; $name = "test2"; $matchlength=length($match); $firstpos=index($seq,$match); print "$name, $match, $matchlength, $firstpos \n"; print "Got it!\n"; } elsif ( "false"){ }
#!/usr/bin/perl -w use strict; my $FILE_IN = $ARGV[0]; my $FILE_OUT = "patternmatcher.pl"; open FILE_IN, $FILE_IN; open FILE_OUT, ">$FILE_OUT"; print FILE_OUT "\#!/usr/bin/perl -w\n"; print FILE_OUT "\$seq = \$ARGV[0]\;\n"; print FILE_OUT "if (\n"; my $wild = "\\w\{0,2\}"; my $onN = "X"; my $onN2 = "\[ATGCN\]"; my $name; while (<FILE_IN>) { if ( $_ =~ /^\>(.+)$/ ) { $name = $1; } elsif ( $_ =~ /^(\w+)$/ ) { my $seq = $1; my $len = length $seq; print FILE_OUT "\# $name\n"; for my $i ( 0 .. $len ) { #go through $seq base by base, repla +cing one character with the wild card my $newseq = $seq; substr( $newseq, $i, 1 ) = $wild; $newseq =~ s/N/$onN/go; # these to lines are to allow Ns +to match any character $newseq =~ s/X/$onN2/go; # print FILE_OUT " \$seq =~ \/$newseq\/ or \n"; } print FILE_OUT " \$seq =~ \/ZZZ\/) \{\n"; #impossible match t +o close the if print FILE_OUT " \$match=\$\&\;\n"; print FILE_OUT " \$name = \"$name\"\;\n"; print FILE_OUT " \$matchlength=length\(\$match\)\;\n"; print FILE_OUT " \$firstpos=index\(\$seq,\$match)\;\n"; print FILE_OUT " print \"\$name, \$match, \$matchlength, \$firstpos \\n\ +"\;\n"; print FILE_OUT " print \"Got it!\\n\"\;\n"; print FILE_OUT "\} elsif \(\n"; } } print FILE_OUT "\"false\"\)\{\n"; # false value to end the series of e +lsifs print FILE_OUT "\}\n"; close FILE_IN; close FILE_OUT; system( "chmod", "u+x", $FILE_OUT );
|
---|
Replies are listed 'Best First'. | |
---|---|
Re: An imperfect pattern matcher writer
by pjf (Curate) on Oct 02, 2001 at 17:58 UTC | |
by scain (Curate) on Oct 02, 2001 at 18:07 UTC | |
Re: An imperfect pattern matcher writer
by tommyw (Hermit) on Oct 02, 2001 at 18:23 UTC | |
by stefp (Vicar) on Oct 03, 2001 at 02:59 UTC | |
by scain (Curate) on Oct 02, 2001 at 18:35 UTC | |
YALH Re: An imperfect pattern matcher writer
by jeroenes (Priest) on Oct 02, 2001 at 19:49 UTC | |
by scain (Curate) on Oct 02, 2001 at 20:04 UTC | |
Re: An imperfect pattern matcher writer
by Fletch (Bishop) on Oct 03, 2001 at 03:01 UTC |
Back to
Cool Uses for Perl