http://www.perlmonks.org?node_id=1199101

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks!
I am seeking your help in the following problem of mine when playing with some DNA sequences:
Say you have 2 strings:
$small_str='AAATTGGTGTATATGAAAGACCTCGACGCTATTTAGAAAGAGAGAGCAATATTTCAAG +AATGCATGCGTCAATTTTACGCAGACTATCTTTCTAGGGTTAAATATACTGACAGTGTGCAGTGACTCA +CAAAAGATGATTA'; $reference_str = 'ACAATGAGATCACATGGACACAGGAAGGGGAATATCACACTCTGGGGACTGT +GGTGGGGTCGGGGGAGGGGGGAGGGATAGCATTGGGAGATATACCTAATGCTAGATGACGTCCATACTG +AGAATCATGTTAACATTAGTGGGTGCAGCGCACAAGCATGGCACATGTATACATATGTAACTAACCTGC +ACAATGTGCACATGTACCCTAAAACTTAGAGTATAATAAAAAAAAAAAAAAAAAAAAAAAAAAACACAT +TAAAAAAAAAAAAAACAACAAAACAAAGCAAACATGGAAATGTTTGTTATTTTAATTGTTATGATGGTT +TCATGGCTGTTTGCATGTGTCAAAACTCATCAAATTTGTGTACGTTAAATATGTGAAACTTATTGTATG +CTGGTTACACCTCAATAAAGCTGTTAAATTTAAAAAAAAAAAAAAAAAAAAAAATCACCAATAGTTGCT +GCTAGAAATCCAGTGTCACAAAAGGCCAAAGTTTATTGACAAATTGGTGTATATGAAAGACCTCGACGC +TATTTAGAAAGAGAGAGCAATATTTCAAGAATGCATGCGTCAATTTTACGCAGACTATCTTTCTAGGGT +TAATCTAGCTGCATCAGGATCATATCGTCGGGTCTTTTTTCCGGCTCAGTCATCGCCCAAGCTGGCGCT +ATCTGGGCATCGGGGAGGAAGAAGCCCGTGCCTTTTCCCGCGAGGTTGAAGCGGCATGGAAAGAGTTTG +CCGAGGATGACTGCTGCTGCATTGACGTTGAGCGAAAACGCACGTTTACCATGATGATTCGGGAAGGTG +TGGCCATGCACGCCTTTAACGGTGAACTGTTCGTTCAGGCCACCTGGGATACCAGTTCGTCGCGGCTTT +TCCGGACACAGTTCCGGATGGTCAGCCCGAAGCGCATCAGCAACCCGAACAATACCGGCGACAGCCGGA +ACTGCCGTGCCGGTGTGCAGATTAATGACAGCGGTGCGGCGCTGGGATATTACGTCAGCGAGGACGGGT +ATCCTGGCTGGATGCCGCAGAAATGGACATGGATACCCCGTGAGTTACCCGGCGGGCGCGCTTGGCGTA +ATCATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACAACATACGAGCCGG +AAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCACATTAATTGCGTTGCGCTCACT +GCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGCATTAATGAATCGGCCAACGCGCGGGGAGAGG +CGGTTTGCGTATTGGGCGCTCTTCCGCTTCCTCGCTCACTGACTCGCTGCGCTCGGTCGTTCGGCTGCG +GCGAGCGGTATCAGCTCACTCAAAGGCGGTAATACGGTTATCCACAGAATCAGGGGATAACGCAGGAAA +GAACATGTGAGCAAAAGGCCAGCAAAAGGCCAGGAACCGTAAAAAGGCCGCGTTGCTGGCGTTTTTCCA +TAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCTCAAGTCAGAGGTGGCGAAACCCGACAGG +ACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCGTGCGCTCTCCTGTTCCGACCCTGCCGCT +TACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCGCTTTCTCATAGCTCACGCTGTAGGTA +TCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCACGAACCCCCCGTTCAGCCCGACCG +CTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAGACACGACTTATCGCCACTGGCAGC +AGCCACTGGTAACAGGATTAGCAGAGCGAGGTATGTAGGCGGTGCTACAGAGTTCTTGAAGTGGTGGCC +TAACTACGGCTACACTAGAAGGACAGTATTTGGTATCTGCGCTCTGCTGAAGCCAGTTACCTTCGGAAA +AAGAGTTGGTAGCTCTTGATCCGGCAAACAAACCACCGCTGGTAGCGGTGGTTTTTTTGTTTGCAAGCA +GCAGATTACGCGCAGAAAAAAAGGATCTCAAGAAGATCCTTTGATCTTTTCTACGGGGTCTGACGCTCA +GTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAGGATCTTCACCTAGATCCT +TTTAAATTAAAAATGAAGTTTTAAATCAATCTAAAGTATATATGAGTAAACTTGGTCTGACAGTTACCA +ATGCTTAATCAGTGAGGCACCTATCTCAGCGATCTGTCTATTTCGTTCATCCATAGTTGCCTGACTCCC +CGTCGTGTAGATAACTACGATACGGGAGGGCTTACCATCTGGCCCCAGTGCTGCAATGATACCGCGAGA +CCCACGCTCACCGGCTCCAGATTTATCAGCAATAAACCAGCCAGCCGGAAGGGCCGAGCGCAGAAGTGG +TCCTGCAACTTTATCCGCCTCCATCCAGTCTATTAATTGTTGCCGGGAAGCTAGAGTAAGTAGTTCGCC +AGTTAATAGTTTGCGCAACGTTGTTGCCATTGCTACAGGCATCGTGGTGTCACGCTCGTCGTTTGGTAT +GGCTTCATTCAGCTCCGGTTCCCAACGATCAAGGCGAGTTACATGATCCCCCATGTTGTGCAAAAAAGC +GGTTAGCTCCTTCGGTCCTCCGATCGTTGTCAGAAGTAAGTTGGCCGCAGTGTTATCACTCATGGTTAT +GGCAGCACTGCATAATTCTCTTACTGTCATGCCATCCGTAAGATGCTTTTCTGTGACTGGTGAGTACTC +AACCAAGTCATTCTGAGAATAGTGTATGCGGCGACCGAGTTGCTCTTGCCCGGCGTCAATACGGGATAA +TACCGCGCCACATAGCAGAACTTTAAAAGTGCTCATCATTGGAAAACGTTCTTCGGGGCGAAAACTCTC +AAGGATCTTACCGCTGTTGAGATCCAGTTCGATGTAACCCACTCGTGCACCCAACTGATCTTCAGCATC +TTTTACTTTCACCAGCGTTTCTGGGTGAGCAAAAACAGGAAGGCAAAATGCCGCAAAAAAGGGAAAAGG +GCGACACGGAAATGTTGAATACTCAT';

What I am supposed to fix in my pattern matching is that I am looking whether the small string is found, with a stretch of at least 10 consecutive characters, but 1 mismatch is allowed.
So, my problems are two:
1. How do you define "at least 10 characters of a substring"
2. How do you allow that "one of these 10 can not be an exact match"
Would I , for instance, start like this?
if($reference_str=~/$small_str{10,}/)

and then I would need to account for a mismatch within this "at least 10 matches" Thank you!

Replies are listed 'Best First'.
Re: How do you match a stretch of at least N characters
by choroba (Cardinal) on Sep 11, 2017 at 14:02 UTC
    Iterate over the possible positions in the longer string. Use XOR to get a string that contains zero bytes in positions where two strings are the same. Then just use tr to count the number of zeroes in substrings of length 10, report the positions if the number is 9 or 10.
    #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; my $small_str = 'AAATTGGTGTATATGAAAGACCTCGACGCTATTTAGAAAGAGAGAGCAATATT +TCAAGAAT' . 'GCATGCGTCAATTTTACGCAGACTATCTTTCTAGGGTTAAATATACTGACAGTGTGCAGTGAC +TCACAAAA' . 'GATGATTA'; my $reference_str = 'ACAATGAGATCACATGGACACAGGAAGGGGAATATCACACTCTGGGGAC +TGTGGTGG' . 'GGTCGGGGGAGGGGGGAGGGATAGCATTGGGAGATATACCTAATGCTAGATGACGTCCATACT +GAGAATCA' . 'TGTTAACATTAGTGGGTGCAGCGCACAAGCATGGCACATGTATACATATGTAACTAACCTGCA +CAATGTGC' . 'ACATGTACCCTAAAACTTAGAGTATAATAAAAAAAAAAAAAAAAAAAAAAAAAAACACATTAA +AAAAAAAA' . 'AAAACAACAAAACAAAGCAAACATGGAAATGTTTGTTATTTTAATTGTTATGATGGTTTCATG +GCTGTTTG' . 'CATGTGTCAAAACTCATCAAATTTGTGTACGTTAAATATGTGAAACTTATTGTATGCTGGTTA +CACCTCAA' . 'TAAAGCTGTTAAATTTAAAAAAAAAAAAAAAAAAAAAAATCACCAATAGTTGCTGCTAGAAAT +CCAGTGTC' . 'ACAAAAGGCCAAAGTTTATTGACAAATTGGTGTATATGAAAGACCTCGACGCTATTTAGAAAG +AGAGAGCA' . 'ATATTTCAAGAATGCATGCGTCAATTTTACGCAGACTATCTTTCTAGGGTTAATCTAGCTGCA +TCAGGATC' . 'ATATCGTCGGGTCTTTTTTCCGGCTCAGTCATCGCCCAAGCTGGCGCTATCTGGGCATCGGGG +AGGAAGAA' . 'GCCCGTGCCTTTTCCCGCGAGGTTGAAGCGGCATGGAAAGAGTTTGCCGAGGATGACTGCTGC +TGCATTGA' . 'CGTTGAGCGAAAACGCACGTTTACCATGATGATTCGGGAAGGTGTGGCCATGCACGCCTTTAA +CGGTGAAC' . 'TGTTCGTTCAGGCCACCTGGGATACCAGTTCGTCGCGGCTTTTCCGGACACAGTTCCGGATGG +TCAGCCCG' . 'AAGCGCATCAGCAACCCGAACAATACCGGCGACAGCCGGAACTGCCGTGCCGGTGTGCAGATT +AATGACAG' . 'CGGTGCGGCGCTGGGATATTACGTCAGCGAGGACGGGTATCCTGGCTGGATGCCGCAGAAATG +GACATGGA' . 'TACCCCGTGAGTTACCCGGCGGGCGCGCTTGGCGTAATCATGGTCATAGCTGTTTCCTGTGTG +AAATTGTT' . 'ATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCT +AATGAGTG' . 'AGCTAACTCACATTAATTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGC +CAGCTGCA' . 'TTAATGAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGCTCTTCCGCTTCCTC +GCTCACTG' . 'ACTCGCTGCGCTCGGTCGTTCGGCTGCGGCGAGCGGTATCAGCTCACTCAAAGGCGGTAATAC +GGTTATCC' . 'ACAGAATCAGGGGATAACGCAGGAAAGAACATGTGAGCAAAAGGCCAGCAAAAGGCCAGGAAC +CGTAAAAA' . 'GGCCGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACG +CTCAAGTC' . 'AGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCG +TGCGCTCT' . 'CCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCG +CTTTCTCA' . 'TAGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCA +CGAACCCC' . 'CCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAGAC +ACGACTTA' . 'TCGCCACTGGCAGCAGCCACTGGTAACAGGATTAGCAGAGCGAGGTATGTAGGCGGTGCTACA +GAGTTCTT' . 'GAAGTGGTGGCCTAACTACGGCTACACTAGAAGGACAGTATTTGGTATCTGCGCTCTGCTGAA +GCCAGTTA' . 'CCTTCGGAAAAAGAGTTGGTAGCTCTTGATCCGGCAAACAAACCACCGCTGGTAGCGGTGGTT +TTTTTGTT' . 'TGCAAGCAGCAGATTACGCGCAGAAAAAAAGGATCTCAAGAAGATCCTTTGATCTTTTCTACG +GGGTCTGA' . 'CGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAGGATCTT +CACCTAGA' . 'TCCTTTTAAATTAAAAATGAAGTTTTAAATCAATCTAAAGTATATATGAGTAAACTTGGTCTG +ACAGTTAC' . 'CAATGCTTAATCAGTGAGGCACCTATCTCAGCGATCTGTCTATTTCGTTCATCCATAGTTGCC +TGACTCCC' . 'CGTCGTGTAGATAACTACGATACGGGAGGGCTTACCATCTGGCCCCAGTGCTGCAATGATACC +GCGAGACC' . 'CACGCTCACCGGCTCCAGATTTATCAGCAATAAACCAGCCAGCCGGAAGGGCCGAGCGCAGAA +GTGGTCCT' . 'GCAACTTTATCCGCCTCCATCCAGTCTATTAATTGTTGCCGGGAAGCTAGAGTAAGTAGTTCG +CCAGTTAA' . 'TAGTTTGCGCAACGTTGTTGCCATTGCTACAGGCATCGTGGTGTCACGCTCGTCGTTTGGTAT +GGCTTCAT' . 'TCAGCTCCGGTTCCCAACGATCAAGGCGAGTTACATGATCCCCCATGTTGTGCAAAAAAGCGG +TTAGCTCC' . 'TTCGGTCCTCCGATCGTTGTCAGAAGTAAGTTGGCCGCAGTGTTATCACTCATGGTTATGGCA +GCACTGCA' . 'TAATTCTCTTACTGTCATGCCATCCGTAAGATGCTTTTCTGTGACTGGTGAGTACTCAACCAA +GTCATTCT' . 'GAGAATAGTGTATGCGGCGACCGAGTTGCTCTTGCCCGGCGTCAATACGGGATAATACCGCGC +CACATAGC' . 'AGAACTTTAAAAGTGCTCATCATTGGAAAACGTTCTTCGGGGCGAAAACTCTCAAGGATCTTA +CCGCTGTT' . 'GAGATCCAGTTCGATGTAACCCACTCGTGCACCCAACTGATCTTCAGCATCTTTTACTTTCAC +CAGCGTTT' . 'CTGGGTGAGCAAAAACAGGAAGGCAAAATGCCGCAAAAAAGGGAAAAGGGCGACACGGAAATG +TTGAATAC' . 'TCAT'; for my $start_pos (0 .. length($reference_str) - 11) { my $xor = $small_str ^ substr $reference_str, $start_pos, length($ +reference_str) - 1; for my $inner_pos (0 .. length($xor) - 10) { if (9 <= substr($xor, $inner_pos, 10) =~ tr/\0//) { say $start_pos + $inner_pos; say substr $small_str, $inner_pos, 10; say substr $reference_str, $start_pos + $inner_pos, 10; } } }

    TODO: Handle matches where the two strings don't fully overlap.

    Update Probably like this? It wraps the longer string in characters that can never match, so some simple math is needed to get the original position back:

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      Choroba, thank you very much for the code. I am bit confused, is the second part an addition to the previous one? Because - I think - I get only exact matches with the second part, or not?
        No, it's a replacement. It reports 107 matches for me, 92 of them are exact.

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: How do you match a stretch of at least N characters
by hippo (Bishop) on Sep 11, 2017 at 14:14 UTC

    Let's see here. Your small string has 140 characters and your reference string has 3114. If you were just comparing exactly 10-character substrings that's 131x3105 which is 406755 comparisons. If you then make that >10 character substrings it becomes a lot heavier. eg. for 11, it's 130x3104 which is 403520 to be added on to your original 406755. Build that all the way up to 140 and it comes out as 26,471,170. Then you have the possibility that any one of those characters could be a non-match, so for a substring of length n that's n more matches. That gives a total by my calculations of 1,403,490,770 - and this is just the comparisons, nevermind all the slicing and dicing to achieve them.

    Does your task have significant time constraints? If so, you'll have to start looking into some algorithm references to be a lot smarter about this, I fear.

    Would I , for instance, start like this? if($reference_str=~/$small_str{10,}/)

    No, that would be 10 or more repititions of the entirety of $small_str with 10 or more instances of its last character at the end. (thanks haukex for pointing out my initial error).

Re: How do you match a stretch of at least N characters
by Eily (Monsignor) on Sep 11, 2017 at 14:13 UTC

    One way you could to it is use the ^ operator. When applied on strings, it will output 0 wherever the two are equal, and something else wherever they are different. That way searching for a series of identical characters becomes searching for a series of 0.

    for my $position (0..(length($reference_str)-length($small_str))) { my $processed_str = $small_str ^ substr($reference_str, $position, l +ength($small_str)); next unless $processed_str =~ m<( \0{9} | \0{1} . \0{8,} | \0{2} . \0{7,} | \0{3} . \0{6,} | \0{4} . \0{5,} | \0{5} . \0{4,} | \0{6} . \0{3,} | \0{7} . \0{2,} | \0{8} . \0{1,} )>xms; print("Match found! ", substr($reference_str, $position+pos($proces +sed_str),length($1))) and last; }
    Untested :)

Re: How do you match a stretch of at least N characters
by stall (Novice) on Sep 12, 2017 at 07:55 UTC
Re: How do you match a stretch of at least N characters
by QM (Parson) on Sep 12, 2017 at 08:02 UTC
    Just thought I'd have a go at it myself, using regex instead of the (obviously better) XOR approach.

    My rough check, with 1 mismatch, works out to the same order of magnitude as choroba's first try, but maybe 2x slower. With more mismatches, my solution goes exponential creating regexes, but should take a bit less time in the matching phase. (Hand-wavy arguments go here.)

    Code:

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: How do you match a stretch of at least N characters
by 1nickt (Canon) on Sep 11, 2017 at 13:59 UTC

    How can a string simultaneously both match and be a mismatch? I suggest using more than one step/regex.


    The way forward always starts with a minimal test.
      That is what I am a bit confused with. A colleague asked me if I can do that and I am still trying to figure out how this can be possible...
      Ok, but how about that : how would one go in order to find all possible k-mers of the small substring that match the big one and then maybe see if they are some that are only separated by 1? Would that make sense? If yes, how do you do it? Split the small string and search it character-by-character?
Re: How do you match a stretch of at least N characters
by Anonymous Monk on Sep 11, 2017 at 17:52 UTC
    Do not waste your own time. Search CPAN for Bio::DB ...

      Thanks Dr Wisenheimer.

      «The Crux of the Biscuit is the Apostrophe»

      perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

        That phrasing may not have been overly helpful... but it's not a bad idea. The SO thread that poj mentioned in the similar Re: Longest common substring with N mismatches points out Bio::Grep::Backend::Agrep, whose synopsis reads to me like it probably does what the OP wants. The parent module Bio::Grep is what actually gets use'd, and lists some of the abilities of the various back-end modules, so the OP might want to check out that page as well for other pertinent information.

Re: How do you match a stretch of at least N characters
by paisani (Acolyte) on Sep 12, 2017 at 19:48 UTC
    Added comments...
    ## Travel small_str along the length of reference_str - ## 107 matches, 92 exact in 3 seconds my $match=0; for my $i (0..length($reference_str)) { for my $j (0..length($small_str)) { my $ref = substr($reference_str, $i, 10); my @ref_arr = split('' +, $ref); my $small = substr($small_str, $j, 10); my @small_arr = split('' +, $small); ## Don't bother if you have less than (9 or) 10 left ;-) last if length($ref) < 10 or length($small) < 10; if ( equal(\@ref_arr, \@small_arr) ) { $match++; print "Found match $match: @ref_arr, @small_arr\n"; } } } print "Final tally - $match\n"; exit(0); sub equal { my ($first, $second) = @_; my $mismatch=0; foreach my $i (0..9) { $mismatch++ if $first->[$i] ne $second->[$i]; return undef if $mismatch > 1; # Bail if you have more than 1 mismat +ch in the sequence } return 1; }
Re: How do you match a stretch of at least N characters
by ikegami (Patriarch) on Sep 11, 2017 at 21:45 UTC

    Original post deleted. What I posted was completely wrong.

      Original post deleted. What I posted was completely wrong.

      You're supposed to <strike></strike> out the wrong answer not delete

        Is this what you were looking for? The longest matching DNA string? You can save the max length found below to get your result.

        Answer:

        Found match of length (102) at position(506, 0):

        AAATTGGTGTATATGAAAGACCTCGACGCTATTTAGAAAGAGAGAGCAATATTTCAAGAATGCATGCGTCAATTTTACGCAGACTATCTTTCTAGGGTTAAT
        AAATTGGTGTATATGAAAGACCTCGACGCTATTTAGAAAGAGAGAGCAATATTTCAAGAATGCATGCGTCAATTTTACGCAGACTATCTTTCTAGGGTTAAA

        my $i=-1; while(++$i < length($reference_str)) { my $j=-1; while(++$j < length($small_str)) { my $length = 9; my $match=undef; while ( $i+$length < length($reference_str) and $j+$length < length($small_str) and compareSnippet($i, $j, ++$length) ) { $match++; } if ( $match) { $length--; print "Found match of length ($length) at position($i, $j): +\n"; print substr($reference_str, $i, $length) . "\n"; print substr($small_str, $j, $length). "\n"; $i+= $length; $j+= $length; next; } } } sub compareSnippet { my ( $pos_i, $pos_j, $length) = @_; my $ref = substr($reference_str, $pos_i, $length); my @ref_arr = sp +lit('', $ref); my $small = substr($small_str, $pos_j, $length); my @small_arr = sp +lit('', $small); return undef if length($ref) < $length or length($small) < $length; return equal(\@ref_arr, \@small_arr); } sub equal { my ($first, $second) = @_; my $mismatch=0; foreach my $i (0..(@$first-1)) { $mismatch++ if $first->[$i] ne $second->[$i]; return undef if $mismatch > 1; } # print "Woo! (@$first, @$second)\n" if $mismatch == 0; return 1; }
Re: How do you match a stretch of at least N characters
by sundialsvc4 (Abbot) on Sep 12, 2017 at 14:44 UTC