http://www.perlmonks.org?node_id=976294


in reply to Re^2: counting the number of 16384 pattern matches in a large DNA sequence
in thread counting the number of 16384 pattern matches in a large DNA sequence

One of the inputs is a file of 16384 strings, given in the example format below:

Given a string of 7 fasta characters, you can form 4^7 = 16384 unique strings from T, A, G, C. So his algorithm checks every possible 7-character fasta string against the multifasta file (the multifasta file has 61913917 lines).
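As a quick sanity check of that 4^7 arithmetic (a side illustration, not code from the thread), Perl's glob can enumerate every such string via its csh-style brace expansion:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# glob's brace expansion turns '{A,C,G,T}' repeated 7 times into
# every possible 7-character string over the alphabet A, C, G, T.
my @kmers = glob '{A,C,G,T}' x 7;
printf "%d patterns\n", scalar @kmers;   # prints "16384 patterns"
```

This is also a convenient way to build the pattern file itself, rather than listing the 16384 strings by hand.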


Re^4: counting the number of 16384 pattern matches in a large DNA sequence
by aaron_baugher (Curate) on Jun 15, 2012 at 05:27 UTC

    In that case, it seems like a fairly simple regex should do it. I'd think one regex would be faster than using index to check 16384 different substrings, but a benchmark would tell for sure. I'm also not sure why he's reading the entire huge file into a hash; it seems like that could be pushing him into swap, slowing things down severely. Unless I'm missing something, it seems like this would work (as long as it's not necessary to count overlapping matches):

    #!/usr/bin/env perl
    use Modern::Perl;

    my %c;
    while(<DATA>){
        chomp;
        while(/([ACGT]{7})/g){
            $c{$1}++;
        }
    }
    say "$_ : $c{$_}" for sort keys %c;

    __DATA__
    NNNAGTACANNNNTAGCNNNNNNAGGTNNNNNAATCCGATNNNNNNTAGGGGGGTTTAAANNNNN
    NNNAGTCCCACANNNNTAAAAGCNNNNNNAGGTNNNNNAATCCGATNNNNNNTAGGGGGGTTTAAANNNNN
    NNNAGTACANNNNTAGCNNNNNNAGGTNNNNNAATCCGATNNNNNNTAGGGGGGTTTAAANNNNN

    Aaron B.
    Available for small or large Perl jobs; see my home node.

      Unless I'm missing something, ... (as long as it's not necessary to match overlapping matches):

      You hit the nail on the head. You'll only match 5,015,229 times, whereas the OP's code matches 35,106,546 times.

      However, with a modification to your regex, you can avoid that problem and find overlapping matches:

      ++$index{ $1 } while $$rSeq =~ m[(?=([ACGT]{7}))]g;
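      On a short string, the difference between the plain match and the lookahead version is easy to see (a toy illustration, not the thread's benchmark):

```perl
use strict;
use warnings;

my $seq = 'ACGTACGTACGT';   # 12 chars => 6 overlapping 7-mers

# Plain match: consumes 7 characters per match, so overlaps are lost.
my $plain = () = $seq =~ /[ACGT]{7}/g;

# Zero-width lookahead: captures the 7-mer without consuming it,
# so the engine advances one character and finds every overlap.
my $overlap = () = $seq =~ /(?=([ACGT]{7}))/g;

print "plain: $plain, overlapping: $overlap\n";   # plain: 1, overlapping: 6
```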

      But it is still much slower than avoiding the regex engine completely:

      [ 6:55:26.00] C:\test\humanGenome>..\976237 chr21.fa 16384
      Using custom indexing found 35106546 matches; took 31.611258 seconds
      Using custom index2 found 35106546 matches; took 27.504099 seconds
      Using custom index3 found 35106546 matches; took 27.571143 seconds
      Using quantified charclass lookahead found 35106546 matches; took 49.954810 seconds

      But ++ for thinking outside the box. (I can't believe I actually used that phrase :)
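      The "custom indexing" code itself isn't shown in this post; a plausible regex-free sketch of the idea, sliding a 7-wide substr window and skipping windows that contain non-ACGT characters (hypothetical names, not BrowserUk's actual code), would be:

```perl
use strict;
use warnings;

# Hypothetical regex-free counter: substr gives each 7-char window,
# and tr/ACGT//c counts characters outside the ACGT alphabet, so any
# window containing an N (etc.) is skipped.
sub count_kmers {
    my ($seq) = @_;
    my %count;
    for my $i (0 .. length($seq) - 7) {
        my $k = substr $seq, $i, 7;
        next if $k =~ tr/ACGT//c;   # window contains a non-ACGT char
        ++$count{$k};
    }
    return \%count;
}

my $c = count_kmers('NNNACGTACGTNNN');
print "$_ => $c->{$_}\n" for sort keys %$c;
# ACGTACG => 1
# CGTACGT => 1
```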


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        an untested variation:
        while(/([ACGT]{7,})/g) {
            for my $ix (0 .. length($1) - 7) {
                ++$index{ substr($1, $ix, 7) }
            }
        }
        This regular expression should process every character of the string just once, and so be an order of magnitude faster than yours, which tries to match the look-ahead pattern at every char.

        But that is just guessing... could you benchmark it?
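        An untested sketch of such a benchmark, comparing the lookahead against the run-splitting variation on synthetic data with Benchmark's cmpthese (the data and iteration count are arbitrary, and the numbers will of course differ from the chr21 run above):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic sequence mixing ACGT with N, so both the lookahead and the
# run-splitting approach have real work to do.
my $seq = join '', map { (qw(A C G T N))[int rand 5] } 1 .. 50_000;

sub by_lookahead {
    my %c;
    ++$c{$1} while $seq =~ /(?=([ACGT]{7}))/g;
    return \%c;
}

sub by_runs {
    my %c;
    while ($seq =~ /([ACGT]{7,})/g) {
        my $run = $1;
        ++$c{ substr $run, $_, 7 } for 0 .. length($run) - 7;
    }
    return \%c;
}

cmpthese(20, { lookahead => \&by_lookahead, runs => \&by_runs });
```

Whatever the timings, the two approaches should agree on the counts, since every overlapping 7-mer of ACGT characters lies inside exactly one maximal [ACGT]{7,} run.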

        Funny, I thought I was getting back inside the box by using a simple regex. :-)

        I've never been totally comfortable with zero-width assertions, because logically it seems to me that they should keep matching in the same spot. If zero-width really means zero-width, the next match should start at the same point and match the same string again, shouldn't it? I guess the semantics of it just bug me. I feel like I should have to use something like this to make sure it advances a character:

        m[([ACGT](?=[ACGT]{6}))];
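        In fact the engine guards against exactly that worry: after a zero-length match with //g, Perl refuses to match zero-width at the same position again and forces pos() forward by one. A small demonstration:

```perl
use strict;
use warnings;

# After each zero-width match, pos() equals the match's start (the
# match consumed nothing), yet the next //g attempt is forced one
# character further, so the loop terminates.
my $seq = 'ACGTACGTA';   # 9 chars => 3 overlapping 7-mers
my @positions;
while ($seq =~ /(?=([ACGT]{7}))/g) {
    push @positions, pos($seq);
}
print "@positions\n";   # 0 1 2
```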

        Anyway, I'm impressed that the regex does as well as it does, considering how simple it was to implement.

        Aaron B.
        Available for small or large Perl jobs; see my home node.