spring has asked for the wisdom of the Perl Monks concerning the following question:

I have a file with a short probe on each line (~10 letters long, 1 million lines), and another chromosome file which can be viewed as one huge string (30 million letters). Now I want to find the match position of each 'probe' against the 'chromosome', if it matches.
file1:
    ATCGATGTGT
    CCGATGCTGA
    ...
file2:
    CCGATGCTGAggatgtcgcgcagatttaggCCGATGCTGAgagcatgaa...
Only one hit is possible for each 'probe' string. I read the two files into the variables $chrom and @probes, respectively:
my $matched = '';
foreach (@probes) {
    chomp $_;
    if ( $chrom =~ /$_/gi ) {
        my $position = pos($chrom);
        $matched .= "$_\t$position\n";
    }
}
But using this method the script runs too slow for huge files. Is there a quicker way? Thanks. ^Appreciate your time!$

Re: pattern match, speed problem
by BrowserUk (Pope) on Feb 20, 2008 at 08:48 UTC

    A quick check estimates that it will take around 1 hour 40 minutes to search a 30MB string for 1 million 10-char strings using index.

    Instead of searching the whole string for each of the probes, it is far quicker to build a hash of the probes and then scan the string once, picking off 10-char substrings and testing to see if it exists in the hash:

    #! perl -slw
    use strict;

    use constant {
        CHROM  => 'chromosome.txt',
        PROBES => 'probes.txt',
    };

    my $chrom;
    open C, '<', CHROM or die CHROM . ": $!";
    sysread C, $chrom, -s( CHROM ) or die $!;
    close C;
    $chrom = uc $chrom;

    my %probes;
    open P, '<', PROBES or die $!;
    chomp, undef $probes{ $_ } while <P>;
    close P;

    warn time;

    my $p;
    for ( 0 .. length( $chrom ) - 10 ) {
        exists $probes{ $p = substr( $chrom, $_, 10 ) }
            and print "$p : $_";
    }

    warn time;

    On randomly generated data this method took just 71 seconds to test 30MB against 1e6 10-char probes. It required ~90MB of RAM to build the hash.

    If your probes are of different lengths, then you would need multiple substrs, or a loop from the min to the max length, but it should still show a marked improvement in performance. Update: It does. With 1e6 probes of 8 through 12 chars each, it takes almost exactly 5 times as long, for a total time of 335 seconds.
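    A minimal sketch of that min-to-max loop, assuming probe lengths of 8 through 12; the per-length nested %probes layout is my own illustration, not from the code above:

    my %probes;    # $probes{ $length }{ $probe }
    open P, '<', PROBES or die $!;
    while ( <P> ) {
        chomp;
        undef $probes{ length() }{ $_ };
    }
    close P;

    my $p;
    for my $pos ( 0 .. length( $chrom ) - 8 ) {
        for my $len ( 8 .. 12 ) {    # one hash lookup per probe length
            exists $probes{ $len }{ $p = substr( $chrom, $pos, $len ) }
                and print "$p : $pos";
        }
    }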


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: pattern match, speed problem
by hipowls (Curate) on Feb 20, 2008 at 06:42 UTC

    For matching fixed substrings, index is faster. It just requires a slight modification to the code:

    my $position = index $chrom, $_;
    if ( $position != -1 ) {
        # we matched
    }

        Or lowercase each probe string; that's probably cheaper. Of course, if the original data was all the same case, then some overhead can be avoided. If the probe strings or the large string are used multiple times, then it may be worthwhile preprocessing the data:

        perl -pe'tr/acgt/ACGT/' -i big_string_file
        or
        perl -pe'tr/ACGT/acgt/' -i probe_string_file

Re: pattern match, speed problem
by moritz (Cardinal) on Feb 20, 2008 at 10:59 UTC
    Instead of iterating over the probes, build a single regex:
    my $re = join '|', @probes;
    $re = qr{(?:$re)};    # just in case you want to use it somewhere
                          # else, and now it's compiled.

    And make sure that you use perl 5.10, which has a special optimization in the regex engine for alternations of fixed strings.

    Note that this won't find overlapping matches; if you need them, you'll have to fiddle with pos (and perhaps \G in the regex) manually.
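    A sketch of that pos fiddling, assuming overlapping matches are wanted (resetting pos to just past the start of each hit is my own choice of step):

    while ( $chrom =~ /$re/g ) {
        my $start = $-[0];            # offset where this match began
        print "$& : $start\n";
        pos( $chrom ) = $start + 1;   # resume one past the match START,
                                      # not its end, so overlaps are seen
    }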

    It will also change the order in which the matches are returned, but if you really need the original order, you can find ways around it.

      It's a seductive idea, but it doesn't seem to scale. With 1000 probes and 30MB it takes under 2 seconds. With 10,000 probes it takes 22 seconds; 11,000 - 23 secs; 12,000 - 24 secs. Everything looks good. But then with 13,000, I left it running for nearly 10 minutes and it still had not completed, and the output seemed to have reduced to a trickle.

      I also tried 15,000 and 20,000 in case I had hit some odd pathological case, but neither ever seemed to complete.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Did you try it with perl 5.10 or newer? (Sorry, I don't have one available here to test it.)

        Anyway, you could join just 1000 probes into one regex, and repeat the process until all probes have been tested. That's certainly much faster than single runs.

        Update: I thought a bit more about it, and got the following idea: if you have many probes and use a fixed number of probes per regex, you can try sorting the probes before building the buckets. That way you play to the strengths of the regex engine and its trie optimization.
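        A rough sketch of that batching-plus-sorting idea; the bucket size of 1000 comes from the suggestion above, everything else is illustrative:

        my @sorted = sort @probes;
        while ( my @bucket = splice @sorted, 0, 1000 ) {
            my $re = join '|', @bucket;
            $re = qr/(?:$re)/;
            while ( $chrom =~ /$re/g ) {    # report every hit of this bucket
                print "$& : $-[0]\n";
            }
        }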

Re: pattern match, speed problem
by stiller (Friar) on Feb 20, 2008 at 08:39 UTC
    index has been suggested. Also, if case is not significant, making both input files all upper or all lower case just once is much faster than ignoring case on every try. This problem also lends itself nicely to concurrency, if you have the hardware for it (e.g. several machines, several processors and/or cores): divide the probes into a number of partitions, and set off a process searching for each partition of the problem.
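    A hand-wavy sketch of that partitioning, assuming a hypothetical search( $chrom, \@probes ) sub that wraps one of the single-process searches above:

    my $workers = 4;    # assumed number of cores
    my @parts;
    push @{ $parts[ $_ % $workers ] }, $probes[ $_ ] for 0 .. $#probes;

    for my $part ( @parts ) {
        defined( my $pid = fork() ) or die "fork: $!";
        if ( $pid == 0 ) {            # child: search only its own partition
            search( $chrom, $part );
            exit 0;
        }
    }
    1 while waitpid( -1, 0 ) > 0;     # parent: wait for all the children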

    Also, it seems likely that somebody has already optimized this kind of search. For example, if the only letters are C, G, A and T, then your full 8-bit string takes a lot more space and time to search than if you convert it to 2 bits per base (e.g. 00 = C, 01 = G, 10 = A, 11 = T). But this is not my area...
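    For illustration, a 2-bits-per-base encoder along those lines; the C/G/A/T mapping is the one above, and pack 'B*' turns the bit string into bytes:

    my %code = ( C => '00', G => '01', A => '10', T => '11' );
    sub encode {
        my ( $seq ) = @_;
        return pack 'B*', join '', map $code{ $_ }, split //, uc $seq;
    }
    # encode( 'CGAT' ) packs four bases into the single byte 0b00011011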
    hth

Re: pattern match, speed problem
by ysth (Canon) on Feb 20, 2008 at 07:41 UTC
    Your /g flag makes each match start looking only after the point where the previous match left off. Is this intentional?
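    A tiny illustration of the effect (the example data is mine, not from the thread):

    my $chrom = 'TTTAAACCC';
    $chrom =~ /AAA/g;    # matches at offset 3; pos($chrom) is now 6
    print $chrom =~ /TTT/g ? 'found' : 'missed';
                         # prints 'missed': the second search resumed
                         # at offset 6, past the TTT at the start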
Re: pattern match, speed problem
by bobf (Monsignor) on Feb 21, 2008 at 05:09 UTC

    index has already been mentioned, but I wanted to comment on this:

    Only one hit is possible for each 'probe' string

    It is possible that the probes have already been screened for uniqueness, but based on a frequency of 0.25 for each base (on average and subject to inter-species variation, of course) a 10-mer would occur by chance once every 1,048,576 bases (4**10). Your chromosomes are 30 million bases long, so in the absence of other information I'd expect each probe to match quite a few times on each chromosome.

    That said, note that the third parameter of index sets the start position for the search. To find all matches you'll have to call index iteratively, use m//g, or (the approach I prefer, BrowserUk++) create an index of the chromosomes before searching for your 1 million probes.
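    A sketch of calling index iteratively for one probe; $probe and @positions are illustrative names:

    my @positions;
    my $from = 0;
    while ( ( my $pos = index $chrom, $probe, $from ) != -1 ) {
        push @positions, $pos;
        $from = $pos + 1;    # +1 rather than +length so overlaps are found
    }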