Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Multiple Regex's on a Big Sequence

by bernanke01 (Beadle)
on Aug 16, 2006 at 06:39 UTC ( #567611=perlquestion: print w/replies, xml ) Need Help??

bernanke01 has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm just writing a script that will perform a number of different, very simple regular expressions on a large (up to 300 million characters) string. The searches are really simple, things like:

while ( $sequence =~ /CACGTG/g ) { print join("\t", $chr, '+', pos($sequence)-6), "\n"; } while ( $sequence =~ /GTGCAC/g ) { print join("\t", $chr, '-', pos($sequence)-6), "\n"; }

I have two questions about this. First, am I right to be doing these searches sequentially? Or ought I be trying to combine them all into one single regular expression? I chose the sequential approach because it seemed far more readable and easily scalable, but I didn't know if there was any reason to believe it was horribly inefficient or silly.

Second, is my use of pos($sequence) - match_length to find the starting point of the match appropriate, or is there a simpler way of getting the match start?

Many thanks in advance for any thoughts/advice/comments.

Replies are listed 'Best First'.
Re: Multiple Regex's on a Big Sequence
by ikegami (Pope) on Aug 16, 2006 at 06:47 UTC

    index is usually faster at searching for a constant string than a regexp, but I haven't timed /g against its index equivalent. Here's the (untested) code:

    local $, = "\t"; local $\ = "\n"; foreach my $short (qw( CACGTG GTGCAC )) { my $pos = 0; my $len = length($short); while (($pos = index($sequence, $short, $pos)) >= 0) { print $chr, '+', substr($sequence, $pos, $len); $pos += $len; } }

    As an added bonus, replace $pos += $len; with $pos++; to allow overlapping matches.

    If you do use regexps, @- and @+ can be used instead of pos. Specifically, substr($sequence, $-[0], $+[0] - $-[0]) will return the matched text. See perlvar.

      Ahh, I never even thought of index. And it has the nice advantage of being parallelizable across CPUs just like a regex. I'll benchmark the three approaches (index, regex, and combined regex's) and report back.
Re: Multiple Regex's on a Big Sequence
by borisz (Canon) on Aug 16, 2006 at 07:00 UTC
    Combiming the patterns looks easy, And if you need to run diffence code for differnt patterns put them in a hash with the pattern as the key and the coderef as the value.
    use Regexp::Assemble; my @search_patterns = ( 'CACGTG', 'GTGCAC' ); my $ra = Regexp::Assemble->new; $ra->add( @search_patterns ); while ( $seq =~ /$ra/g ) { ... }
    Boris
      Ahh, never heard of this module, that's cool! The one thing I should have mentioned is that I'm on a multi-processor system so a reg-ex or index approach might be more parallelizable. I think I should benchmark all the options though.
Re: Multiple Regex's on a Big Sequence
by lima1 (Curate) on Aug 16, 2006 at 12:35 UTC
    do you want to search for (nearly) exact matches in a chromosom? Then you really should use a suffix array implementation. It is O(m + log(n)) where m is query size and n is sequence size. So probably MUCH faster than your solution. /msg me if you are interested and I send you my mature vmatch module.
Re: Multiple Regex's on a Big Sequence - Benchmark
by bernanke01 (Beadle) on Aug 16, 2006 at 22:55 UTC

    I've done some benchmarking of the three options (not including lima1's suggestion of suffix-trees): sequential regular expressions, sequential index's, and merging the regular expressions together using Regexp::Assemble. I'm not great at writing benchmarks, so perhaps this script is simplistic, and I'd appreciate comments on how to improve it. Nevertheless, here are the results:

    Regexp : 230 wallclock secs (229.86 usr + 0.06 sys = 229.92 CPU) Index : 22 wallclock secs (21.47 usr + 0.01 sys = 21.48 CPU) Merged : 663 wallclock secs (663.73 usr + 0.17 sys = 663.90 CPU)

    So it appears that even without forking the indexing approach is much faster for exact matches. Here's the benchmarking code I used.

    use strict; use Bio::SeqIO; use Benchmark; use Regexp::Assemble; use Carp; my $seqio = Bio::SeqIO->new( -file => "chrY.fa.masked", -format => "fasta" ); my $seq = $seqio->next_seq(); my $sequence = $seq->seq(); my @strings = ( 'CACGTG', 'GTGCAC' ); my $t0 = new Benchmark; for (1..100) { foreach my $string (@strings) { my $len = length($string); while ($sequence =~ /$string/g) { my $val = pos($seque +nce) - $len; } } } my $t1 = new Benchmark; my $t2 = new Benchmark; for (1..100) { foreach my $string (@strings) { my $pos = 0; my $len = length($string); while ( ($pos = index($sequence, $string, $pos)) >= 0 +) { my $val = substr($sequence, $pos, $len); $pos += $len; } } } my $t3 = new Benchmark; my $t4 = new Benchmark; for (1..100) { my $ra = Regexp::Assemble->new(); $ra->add(@strings); while ( $sequence =~ /$ra/g) { my $val = pos($sequence) - 6; } } my $t5 = new Benchmark; my $td1 = timediff($t1, $t0); my $td2 = timediff($t3, $t2); my $td3 = timediff($t5, $t4); print 'Regexp : ', timestr($td1), "\n"; print 'Index : ', timestr($td2), "\n"; print 'Merged : ', timestr($td3), "\n";

      For the cases where you compare multiple regexps against your target string, it may save time if you also study($sequence) before starting the matches.

      This will do a scan of the sequence to allow subsequent matches to use the Boyer-Moore algorithm - it builds a linked list of the locations of each different character in the sequence, and then takes advantage of the frequency data to pick the rarest character for which to walk the list.

      Because the main benefit of this approach is about rarity, it may not be a big win for a case like this where the string uses only a 4-character alphabet, and (presumably) uses each character roughly 1/4 of the time; I'd be interested to see how it affects the benchmarks.

      Hugo

        Great idea, I'll add it to the next round of Benchmarks.
      That looks wrong to me, since therre is no need to constuct the $ra 100 times. Try it this way. Also try it with more patterns or do you search only for two?
      ... my $t4 = new Benchmark; my $ra = Regexp::Assemble->new(); $ra->add(@strings); for (1..100) { while ( $sequence =~ /$ra/g) { my $val = pos($sequence) - 6; } }
      Boris

        Hey, I'll try that. However, I figured that if it were a timing test I'd have to compare 100 repeats of everything, and that included construction of the RA. Otherwise it's an apples-to-oranges comparison where one method had the chance to do up-front work and the others didn't. I'll post benchmarks on that in a few minutes though and we can see if it makes a difference.

      Posting the results would be good. Otherwise people would have to go through the nightmare that is installing Bio::* in order to get them.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Hi BrowserUk -- is there more to posting the results than giving the summary stats I gave at the top of the node? My benchmark didn't give me any more than that, actually. This is the text that's at the top of the grand-parent that I'm speaking of:

        Regexp : 230 wallclock secs (229.86 usr + 0.06 sys = 229.92 CPU) Index : 22 wallclock secs (21.47 usr + 0.01 sys = 21.48 CPU) Merged : 663 wallclock secs (663.73 usr + 0.17 sys = 663.90 CPU)
Re: Multiple Regex's on a Big Sequence
by bernanke01 (Beadle) on Aug 18, 2006 at 02:45 UTC

    Here's a second round of benchmarking. I've added two things -- the suggested removal of object creation from the loop and a straight-forward regex with string studying. Here is the code, with the benchmarks at the bottom this time. :) My interpretation is that there isn't an appreciable difference caused by the object instantiation, nor by studying. I suspect the slight difference in run-time are more likely just the result of increased server load while the benchmark script was running.

    use strict; use Bio::SeqIO; use Benchmark; use Regexp::Assemble; use Carp; my $seqio = Bio::SeqIO->new( -file => "chrY.fa.masked", -format => "fa +sta" ); my $seq = $seqio->next_seq(); my $sequence = $seq->seq(); my @strings = ( 'CACGTG', 'GTGCAC' ); my $t0 = new Benchmark; for (1..100) { foreach my $string (@strings) { my $len = length($string); while ($sequence =~ /$string/g) { my $val = pos($seque +nce) - $len; } } } my $t1 = new Benchmark; my $t2 = new Benchmark; for (1..100) { foreach my $string (@strings) { my $pos = 0; my $len = length($string); while ( ($pos = index($sequence, $string, $pos)) >= 0 +) { my $val = substr($sequence, $pos, $len); $pos += $len; } } } my $t3 = new Benchmark; my $t4 = new Benchmark; for (1..100) { my $ra = Regexp::Assemble->new(); $ra->add(@strings); while ( $sequence =~ /$ra/g) { my $val = pos($sequence) - 6; } } my $t5 = new Benchmark; my $t6 = new Benchmark; my $ra = Regexp::Assemble->new(); $ra->add(@strings); for (1..100) { while ( $sequence =~ /$ra/g) { my $val = pos($sequence) - 6; } } my $t7 = new Benchmark; my $t8 = new Benchmark; for (1..100) { study($sequence); foreach my $string (@strings) { my $len = length($string); while ($sequence =~ /$string/g) { my $val = pos($seque +nce) - $len; } } } my $t9 = new Benchmark; my $td1 = timediff($t1, $t0); my $td2 = timediff($t3, $t2); my $td3 = timediff($t5, $t4); my $td4 = timediff($t7, $t6); my $td5 = timediff($t9, $t8); print 'Regexp : ', timestr($td1), "\n"; print 'Index : ', timestr($td2), "\n"; print 'Merged object inside loop : ', timestr($td3), "\n"; print 'Merged object out of loop : ', timestr($td4), "\n"; print 'Regexp with studying : ', timestr($td5), "\n";

    And the results:

    Regexp : 221 wallclock secs (221.08 usr + 0.05 sys + = 221.13 CPU) Index : 22 wallclock secs (21.49 usr + 0.00 sys = + 21.49 CPU) Merged object inside loop : 636 wallclock secs (636.61 usr + 0.21 sys + = 636.82 CPU) Merged object out of loop : 651 wallclock secs (651.06 usr + 0.18 sys + = 651.24 CPU) Regexp with studying : 240 wallclock secs (239.50 usr + 0.27 sys + = 239.77 CPU)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://567611]
Approved by bobf
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (4)
As of 2021-05-18 14:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Perl 7 will be out ...





    Results (180 votes). Check out past polls.

    Notices?