Re: counting the number of 16384 pattern matches in a large DNA sequence
by BrowserUk (Patriarch) on Jun 14, 2012 at 16:13 UTC
|
getting the count of 16384 patterns in a large string(sequence)
You say "16384 patterns", but you do not show us what those patterns look like?
You say "in a large sequence", but you are processing a hash of sequences. How many sequences are in the hash? How long are they?
You say "the code is very slow", without saying how slow, or what order of improvement you are seeking.
You are slowing your code down by chomping inside the inner loop, which will be redundant after the first time.
You are also studying the sequences -- a slow process, that may or may not be benefitting you -- inside the inner loop; which means that you will study each sequence once for each pattern. There is no benefit to studying a string more than once.
If you are serious about improving the performance of this code, you will need to supply considerably more information. A few sample patterns and a sample sequence -- are they just ACGT; or do they contains the full range of nucleic acid codes ACGTURYKMSWBDHVNX-; or the full range of amino acid codes: ABCDEFGHIJKLMNOPQRSTUVWYZX*-; ?
The better teh info you supply, the more likely that you will get an effective solution.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
| [reply] |
|
There is no benefit to studying a string more than once.
Just as a matter of curiosity, have you ever come across a case in which there was a benefit from studying a string even once?
Occasionally and informally, I have Benchmark-ed a use of study that I thought might be beneficial per the description in the docs: matching many regexes against a single, unchanging string. I have never seen any benefit. (I must admit I have never tested the specific example given in the perlfunc docs for which a potential "big win" is claimed.) Has regex optimization reached a point at which we can say "... I ain't gonna study [strings] no more"?
| [reply] |
|
Just as a matter of curiosity, have you ever come across a case in which there was a benefit from studying a string even once?
Yes. But it was
- a while ago on much slower hardware than I now run.
Boyer-Moore worked best in the days before the chip manufacturers had built string search algorithms into microcode; and before the advent of deeply pipelined architectures with branch predictions, when branching had a proportionally significantly greater impact on throughput than with modern hardware.
- the needle being search for contained a character that was very rare in the haystack being searched.
This allows the algorithm to trade intelligence for brute force.
With modern processors that overlap fetches and branching stalls with other parts of the opcodes, the benefits are greatly curtailed.
I tried to look up a old post of mine that benchmarked study having a beneficial effect, but couldn't find it.
I also tried to replicate my memory of what I was doing back then, but cannot reproduce it. Whether that's because my memory is bad, or my modern processor negates the benefits I'm not totally sure, but I suspect the latter.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
| [reply] |
|
As of Perl 5.16.0: "study is now a no-op, presumably fixing all outstanding bugs related to study causing regex matches to behave incorrectly!"
perldelta: Other Notable Fixes
It's a shame that it's not mentioned in the POD for study directly, but really not much of a shame to see the function fade away. The few times I attempted to use it, benchmarking showed that I still hadn't found a good application for it.
| [reply] |
|
but really not much of a shame to see the function fade away
Agreed. Whilst Boyer-Moore (and some of the others: Aho-Corasick; Rabin-Karp; etc.) can be beneficial for specific tasks and algorithms -- spam filters; virus detection -- they rarely live up to their promise for general purpose work. As such, the complications they add to the runtime aren't worth it for the rare occasions when they are beneficial. Especially on modern hardware with big caches and pipelines.
For genomic purposes (with its often very small alphabet), it is quite trivial to setup a specialist indexing mechanism that shows far greater benefits.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
| [reply] |
Re: counting the number of 16384 pattern matches in a large DNA sequence
by kennethk (Abbot) on Jun 14, 2012 at 16:14 UTC
|
First, since you are talking about performance characteristics, appropriate optimization depends on real use cases; this means we'd need not only your code, but some characteristic input. Then, you should use a profiler (like Devel::NYTProf) in order to accurately identify what's taking all your time.
On casual examination of your code, I note that you call study multiple times (once for every value of @string) on your %seq values. I haven't used study myself, but it seems like moving that outside your loop would buy you some time, e.g.
study for values %seq;
foreach my $string (@string) { ...
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
| [reply] [d/l] [select] |
|
AAAAAAA
AAAAAAT
AAAAAAG
AAAAAAC
AAAAATA
AAAAATT
AAAAATG
AAAAATC
AAAAAGA
AAAAAGT
AAAAAGG
Another input file is a multi fasta file having sequences for each chromosome
like this:
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
so i have stored the sequence for each chromosome in a hash with hash key as chromosome number and concatenated sequence as value.The number of lines in the multifasta file is 61913917.Thanks
| [reply] [d/l] [select] |
|
AAAAAAA
AAAAAAT
AAAAAAG
AAAAAAC
Are all your patterns the same length? Are they all upper case? Are you looking for exact (including case) matches only?
Is it possible to obtain a copy of the patterns file?
A small extract from a fasta file I have kicking around (HG:chr20):
...
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
GATCCAgaggtggaagaggaaggaagcttggaaccctatagagttgctga
gtgccaggaccagatcctggccctaaacaggtggtaaggaaggagagagt
gaaggaactgccaggtgacacactcccaccatggacctctgggatcctag
ctttaagagatcccatcacccacatgaacgtttgaattgacagggggagc
...
index is usually faster for matching constant strings, but if you you want case independent matches, you would need to uc the sequences before searching (and studying).
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
| [reply] [d/l] [select] |
|
|
|
| [reply] |
|
|
|
|
One of the inputs is a file of 16384 strings given in below example format:
Given a string of 7 fasta characters, you can form 4^7= 16384 unique strings from T, A, G, C. So, his algorithm is checking every possible 7 char fasta string against the mutifasta file (number of lines in the multifasta file is 61913917).
| [reply] |
|
|
|
Re: counting the number of 16384 pattern matches in a large DNA sequence (100x faster?)
by BrowserUk (Patriarch) on Jun 15, 2012 at 00:11 UTC
|
sub gen;
sub gen{
return @_[1..$#_] if $_[0] == 1;
map{
my $p=$_;
map{ $p . $_ } gen( $_[0]-1, @_[1..$#_] )
} @_[1..$#_]
}
my %seqs = ...;
my @patterns = gen( 7, qw[A C G T] );
my %counts;
for my $seq ( values %seqs ) {
++$counts{ substr $seq, $_, 7 } for 0 .. length( $seq )-7;
}
print "$_ ::= $counts{ $_ }" for @patterns;
In my experiments on a 49 million base pairs sequence: [ 0:15:31.00] C:\test\humanGenome>..\junk999 chr21.fa
16384 patterns.
49092500 base pairs
Using custom indexing found 35106546 matches; took 34.536852 seconds
Using custom index2 found 35106546 matches; took 31.354438 seconds
Simple search found 35106546 matches; took 2970.517883 seconds
it was close to 100 times faster than your current method. YMMV.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
| [reply] [d/l] [select] |
|
| [reply] |
|
my %seqs = ...;
With your code that loads the hash with your sequences?
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
| [reply] [d/l] [select] |
|
#! perl -slw
use strict;
use Time::HiRes qw[ time ];
sub gen{
return @_[1..$#_] if $_[0]==1;
map{
my $p=$_;
map{ $p . $_ } gen($_[0]-1, @_[1..$#_] )
} @_[1..$#_]
}
our $N //= 7;
my %seqs = map {
if( length() ) {
my( $id, @seq ) = split "\n", $_;
$id => join '', @seq;
}
else { () }
} split '>', do{ local $/; uc( <DATA> ) };
my $start = time;
my %counts;
for my $seq ( values %seqs ) {
++$counts{ substr $seq, $_, $N } for 0 .. length( $seq ) -$N;
}
print "Took ", time - $start;
print "$_ ::= ", $counts{ $_ } // 0 for gen( $N, qw[A C G T] );
__DATA__
> DNA1
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG
GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA
> DNA2
AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT
GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCA
AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC
CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT
TTTATCTTTAGGCGGTATGCACTTTTAACAGTCACCCCCCAACTAACACA
> DNA3
TTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATACAACCCCC
GCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAAC
CAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCA
AAGCAATACACTGAAAATGTTTAGACGGGCTCACATCACCCCATAAACAA
ATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGC
AAGCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAG
> DNA4
AGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCCAGCATTC
CCCCTCAAACCTAAGAAATATGTCTGATAAAAGAGTTACTTTGATAGAGT
AAATAATAGGAGCTTAAACCCCCTTATTTctaggactatgagaatcgaac
ccatccctgagaatccaaaattctccgtgccacctatcacaccccatcct
aAAGTAAGGTCAGCTAAATAAGCTATCGGGCCCATACCCCGAAAATGTTG
GTTATACCCTTCCCGTACTAATTAATCCCCTGGCCCAACCCGTCATCTAC
And some output: C:\test>976237-2 -N=2
Took 0.000387907028198242
AA ::= 97
AC ::= 86
AG ::= 49
AT ::= 84
CA ::= 99
CC ::= 130
CG ::= 26
CT ::= 71
GA ::= 35
GC ::= 43
GG ::= 27
GT ::= 36
TA ::= 85
TC ::= 68
TG ::= 39
TT ::= 71
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
| [reply] [d/l] [select] |
|
| [reply] |
Re: counting the number of 16384 pattern matches in a large DNA sequence
by CountZero (Bishop) on Jun 14, 2012 at 18:27 UTC
|
You could try to match for the 16384 patterns in one go by building one big regex with Regexp::Assemble.
CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics
| [reply] |
Re: counting the number of 16384 pattern matches in a large DNA sequence
by jwkrahn (Abbot) on Jun 14, 2012 at 19:09 UTC
|
@string = <IN1>;
chomp @string;
while ( my $string = <IN1> ) {
chomp $string;
my ( $value ) = values %seq;
my $result = () = $value =~ /$string/ig;
print "$string => $result\n";
}
| [reply] [d/l] [select] |
|
while (my $string = <IN1>) {
chomp $string;
#my ($value) = values %seq;
foreach $value (values %seq) {
my $result = () = $value =~ /$string/g;
print "$string => $result\n";
}
}
| [reply] [d/l] |
|
| [reply] |