Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Seeking algorithm for finding common continous sub-patterns

by johnnywang (Priest)
on Dec 03, 2004 at 07:06 UTC ( #412061=perlquestion: print w/ replies, xml ) Need Help??
johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Seeking wisdom: I have a set of number sequences, given a length parameter n, and minimum occurance parameter m, I'd like to find common continuous subsequences of length n that occur at least m times. Here's an example:
@a = ( [2,5,10,5,12,6,21,5,10,12,23], [5,6,11,10,5,10,6,21,5,1,9], [6,5,10,15,21] ); $m = 2; $n = 2;
I'd like the result to be the set containing the following sequences (hopefully I haven't missed any):
(5,10) #which occured twice in 1st, once in 2nd and 3rd (10,5) # which occured once in 1st and 2nd (6,21,5) # which occured once in 1st and 2nd
performance is a concern since both the length and the number of sequences can be pretty big, say 1000. Thanks.

Update: I meant "at least length n". The subsequence does need to be in different sequences, but I'd like to have a total count including multiple occurances within the same sequence.

Comment on Seeking algorithm for finding common continous sub-patterns
Select or Download Code
Re: Seeking algorithm for finding common continous sub-patterns
by Jaap (Curate) on Dec 03, 2004 at 07:58 UTC
    What did you come up with so far? How slow is the no-brainer way? What performance are you looking for?
    Edit:
    Should they occur m times in different "subarrays"?
Re: Seeking algorithm for finding common continous sub-patterns
by davidj (Priest) on Dec 03, 2004 at 08:47 UTC
    Johnnywang,

    If I understand the problem correctly, then from an algorithmic standpoint, what you are looking for is unrealistic, at least to be solved in any reasonable amount of time. The problem is this: the only way to guarantee finding every common continuous subsequence of any length is to do it by brute force. You have to check every subsequence from size m up to the size of the smallest sequence.

    In the example you give, you have to check every subsequence from size 2 to size 5. I did a quick math check (so the exact number might be wrong), but the number of check would be in the neighborhood of 600. And this is with only 3 sequences of sizes 11, 11, and 5. Just adding 1 number to each of the 3 sequences would add hundreds of more required checks. The growth of the problem with every additional sequence is polynomial.

    For you to want to check 1000's of sequences that may be 1000's of numbers long would require in the billions of required checks which would take much longer than I'm sure you are willing to accept. This is one of those NP problems that CS majors learn about in algorithm classes (traveling salesman, coloring problem, hamiltonian circuit problem, etc).

    If someone can come up with an algorithm that is not brute force, thus making the problem not NP, I would love to see it.

    davidj
      davidj, I had the same general feeling, so I didn't even try the brute force method, convinced it's going to take too long.
Re: Seeking algorithm for finding common continous sub-patterns
by ysth (Canon) on Dec 03, 2004 at 09:20 UTC
    So you mean subsequences of at least length n? Do the m occurences have to be in different elements of @a? If so, do the elements have to be adjacent? (I'm trying to figure out what you mean by 'continuous'.)

    Update: If you really meant at least, would you want (1,2,3,1,2,4,1,2,4) to show (1,2,4) and (1,2) or just (1,2,4)?

    Update: assuming "continuous" only meant not considering (1,3,5) to be a subsequence of (1,2,3,4,5), and that you were just abbreviating the fact that there were matching (6,21) and (21,5) by saying (6,21,5), and assuming all your data are integers > 0:

    use warnings; use strict; no warnings "utf8"; my @a = ( [2,5,10,5,12,6,21,5,10,12,23], [5,6,11,10,5,10,6,21,5,1,9], [6,5,10,15,21] ); my $m = 2; my $n = 2; my $big = join "\0", map join('', map chr, @$_), @a; $big =~ tr/\0\n/\n\0/; my %uniq; my @m2 = map [map ord||10, split //], grep !$uniq{$_}++, $big =~ /(?=(.{$n})(?s:.*?\1){${\($m-1)}})./g; use Data::Dumper; print Data::Dumper->new([$_])->Terse(1)->Indent(0)->Dump(), "\n" for @m2;
    but it doesn't scale well (but may scale as well as any other solution.)
      You're all correct: length at least n, in m different sequences, and adjacent (ordered).
Re: Seeking algorithm for finding common continous sub-patterns
by tachyon (Chancellor) on Dec 03, 2004 at 09:33 UTC

    Maybe this will be fast enough for you. Maybe you need to be a bit more specific with your spec. This will not scale very well. The 27 element data set consumes 120 loops in the generation phase and 34 in the output phase for a total of 154. It is roughly O(n^2) but it is quite data dependent.

    I wrote Algorithm::LCSS which is based on Algorithm::Diff and may be a better option depending on the real task. The problem with that approach is that it is a one to one comparison not many to many which is what you seem to want.

    my @a = ( [2,5,10,5,12,6,21,5,10,12,23], [5,6,11,10,5,10,6,21,5,1,9], [6,5,10,15,21] ); my $m = 2; my $n = 2; my %h; for my $ref (@a) { for my $i ( 0 .. (@$ref-$n) ) { for my $j ( ($i+$n-1) .. @$ref-1 ) { $h{ join ',', (@$ref)[$i..$j] }++; } } } for my $seq( sort { $h{$b}<=>$h{$a} } keys %h ) { print "$h{$seq}: $seq\n" if $h{$seq} >= $m; } __DATA__ 4: 5,10 2: 6,21,5 2: 10,5 2: 6,21 2: 21,5

    cheers

    tachyon

Re: Seeking algorithm for finding common continous sub-patterns
by BrowserUk (Pope) on Dec 03, 2004 at 12:02 UTC

    Unless I misunderstood the question, this may be NP complete, but it doesn't take long. Leastwise not if all your numbers are small integers, < 255.

    What I did was to encode the arrays of integers into strings, and then build an index to all the N-char length substrings in the 1000 x 1000-char strings. This latter process only takes a few seconds.

    Less time infact than generating and encoding the test data. Reading it from a file and encoding it directly to strings would make it quicker still, by saving on the ram for all the arrays.

    It takes around 27 seconds (total: Generation, encoding, indexing, counting and printing to nul!) to process 1000 x 1000-number sets on my mid-range machine.

    #! perl -slw use strict; use Data::Dumper; $| = 1; $" = ', '; our $LEN ||= 1000; our $COUNT ||= 1000; our $M ||= 2; our $N ||= 2; my @data = map { [ map{ int rand 256 } 1 .. $LEN ] } 1 .. $COUNT; warn "Gen'd data"; my @encoded = map { pack 'C*', @$_; } @data; warn 'Encoded data'; my %seq; for my $i ( 0 .. $#encoded ) { for my $o ( 0 .. $LEN - $N ) { push @{ $seq{ substr $encoded[ $i ], $o, $N } }, $i; } } my $count = 0; for my $run ( keys %seq ) { next unless @{ $seq{ $run } } > $M; $count++; printf "Run [%*s] appears in lines: %s\n", $N * 4, join(',', map{ sprintf '%03d', $_ } unpack 'C*', $run ), "[@{ $seq{ $run } }]"; } warn "Found $count $N-number runs appearing in at least $M sets"; __END__ [11:54:20.15] P:\test>412061 -LEN=1000 -COUNT=1000 -N=2 -M=2 >nul Gen'd data at P:\test\412061.pl line 16. Encoded data at P:\test\412061.pl line 21. Found 65533 2-number runs appearing in at least 2 sets at P:\test\4120 +61.pl line 41. [11:54:47.06] P:\test>412061 -LEN=1000 -COUNT=1000 -N=3 -M=2 >nul Gen'd data at P:\test\412061.pl line 16. Encoded data at P:\test\412061.pl line 21. Found 553 3-number runs appearing in at least 2 sets at P:\test\412061 +.pl line 41. [11:55:16.46] P:\test> [11:55:23.76] P:\test>412061 -LEN=1000 -COUNT=1000 -N=3 -M=3 >nul Gen'd data at P:\test\412061.pl line 16. Encoded data at P:\test\412061.pl line 21. Found 9 3-number runs appearing in at least 3 sets at P:\test\412061.p +l line 41. [11:55:55.84] P:\test> [11:55:55.84] P:\test>412061 -LEN=1000 -COUNT=1000 -N=3 -M=3 Gen'd data at P:\test\412061.pl line 16. Encoded data at P:\test\412061.pl line 21. Run [ 215,017,145] appears in lines: [546, 657, 671, 986] Run [ 205,013,147] appears in lines: [326, 360, 385, 975] Run [ 134,174,216] appears in lines: [335, 512, 539, 592] Run [ 114,015,201] appears in lines: [470, 544, 557, 989] Run [ 065,233,176] appears in lines: [533, 656, 789, 930] Run [ 003,013,171] appears in lines: [17, 169, 460, 979] Run [ 191,241,173] appears in lines: [25, 88, 442, 782] Run [ 245,165,229] appears in lines: [373, 405, 599, 697] Run [ 072,156,003] appears in lines: [86, 153, 464, 544] Run [ 055,130,150] appears in lines: [73, 127, 157, 384, 868] Run [ 044,035,182] appears in lines: [3, 402, 734, 764] Run [ 187,058,006] appears in lines: [171, 487, 924, 944] Run [ 060,153,213] appears in lines: [72, 404, 708, 976] Run [ 057,200,093] appears in lines: [221, 246, 514, 908] Found 14 3-number runs appearing in at least 3 sets at P:\test\412061. +pl line 41. [12:04:02.68] P:\test>

    Examine what is said, not who speaks.
    "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
    "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      As I reread his question, you may be correct. The problem is that his output does not match his stated conditions. According to his stated conditions, in the example he gives, he wants subsequences of length 2. However, his output includes a subsequence of length 3. For this reason, I am assuming that he misstated his conditions and means to say "I'd like to find common continuous subsequences of AT LEAST length n." If I am correct, then the time to solve an input of 1000's of sequences of 1000's of numbers would be prohibitively long, especially if n is small. If I am not correct, then you are closer to the actual runtime.

      Very good solution.

      davidj
        Sorry for the confusion, I did mean "at least length n". But it's not that bad, in my particular situation, I don't expect the length to be big, say the length of the subsequence is between 2 and 10.
Re: Seeking algorithm for finding common continous sub-patterns
by hv (Parson) on Dec 03, 2004 at 13:47 UTC

    Here's a simplistic approach using regexps:

    my @a = ( [2,5,10,5,12,6,21,5,10,12,23], [5,6,11,10,5,10,6,21,5,1,9], [6,5,10,15,21] ); my($m, $n) = (2, 2); my $joined = join ';', map join(',', @$_), @a; while ($joined =~ / \b (?= ( @{[ join ',', ('\d+') x $n ]} ) \b @{[ '(?> .*? \b \1 \b )' x ($m - 1) ]} ) /xg) { print "($1)\n" unless $seen{$1}++; }

    I suspect this isn't as fast as you'd need: on my machine, it took about 30 seconds to search an array of 100 x 100 random numbers for $m = 3, $n = 4, and runtime will be O(n^2) the total number of integers in your list of sequences.

    Also this finds only common subsequences of the exact length specified, so in this case it returns (5,10), (10,5), (6,21), (21,5). I can't see a way to adapt this to return directly only the longest common sequences, but you could maybe save all the length-2 sequences, then search for length-3 sequences and discard their subsequences, iteratively until you've found the longest subsequence.

    Hugo

Re: Seeking algorithm for finding common continous sub-patterns
by mattr (Curate) on Dec 04, 2004 at 18:12 UTC
    I cannot add much now to the interesting responses above, but would like to note that your problem has some resemblences to the problem of finding matching sequences in genomes.

    Also I found a nice paper of academic interest about sequence searching (not exactly your problem, but I couldn't resist as it has nice hashing of multidimensional tables at around section 4.3) It involves sliding a window over the target and storing the slopes of segments in very big hashes (if I understand as much as I read). I'd like to mess more with this but it's 3am here..

    If you do not need an exhaustive list of all pattern s but just the most interesting ones, or statistically significant ones allowing for a number of sequence errors, biological code is more interesting for you. You could look for example at how they do RepeatFinder at TIGR if curious.

    Also you could google about "hidden Markov", "interpolated Markov" or Viterbi which are used often to find hidden sequences or attempt predictions of what will come next in a sequence.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://412061]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (10)
As of 2014-08-28 13:03 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (260 votes), past polls