PerlMonks  

Re: Fast common substring matching

by BrowserUk (Patriarch)
on Aug 24, 2005 at 13:39 UTC


in reply to Fast common substring matching

You're still not locating all equal-length LCSs.

P:\test>type duptest.dat
>string1
AAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTTTTTTTTT
>string2
TTTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAA
>string3
TTTTTTTTTTTTTTTTTxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxAAAAAAAAAAAAAAAAA

P:\test>gf duptest.gf
000:001 L[ 17] ( 0 47)
000:002 L[ 17] ( 0 47)
001:002 L[ 17] ( 0 0)
Completed in 0.001205
Best match: >string1 - >string2. 17 characters starting at 0 and 47.
Best match: >string1 - >string3. 17 characters starting at 0 and 47.
Best match: >string2 - >string3. 17 characters starting at 0 and 0.

Each pairing contains two matches of the same length (17 characters), and both should be reported:

P:\test>484593-5 duptest.dat
000:001 L[017] (0000,0047)'TTTTTTTTTTTTTTTTT' (0047,0000)'AAAAAAAAAAAAAAAAA'
000:002 L[017] (0000,0000)'TTTTTTTTTTTTTTTTT' (0047,0047)'AAAAAAAAAAAAAAAAA'
001:002 L[017] (0000,0047)'AAAAAAAAAAAAAAAAA' (0047,0000)'TTTTTTTTTTTTTTTTT'
3 trials of duptest.dat ( 323us total), 107us/trial
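
As a point of reference, here is a minimal brute-force sketch of what "all equal-length LCSs" means for one pair of strings. It is not the code behind either program above; the function name `all_lcs` is hypothetical, and it uses plain O(n*m) dynamic programming, so it is only suitable for checking results on small inputs. Offsets are printed as (offset in first string, offset in second string), which need not match the column order of either listing above.

```perl
use strict;
use warnings;

# Return the length of the longest common substring of $s and $t,
# followed by every [offset in $s, offset in $t] pair at which a
# common substring of that length starts.
sub all_lcs {
    my ($s, $t) = @_;
    my $best = 0;
    my @hits;
    my @prev = (0) x (length($t) + 1);
    for my $i (1 .. length $s) {
        my @cur = (0) x (length($t) + 1);
        for my $j (1 .. length $t) {
            next unless substr($s, $i - 1, 1) eq substr($t, $j - 1, 1);
            # Extend the match that ended at ($i - 1, $j - 1).
            my $len = $cur[$j] = $prev[$j - 1] + 1;
            if ($len > $best) { $best = $len; @hits = (); }
            push @hits, [ $i - $len, $j - $len ] if $len == $best;
        }
        @prev = @cur;
    }
    return ($best, @hits);
}

# string1 and string2 from duptest.dat
my $s1 = ('A' x 17) . ('G' x 30) . ('T' x 17);
my $s2 = ('T' x 17) . ('C' x 30) . ('A' x 17);

my ($len, @hits) = all_lcs($s1, $s2);
printf "L[%03d] %s\n", $len, join ' ', map { "($_->[0],$_->[1])" } @hits;
# prints: L[017] (0,47) (47,0)
```

Both 17-character matches are kept because a hit is recorded whenever a match ties the current best length, not only when it beats it.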

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" may be good enough for the now, and perfection may be unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

Re^2: Fast common substring matching
by GrandFather (Saint) on Aug 24, 2005 at 21:16 UTC

    It's a special case, and I don't think it is a problem in practice. It only happens when a block longer than $subStrSize (the minimum match quantum) contains a repeated pattern. Test strings and results are shown below:

    >string1
    01010101010ddddddddddd01234566789a12345yy
    >string2
    0123456789b12345eeeeeeeeeeeex01010101010x
    >string3
    0123456789c12345ffffffffffff01010101010zz

    000:001 L[ 11] ( 0 29)
    000:002 L[ 11] ( 0 28)
    001:002 L[ 10] ( 0 0)
    Completed in 0.002126
    Best match: >string1 - >string2. 11 characters starting at 0 and 29.
    Best match: >string1 - >string3. 11 characters starting at 0 and 28.

    Perl is Huffman encoded by design.

      Okay, but I'm not sure you can safely discount the possibility of long runs of a repeated character, even in biodata. The following scans the complete Drosophila (fruit fly) genome for runs of repeated DNA characters and counts them by length and by base:

      >perl -nlwe" print $1 while m[((.)\2+)]g" na_clones.dros.RELEASE2.5 | (
      More? perl -nle"$b{ length($_) }{ chop $_}++ } { printf qq[length %d :[ A:%d C:%d G:%d T:%d ]\n], $_, @{$b{$_}}{'A','C','G','T'} for sort{$b<=>$a} keys %b" )
      length 50 :[ A:1 C:0 G:0 T:0 ]
      length 47 :[ A:1 C:0 G:0 T:0 ]
      length 45 :[ A:1 C:0 G:0 T:0 ]
      length 44 :[ A:0 C:0 G:0 T:0 ]
      length 43 :[ A:0 C:0 G:0 T:1 ]
      length 42 :[ A:1 C:0 G:0 T:0 ]
      length 41 :[ A:1 C:0 G:0 T:1 ]
      length 40 :[ A:1 C:0 G:0 T:0 ]
      length 39 :[ A:0 C:0 G:0 T:0 ]
      length 38 :[ A:1 C:0 G:0 T:0 ]
      length 37 :[ A:2 C:0 G:0 T:0 ]
      length 36 :[ A:5 C:0 G:0 T:4 ]
      length 35 :[ A:5 C:0 G:0 T:3 ]
      length 34 :[ A:2 C:0 G:0 T:3 ]
      length 33 :[ A:2 C:0 G:0 T:3 ]
      length 32 :[ A:1 C:0 G:0 T:1 ]
      length 31 :[ A:5 C:0 G:0 T:5 ]
      length 30 :[ A:4 C:0 G:0 T:5 ]
      length 29 :[ A:10 C:0 G:0 T:6 ]
      length 28 :[ A:12 C:0 G:0 T:13 ]
      length 27 :[ A:19 C:0 G:0 T:12 ]
      length 26 :[ A:26 C:0 G:0 T:28 ]
      length 25 :[ A:35 C:1 G:1 T:34 ]
      length 24 :[ A:48 C:0 G:0 T:47 ]
      length 23 :[ A:63 C:0 G:1 T:59 ]
      length 22 :[ A:78 C:1 G:0 T:84 ]
      length 21 :[ A:109 C:5 G:3 T:91 ]
      length 20 :[ A:126 C:5 G:6 T:144 ]
      length 19 :[ A:188 C:17 G:15 T:148 ]
      length 18 :[ A:240 C:25 G:26 T:314 ]
      length 17 :[ A:411 C:45 G:55 T:389 ]
      length 16 :[ A:647 C:73 G:78 T:656 ]
      length 15 :[ A:905 C:122 G:133 T:886 ]
      length 14 :[ A:1267 C:191 G:216 T:1208 ]
      length 13 :[ A:1813 C:255 G:260 T:1805 ]
      length 12 :[ A:2702 C:353 G:353 T:2800 ]
      length 11 :[ A:4209 C:380 G:404 T:4184 ]
      length 10 :[ A:7181 C:446 G:428 T:7343 ]
      length 9 :[ A:9964 C:386 G:437 T:10011 ]
      length 8 :[ A:14581 C:699 G:601 T:14514 ]
      length 7 :[ A:28336 C:2656 G:2644 T:28782 ]
      length 6 :[ A:90274 C:11644 G:11543 T:89663 ]
      length 5 :[ A:308601 C:52038 G:52017 T:309011 ]
      length 4 :[ A:889188 C:199827 G:200919 T:885567 ]
      length 3 :[ A:2424590 C:938297 G:935629 T:2422964 ]
      length 2 :[ A:6232754 C:4721693 G:4718156 T:6219367 ]

      As you can see, runs longer than 16 characters are still fairly frequent, and even at lengths > 32 there are still enough to be worrying.
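
For anyone who finds the piped one-liners hard to read, the same counting idea can be written as a short script. This is only a sketch: the function name `run_histogram` is hypothetical, it takes lines from a list rather than from the genome file, and the sample input is made up for illustration, so its counts have nothing to do with the Drosophila figures above.

```perl
use strict;
use warnings;

# Count runs of two or more identical characters, keyed first by run
# length and then by the character that repeats.
sub run_histogram {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        # $1 is the whole run, $2 is the repeated character.
        while ($line =~ /((.)\2+)/g) {
            $count{ length $1 }{$2}++;
        }
    }
    return \%count;
}

# Made-up sample lines, not real sequence data.
my $hist = run_histogram('AAAATTGCCCAAAA', 'GGTTTTTTA');

for my $len (sort { $b <=> $a } keys %$hist) {
    printf "length %d :[ A:%d C:%d G:%d T:%d ]\n",
        $len, map { $hist->{$len}{$_} || 0 } qw(A C G T);
}
```

To run it over a real file, one option is to read the file line by line and pass the chomped lines to `run_histogram`; for very large inputs it would be better to update the hash inside the read loop instead of holding every line in memory.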

        The downside is that the problem is not limited to a single repeated character ('AAAA'): short repeating sequences ('ACTACTACT') can also be missed or truncated. The upside is that for bioMan's problem a minimum match quantum of 128 is probably optimal, and I'd guess that is long enough for this to be unlikely to matter.
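
To get a feel for how common such short repeating sequences are in a given input, the run-scanning regex from earlier in the thread can be generalised from a single character to a short unit. The sketch below is an illustration only and is not part of the matching code under discussion: `tandem_repeats`, `$max_unit` and `$min_copies` are hypothetical names, and the sample string is made up.

```perl
use strict;
use warnings;

# Find places where a unit of 1 .. $max_unit characters repeats back to
# back at least $min_copies times. Each result is [offset, unit, copies].
# The lazy quantifier prefers the shortest unit, so 'AAAAAA' is reported
# as six copies of 'A' rather than three copies of 'AA'.
sub tandem_repeats {
    my ($seq, $max_unit, $min_copies) = @_;
    my $re = sprintf '((.{1,%d}?)\2{%d,})', $max_unit, $min_copies - 1;
    my @found;
    while ($seq =~ /$re/g) {
        # $-[1] is the start offset of the whole repeated block ($1).
        push @found, [ $-[1], $2, length($1) / length($2) ];
    }
    return @found;
}

# Made-up sample: 'ACT' three times at offset 2, 'T' five times at offset 12.
for my $r (tandem_repeats('GGACTACTACTGTTTTTC', 3, 3)) {
    printf "offset %d: '%s' x %d\n", @$r;
}
# prints:
# offset 2: 'ACT' x 3
# offset 12: 'T' x 5
```

Backtracking makes this slow on long inputs with large `$max_unit`, so it is better suited to sampling a data set than to running inside the matcher itself.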

        So far I haven't thought of a fast way to deal with the issue, and I'm somewhat inclined to ignore it unless someone can convince me that this is really useful code that needs this bug fixed.
