Clear questions and runnable codeget the best and fastest answer PerlMonks

### string matching problem

by baxy77bax (Chaplain)
 on Feb 09, 2012 at 15:45 UTC Need Help??
baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks

I am turning to you after a long quest. I have been searching for the answer to a problem that does not allow me to sleep, eat or think. In past few months I have been struggling with a problem that has now turned to my obsession. Although many have tried to convince me that this is an NP problem ergo, has no practical solution, I have this gut feeling that if i somehow break the problem into small fragments i'll be able to solve it.

Now before i discourage you from even attempting to read the post i'll explain the problem

I am given a long string over the 3 character alphabet.

```S= acccabacbbacabcabcacabc....
[download]```
This string is usually 5-10 bilion characters long. What I am trying to figure out is how long is the shortest unique substring of that string on some position S(x) given that i allow k number of mismatches.
So let say that I am looking at the position S(10)
```S[10..13] = baca
[download]```
now this is the shortest unique substring on S(10), however the shortest unique substring on the same position allowing a single mismatch is :
```S[10..16] = bacabca

S[10..16] = bacabca
x
S[19..25] = cacabc.
[download]```
The way you get those results traditionally is by choping your initial string into n-mers and comparing them to every position in your string. and you repeat that by increasing the n-mer length until the only hit left is a self hit. Then you know that you found your answer

My approach was based on "divide and conquer" strategy. I chopped my strings into n-mers (n is a small number) to reduce the search space and hashed them. then i compared all-vs-all n-mers to get their relative distances, therefore producing a complex but still manageable graph. and now im stuck. there is no way to reconstruct the alignments from the info i produced in any reasonable amount of time.

Does anyone has any pointers or suggestions. I'm not looking for a complexity reduction solution but anything that can make these kind of queries, implementation vise fast. (and memory efficient )

Thank you

baxy

Replies are listed 'Best First'.
Re: string matching problem
by talexb (Canon) on Feb 09, 2012 at 17:37 UTC

Have you already had a look at BioPerl and found nothing there that will address this?

Alex / talexb / Toronto

"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: string matching problem
by choroba (Bishop) on Feb 09, 2012 at 15:56 UTC
Can you define "shortest unique substring on some position"?

Log In?
 Username: Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://952760]
Approved by Corion
help
Chatterbox?
 [1nickt]: Wow, sometimes "Perl programming" hardly involves Perl! I just spent a couple of days implementing some features in the CPAN Testers API and it was 90% hand-editing JSON Swagger config and test data structs, and writing doc. Maybe that's just the sign of [Discipulus]: rethatching? a real, old cottage? [1nickt]: ... a well-crafted project that's easy to extend.

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (9)
As of 2017-11-21 11:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
In order to be able to say "I know Perl", you must have:

Results (297 votes). Check out past polls.

Notices?