Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer

string matching problem

by baxy77bax (Chaplain)
on Feb 09, 2012 at 15:45 UTC ( #952760=perlquestion: print w/replies, xml ) Need Help??
baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks

I am turning to you after a long quest. I have been searching for the answer to a problem that does not allow me to sleep, eat or think. In past few months I have been struggling with a problem that has now turned to my obsession. Although many have tried to convince me that this is an NP problem ergo, has no practical solution, I have this gut feeling that if i somehow break the problem into small fragments i'll be able to solve it.

Now before i discourage you from even attempting to read the post i'll explain the problem

I am given a long string over the 3 character alphabet.

S= acccabacbbacabcabcacabc....
This string is usually 5-10 bilion characters long. What I am trying to figure out is how long is the shortest unique substring of that string on some position S(x) given that i allow k number of mismatches.
So let say that I am looking at the position S(10)
S[10..13] = baca
now this is the shortest unique substring on S(10), however the shortest unique substring on the same position allowing a single mismatch is :
S[10..16] = bacabca S[10..16] = bacabca x S[19..25] = cacabc.
The way you get those results traditionally is by choping your initial string into n-mers and comparing them to every position in your string. and you repeat that by increasing the n-mer length until the only hit left is a self hit. Then you know that you found your answer

My approach was based on "divide and conquer" strategy. I chopped my strings into n-mers (n is a small number) to reduce the search space and hashed them. then i compared all-vs-all n-mers to get their relative distances, therefore producing a complex but still manageable graph. and now im stuck. there is no way to reconstruct the alignments from the info i produced in any reasonable amount of time.

Does anyone has any pointers or suggestions. I'm not looking for a complexity reduction solution but anything that can make these kind of queries, implementation vise fast. (and memory efficient )

Thank you


Replies are listed 'Best First'.
Re: string matching problem
by talexb (Canon) on Feb 09, 2012 at 17:37 UTC

    Have you already had a look at BioPerl and found nothing there that will address this?

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: string matching problem
by choroba (Bishop) on Feb 09, 2012 at 15:56 UTC
    Can you define "shortest unique substring on some position"?

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://952760]
Approved by Corion
[1nickt]: Wow, sometimes "Perl programming" hardly involves Perl! I just spent a couple of days implementing some features in the CPAN Testers API and it was 90% hand-editing JSON Swagger config and test data structs, and writing doc. Maybe that's just the sign of
[Discipulus]: rethatching? a real, old cottage?
[1nickt]: ... a well-crafted project that's easy to extend.

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (9)
As of 2017-11-21 11:49 GMT
Find Nodes?
    Voting Booth?
    In order to be able to say "I know Perl", you must have:

    Results (297 votes). Check out past polls.