Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

string matching problem

by baxy77bax (Chaplain)
on Feb 09, 2012 at 15:45 UTC ( #952760=perlquestion: print w/ replies, xml ) Need Help??
baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks

I am turning to you after a long quest. I have been searching for the answer to a problem that does not allow me to sleep, eat or think. In past few months I have been struggling with a problem that has now turned to my obsession. Although many have tried to convince me that this is an NP problem ergo, has no practical solution, I have this gut feeling that if i somehow break the problem into small fragments i'll be able to solve it.

Now before i discourage you from even attempting to read the post i'll explain the problem

I am given a long string over the 3 character alphabet.

S= acccabacbbacabcabcacabc....
This string is usually 5-10 bilion characters long. What I am trying to figure out is how long is the shortest unique substring of that string on some position S(x) given that i allow k number of mismatches.
So let say that I am looking at the position S(10)
S[10..13] = baca
now this is the shortest unique substring on S(10), however the shortest unique substring on the same position allowing a single mismatch is :
S[10..16] = bacabca S[10..16] = bacabca x S[19..25] = cacabc.
The way you get those results traditionally is by choping your initial string into n-mers and comparing them to every position in your string. and you repeat that by increasing the n-mer length until the only hit left is a self hit. Then you know that you found your answer

My approach was based on "divide and conquer" strategy. I chopped my strings into n-mers (n is a small number) to reduce the search space and hashed them. then i compared all-vs-all n-mers to get their relative distances, therefore producing a complex but still manageable graph. and now im stuck. there is no way to reconstruct the alignments from the info i produced in any reasonable amount of time.

Does anyone has any pointers or suggestions. I'm not looking for a complexity reduction solution but anything that can make these kind of queries, implementation vise fast. (and memory efficient )

Thank you

baxy

Comment on string matching problem
Select or Download Code
Re: string matching problem
by choroba (Abbot) on Feb 09, 2012 at 15:56 UTC
    Can you define "shortest unique substring on some position"?
Re: string matching problem
by talexb (Canon) on Feb 09, 2012 at 17:37 UTC

    Have you already had a look at BioPerl and found nothing there that will address this?

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://952760]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (5)
As of 2014-09-19 04:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (129 votes), past polls