Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Re: Search for identical substrings

by GrandFather (Saint)
on Aug 17, 2005 at 21:00 UTC ( [id://484598]=note: print w/replies, xml ) Need Help??


in reply to Search for identical substrings

Can you provide a sample of the type of data you are actually using. You talk of characters, but show numbers in your sample. The nature of the data may make a big difference to how the problem can be solved.


Perl is Huffman encoded by design.

Replies are listed 'Best First'.
Re^2: Search for identical substrings
by bioMan (Beadle) on Aug 18, 2005 at 15:52 UTC

    The data I am working with are DNA sequences. (I know of the Bioperl project but have found nothing there that could help.) I have distilled the data to the DNA sequences alone. I have a second file with the sequence IDs, which is why I also print out the indicies from the matching strings. The data look like this.

    ATGGAGAACATCACATCAGGACTCCTAGGACCCCTTCTCGTGTTACAGGC
    ATGGAGAACATCACATCACGACTCCTAGGACCCCTTCACGTGAAACAGGC
    ATGCTCAACGTCACATCAGGACTCCTAGGACCACGTCTCGTGTTACAGGG
    ATGGTGTACATCACGACAGGATTCCTCGGAATCGCGCTGGTGACACAGGC

    With the sequence IDs the data would look like this.

    >seq1
    ATGGAGAACATCACATCAGGACTCCTAGGACCCCTTCTCGTGTTACAGGC
    >seq2
    ATGGAGAACATCACATCACGACTCCTAGGACCCCTTCACGTGAAACAGGC
    >seq3
    ATGCTCAACGTCACATCAGGACTCCTAGGACCACGTCTCGTGTTACAGGG
    >seq4
    ATGGTGTACATCACGACAGGATTCCTCGGAATCGCGCTGGTGACACAGGC

    ### update ###

    I have placed real data in my public scratchpad.

      I get these results:

      484593-3 Best match len:22 betwixt 0:10 & 2:10 ATGGAGAACATCACATCAGGACTCCTAGGACCCCTTCTCGTGTTACAGGC TCACATCAGGACTCCTAGGACC ATGCTCAACGTCACATCAGGACTCCTAGGACCACGTCTCGTGTTACAGGG TCACATCAGGACTCCTAGGACC 6 trials of N:4 L:50 MIN:2 ( 4.098ms total), 682us/trial

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      So, just to see if I understand the task... given those four (truncated?) lines of input data, would the following be the "right" answer?
      LCS for 0 :: 1 = |ATGGAGAACATCACATCA| LCS for 0 :: 2 = |TCACATCAGGACTCCTAGGACC| LCS for 0 :: 3 = |CATCAC| LCS for 1 :: 2 = |ACTCCTAGGACC| LCS for 1 :: 3 = |CATCAC| LCS for 2 :: 3 = |CAGGA|
      This doesn't keep track of the actual index offsets where the longest match actually starts in each string for each pairwise comparison, but that would be easy to add.

      That's the output from the code posted in my later reply in this thread, given those four lines of sample data as input.

        Unfortunately, the answers are not simple and that's my fault for not being more careful with the test data I created. The correct answers are:

        0 :: 1 ATGGAGAACATCACATCA and GACTCCTAGGACCCCTTC 0 :: 2 TCACATCAGGACTCCTAGGACC 0 :: 3 ACATCAC 1 :: 2 GACTCCTAGGACC 1 :: 3 ACATCAC 2 :: 3 CAGGA and ACAGG

        Unfortunately my data doesn't give simple answers, and that is my fault completely for not being more careful with how I constructed the test data. My results are:

        0 :: 1 ATGGAGAACATCACATCA and GACTCCTAGGACCCCTTC 0 :: 2 TCACATCAGGACTCCTAGGACC 0 :: 3 ACATCAC 1 :: 2 GACTCCTAGGACC 1 :: 3 ACATCAC 2 :: 3 CAGGA and ACAGG

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://484598]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (5)
As of 2024-03-29 10:57 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found