
Re^2: Search for identical substrings

by bioMan (Beadle)
on Aug 18, 2005 at 16:19 UTC ( [id://484851]=note )


in reply to Re: Search for identical substrings
in thread Search for identical substrings

Your time estimates agree with mine. I calculated a time of completion of 3 years.

Thank you for your offer. I would like to look at all my options first, including abandoning the project, optimizing the data by removing redundant sequences (no easy task given the lack of documentation for some of my data), or subclassing the data into smaller sets of sequences.

I would also like to look at the other responses I've received, but I will not forget your offer.

Replies are listed 'Best First'.
Re^3: Search for identical substrings
by GrandFather (Saint) on Aug 19, 2005 at 00:18 UTC

    Can you generate a data set that is representative of the problem and put it in your scratchpad?


    Perl is Huffman encoded by design.

      I have placed six actual strings from my database into my public scratchpad. Each string is formatted as follows:

      >string 1
      ATGCTGTAGCATGCATG...CGATCATGTGACTACGT
      >string 2
      . . .

      The first line starts with ">" followed by a string ID. The second line is the actual data string.
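The two-line record layout described above can be read with a few lines of code. What follows is only a minimal sketch, written in Python for illustration rather than in the Perl used elsewhere in this thread. The function name `parse_records` and the sample strings are made up for the example, and joining several data lines into one string is an assumption (the post only describes a single data line per record).

```python
def parse_records(text):
    """Return a list of (id, sequence) pairs from '>'-headed records."""
    records = []
    current_id = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current_id is not None:
                records.append((current_id, "".join(chunks)))
            current_id = line[1:].strip()
            chunks = []
        elif current_id is not None:
            # Assumption: data spread over several lines is concatenated.
            chunks.append(line)
    if current_id is not None:
        records.append((current_id, "".join(chunks)))
    return records


# Made-up sample data, not taken from the scratchpad.
sample = ">string 1\nATGCTGTAGC\n>string 2\nCGATCATGTG\n"
records = parse_records(sample)
```

Stripping the `>` header lines before searching matters: a later post in this thread notes that leaving the labels in shifted every reported offset.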

        Here are my results from the 6 sequences you posted on your scratchpad. I must assume that this is a "constructed dataset" as all the LCSs are found at the same offset in both sequences in which they occur? I thought this was a bug when I first saw it, but it doesn't happen with any of my test data.

        There were no duplicate equal length matches. Some of the LCSs shown below are truncated for posting, but the Length and (offsets) and first 80 or so characters should be enough to verify the results. Confirmation or otherwise would be nice to have.
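For anyone wanting to confirm a pair independently, a textbook dynamic-programming search for the longest common substring of two strings is enough for sequences of this size. This is a sketch of that standard technique, written in Python for illustration; it is not the program that produced the results in this post, whose method is not shown here. The function name and the two short test strings are made up, and the offsets it returns are 0-based, which may or may not match the convention of the posted offsets.

```python
def longest_common_substring(a, b):
    """Return (length, offset_in_a, offset_in_b) for the first longest
    common substring of a and b, using 0-based offsets."""
    best_len, best_a, best_b = 0, 0, 0
    # prev[j] holds the length of the common run ending at a[i-2], b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended one character earlier in both strings.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_a = i - best_len
                    best_b = j - best_len
        prev = curr
    return best_len, best_a, best_b


# Made-up strings sharing "ATGCATG" at offset 2 in both.
a = "GGATGCATGCTT"
b = "CCATGCATGAAA"
length, off_a, off_b = longest_common_substring(a, b)
common = a[off_a:off_a + length]
```

The running time grows with the product of the two string lengths, so this is a checking tool for individual pairs rather than a way to process the whole database.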

        If this data is representative, the time taken for the 15 pairings projects to a total runtime of around 58 hours for your 300x3k dataset. Somewhat more palatable than 3 years :)
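The arithmetic behind that projection can be reproduced as below. This is a sketch in Python, assuming "300x3k" means 300 sequences compared pairwise and that every pair takes about as long as the measured average; the timing figures are the ones reported in the output of this post (70.142 s for 15 trials).

```python
from math import comb

# Measured: 15 pairings of the 6 scratchpad sequences took 70.142 s in total.
pairs_sample = comb(6, 2)              # 6 sequences -> 15 pairs
seconds_per_pair = 70.142 / pairs_sample

# Assumption: the full dataset is 300 sequences, every pair compared once.
pairs_full = comb(300, 2)              # 44850 pairs
projected_hours = pairs_full * seconds_per_pair / 3600

print(f"{pairs_full} pairs, about {projected_hours:.1f} hours")
```

The estimate scales with the number of pairs, so doubling the number of sequences roughly quadruples the projected time.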

        Had you only wanted the single longest common string in the dataset, I can do that in under 6 hours.

        Updated: The offsets originally shown were all +10 due to my failing to remove the sequence labels. This has now been corrected.

        P:\test>484593-4 bioman.dat
        000:001 L[  72] (1557 1557) 'CCTTCTCATCTGCCGGACCGTGTGCACTTCGCTTCACCTCTGCACGTCGCATGGAGACCACCGTGAACGCCC'
        000:002 L[1271] (  82   82) 'CAGAACCCTGCTCCGACTATTGCCTCTCTCACATCATCAATCTTCTTGAAGACTGGGGGCCCTGCTACGAACATGGACA
        000:003 L[ 225] (1128 1128) 'CAATACATGAACCTTTACCCCGTTGCTCGGCAACGGCCAGGCCTGTGCCAAGTGTTTGCTGACGCAACCCCCACTGGTT
        000:004 L[ 191] ( 619  619) 'TGGGCTTTAGGAAAATACCTATGGGAGTGGGCCTCAGCCCGTTTCTCCTGGCTCAGTTTACTAGTGCAATTTGTTCAGT
        000:005 L[ 269] ( 292  292) 'GGGTGTCCTGGCCAAAATTCGCAGTCCCCAACCTCCAATCACTTACCAACCTCCTGTCCTCCAACTTGTCCTGGCTATC
        001:002 L[  72] (1557 1557) 'CCTTCTCATCTGCCGGACCGTGTGCACTTCGCTTCACCTCTGCACGTCGCATGGAGACCACCGTGAACGCCC'
        001:003 L[  72] (1557 1557) 'CCTTCTCATCTGCCGGACCGTGTGCACTTCGCTTCACCTCTGCACGTCGCATGGAGACCACCGTGAACGCCC'
        001:004 L[  80] (1764 1764) 'TCTTTGTACTAGGAGGCTGTAGGCATAAATTGGTCTGTTCACCAGCACCATGCAACTTTTTCACCTCTGCCTAATCAT
        001:005 L[  72] (1557 1557) 'CCTTCTCATCTGCCGGACCGTGTGCACTTCGCTTCACCTCTGCACGTCGCATGGAGACCACCGTGAACGCCC'
        002:003 L[ 320] (1128 1128) 'CAATACATGAACCTTTACCCCGTTGCTCGGCAACGGCCAGGCCTGTGCCAAGTGTTTGCTGACGCAACCCCCACTGGTT
        002:004 L[ 191] ( 619  619) 'TGGGCTTTAGGAAAATACCTATGGGAGTGGGCCTCAGCCCGTTTCTCCTGGCTCAGTTTACTAGTGCAATTTGTTCAGT
        002:005 L[ 269] ( 292  292) 'GGGTGTCCTGGCCAAAATTCGCAGTCCCCAACCTCCAATCACTTACCAACCTCCTGTCCTCCAACTTGTCCTGGCTATC
        003:004 L[ 161] (1128 1128) 'CAATACATGAACCTTTACCCCGTTGCTCGGCAACGGCCAGGCCTGTGCCAAGTGTTTGCTGACGCAACCCCCACTGGTT
        003:005 L[ 510] (2693 2693) 'AAACCCTATTATCCTGATAACGTGGTTAATCATTATTTTAAGACCAGACACTATTTGCATACTTTATGGAAGGCAGGCA
        004:005 L[ 148] (1138 1138) 'ACCTTTACCCCGTTGCTCGGCAACGGCCAGGCCTGTGCCAAGTGTTTGCTGACGCAACCCCCACTGGTTGGGGCTTGGC
        15 trials of bioman.dat ( 70.142s total), 4.676s/trial

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        The "good enough" may be good enough for the now, and perfection may be unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
