Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

I guess I'd start by sorting the strings by length - largest first (hopefully the order of unique strings won't matter - if they do, then it gets moderately, but not immensely, more complex):

@sequences = sort { length($b) <=> length($a) } @sequences
Once we have that, then I would go through and find the unique ones. Now, given that we are going from largest to smallest, the item we're about to put in the array must either match or be a subset of an existing item. It cannot be the other way around: that an item in the output list is a subset of the item we're looking at. Thus we only have to check against the list that we have so far:
use List::MoreUtils qw(any); my @uniq_sequences; for my $seq (@sequences) { push @uniq_sequences, $seq if any { index $_, $seq >= 0 } @uniq_sequences; }
Now, since we're only looping over the large array once, and the small array many times, this may be Fast Enough. In fact, we're not looping over everything in the small array - unless there is no match found (or the item we're looking for is the last one).

There are definitely items I can think of benchmarking to see if there are speedups to be found. One possibility would be to go from small to large (reverse order) and see if a regex-optimiser could smoosh the words you're looking for into an optimised search. While I doubt this would actually speed anything up (the optimisation might dwarf the loop in my original solution above), only a benchmark would be sure.

Another possibility would be to compile your sequences down to byte sequences. Since this looks like DNA, thus only four letters, each position could basically be a 2 bits of data: A could be 0b00, G could be 0b01, C could be 0b10, and T could be 0b11. With some serious bit-manipulating math, where you have to shift stuff around for comparisons, you may be able to get more speed. Again, a benchmark would need to be done to be sure. This one likely holds some promise, but at the expense of some serious development time. My guess? It's not worth it. For the amount of money your employer would be paying you for that amount of time, it's probably cheaper to buy a faster CPU and/or more RAM, to run the original algorithm. And then you'll have a faster computer, too :-) (Of course, if someone gives you an already-tested solution to your issue using this algorithm, especially if it comes with unit tests, then TAKE IT.)


In reply to Re: list of unique strings, also eliminating matching substrings by Tanktalus
in thread list of unique strings, also eliminating matching substrings by lindsay_grey

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others chanting in the Monastery: (3)
    As of 2020-11-29 07:14 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      No recent polls found

      Notices?