Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

I guess I'd start by sorting the strings by length - largest first (hopefully the order of unique strings won't matter - if they do, then it gets moderately, but not immensely, more complex):

@sequences = sort { length($b) <=> length($a) } @sequences
Once we have that, then I would go through and find the unique ones. Now, given that we are going from largest to smallest, the item we're about to put in the array must either match or be a subset of an existing item. It cannot be the other way around: that an item in the output list is a subset of the item we're looking at. Thus we only have to check against the list that we have so far:
use List::MoreUtils qw(any); my @uniq_sequences; for my $seq (@sequences) { push @uniq_sequences, $seq if any { index $_, $seq >= 0 } @uniq_sequences; }
Now, since we're only looping over the large array once, and the small array many times, this may be Fast Enough. In fact, we're not looping over everything in the small array - unless there is no match found (or the item we're looking for is the last one).

There are definitely items I can think of benchmarking to see if there are speedups to be found. One possibility would be to go from small to large (reverse order) and see if a regex-optimiser could smoosh the words you're looking for into an optimised search. While I doubt this would actually speed anything up (the optimisation might dwarf the loop in my original solution above), only a benchmark would be sure.

Another possibility would be to compile your sequences down to byte sequences. Since this looks like DNA, thus only four letters, each position could basically be a 2 bits of data: A could be 0b00, G could be 0b01, C could be 0b10, and T could be 0b11. With some serious bit-manipulating math, where you have to shift stuff around for comparisons, you may be able to get more speed. Again, a benchmark would need to be done to be sure. This one likely holds some promise, but at the expense of some serious development time. My guess? It's not worth it. For the amount of money your employer would be paying you for that amount of time, it's probably cheaper to buy a faster CPU and/or more RAM, to run the original algorithm. And then you'll have a faster computer, too :-) (Of course, if someone gives you an already-tested solution to your issue using this algorithm, especially if it comes with unit tests, then TAKE IT.)

In reply to Re: list of unique strings, also eliminating matching substrings by Tanktalus
in thread list of unique strings, also eliminating matching substrings by lindsay_grey

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or or How to display code and escape characters are good places to start.
Log In?

What's my password?
Create A New User
Domain Nodelet?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (3)
As of 2021-09-28 01:13 GMT
Find Nodes?
    Voting Booth?

    No recent polls found