Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Re: list of unique strings, also eliminating matching substrings

by Tanktalus (Canon)
on May 21, 2011 at 03:41 UTC ( #906032=note: print w/replies, xml ) Need Help??


in reply to list of unique strings, also eliminating matching substrings

I guess I'd start by sorting the strings by length - largest first (hopefully the order of unique strings won't matter - if they do, then it gets moderately, but not immensely, more complex):

@sequences = sort { length($b) <=> length($a) } @sequences
Once we have that, then I would go through and find the unique ones. Now, given that we are going from largest to smallest, the item we're about to put in the array must either match or be a subset of an existing item. It cannot be the other way around: that an item in the output list is a subset of the item we're looking at. Thus we only have to check against the list that we have so far:
use List::MoreUtils qw(any); my @uniq_sequences; for my $seq (@sequences) { push @uniq_sequences, $seq if any { index $_, $seq >= 0 } @uniq_sequences; }
Now, since we're only looping over the large array once, and the small array many times, this may be Fast Enough. In fact, we're not looping over everything in the small array - unless there is no match found (or the item we're looking for is the last one).

There are definitely items I can think of benchmarking to see if there are speedups to be found. One possibility would be to go from small to large (reverse order) and see if a regex-optimiser could smoosh the words you're looking for into an optimised search. While I doubt this would actually speed anything up (the optimisation might dwarf the loop in my original solution above), only a benchmark would be sure.

Another possibility would be to compile your sequences down to byte sequences. Since this looks like DNA, thus only four letters, each position could basically be a 2 bits of data: A could be 0b00, G could be 0b01, C could be 0b10, and T could be 0b11. With some serious bit-manipulating math, where you have to shift stuff around for comparisons, you may be able to get more speed. Again, a benchmark would need to be done to be sure. This one likely holds some promise, but at the expense of some serious development time. My guess? It's not worth it. For the amount of money your employer would be paying you for that amount of time, it's probably cheaper to buy a faster CPU and/or more RAM, to run the original algorithm. And then you'll have a faster computer, too :-) (Of course, if someone gives you an already-tested solution to your issue using this algorithm, especially if it comes with unit tests, then TAKE IT.)

Replies are listed 'Best First'.
Re^2: list of unique strings, also eliminating matching substrings
by lindsay_grey (Novice) on May 21, 2011 at 03:53 UTC
    Thank you! I will definitely try this and hope that it is Fast Enough so as to avoid the Serious Bit-manipulating Math!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://906032]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (5)
As of 2020-12-03 00:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    How often do you use taint mode?





    Results (49 votes). Check out past polls.

    Notices?