PerlMonks
list of unique strings, also eliminating matching substrings

by lindsay_grey (Novice)
on May 21, 2011 at 02:40 UTC ( #906020=perlquestion )
lindsay_grey has asked for the wisdom of the Perl Monks concerning the following question:

What is the best way to create a list of unique strings that also eliminates strings which are exact substrings of another string? For example, if I have AGCT, AGGT, GG, and AGCT, I only want to keep AGCT and AGGT. The direct string-comparison approaches I have been reading about (for example, creating a hash using the strings as keys, or comparing with cmp) would, I think, still consider AGGT and GG different.

I see I can choose one string and loop through all the others, comparing each pair with a pattern match, but I was wondering whether there is a more efficient way, as I have hundreds of sets of strings, each containing about 100,000 strings, and each string is about 300 characters long.
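To illustrate the problem with a plain hash-of-keys pass: it removes only the exact duplicate, leaving the substring GG behind (a minimal sketch):

```perl
use strict;
use warnings;

# Hash-of-keys dedup removes the second AGCT (an exact duplicate),
# but GG, a substring of AGGT, survives.
my @seqs = qw(AGCT AGGT GG AGCT);
my %seen;
my @unique = grep { !$seen{$_}++ } @seqs;
print "@unique\n";    # AGCT AGGT GG
```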

Thank you.

Re: list of unique strings, also eliminating matching substrings
by BrowserUk (Pope) on May 21, 2011 at 03:16 UTC
    I have hundreds of sets of strings, each containing about 100,000 strings, and each string is about 300 characters long.

    A few questions:

    1. You want to eliminate the dups in each of the files? Or across all of the files?
    2. What (roughly) are the maximum and minimum lengths of the strings?
    3. Do they consist solely of ACGT, or are there other characters (X, N, etc.)?

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      1. I want to eliminate the duplicates within each file.

      2. The strings range from 200 to 400 characters.

      3. The complete alphabet is A, G, C, T, N.

      Note for point 1. I want to eliminate not just the exact duplicates but also those that are contained within a longer string.

        Presumably you've coded the obvious two-loop method and it is taking too long. Could you supply a timing for one of your datasets?


Re: list of unique strings, also eliminating matching substrings
by wind (Priest) on May 21, 2011 at 03:32 UTC
    First method, an old fashioned loop within a loop.
    my @list = qw(AGCT AGGT GG AGCT);

    MAIN: for my $i (0..$#list) {
        my $substr_re = qr/$list[$i]/;
        for my $j (0..$#list) {
            next if $i == $j || ! defined $list[$j];
            if ($list[$j] =~ $substr_re) {
                undef $list[$i];
                next MAIN;
            }
        }
    }

    my @unique = grep {defined} @list;
    print "$_\n" for @unique;

    Update: Increase efficiency by grouping the strings by size before processing:

    use strict;
    use warnings;

    my @list = qw(AGCT AGGT GG AGCT);

    my %bucket;
    for (@list) {
        push @{$bucket{length($_)}}, $_;
    }

    # Only want to sort these once.
    my @sizes = sort {$a <=> $b} keys %bucket;

    while (my $size = shift @sizes) {
        MAIN: for my $i (0..$#{$bucket{$size}}) {
            # Same size first
            for my $j ($i+1..$#{$bucket{$size}}) {
                if ($bucket{$size}[$i] eq $bucket{$size}[$j]) {
                    undef $bucket{$size}[$i];
                    next MAIN;
                }
            }
            # Bigger strings
            my $substr_re = qr/$bucket{$size}[$i]/;
            for my $bigger (@sizes) {
                for my $str (@{$bucket{$bigger}}) {
                    if ($str =~ $substr_re) {
                        undef $bucket{$size}[$i];
                        next MAIN;
                    }
                }
            }
        }
    }

    my @unique = grep {defined} map {@$_} values %bucket;
    print "$_\n" for @unique;
      I was thinking that might be necessary but hoping not. Thank you for posting your code.
      Thank you! I will definitely try this too.
Re: list of unique strings, also eliminating matching substrings
by Tanktalus (Canon) on May 21, 2011 at 03:41 UTC

    I guess I'd start by sorting the strings by length, largest first (hopefully the order of the unique strings won't matter; if it does, things get moderately, but not immensely, more complex):

    @sequences = sort { length($b) <=> length($a) } @sequences;
    Once we have that, then I would go through and find the unique ones. Now, given that we are going from largest to smallest, the item we're about to put in the array must either match or be a subset of an existing item. It cannot be the other way around: that an item in the output list is a subset of the item we're looking at. Thus we only have to check against the list that we have so far:
    use List::MoreUtils qw(any);

    my @uniq_sequences;
    for my $seq (@sequences) {
        push @uniq_sequences, $seq
            unless any { index($_, $seq) >= 0 } @uniq_sequences;
    }
    Now, since we're only looping over the large array once, and the small array many times, this may be Fast Enough. In fact, we're not looping over everything in the small array - unless there is no match found (or the item we're looking for is the last one).

    There are definitely items I can think of benchmarking to see if there are speedups to be found. One possibility would be to go from small to large (reverse order) and see if a regex-optimiser could smoosh the words you're looking for into an optimised search. While I doubt this would actually speed anything up (the optimisation might dwarf the loop in my original solution above), only a benchmark would be sure.

    Another possibility would be to compile your sequences down to byte sequences. Since this looks like DNA, thus only four letters, each position could basically be a 2 bits of data: A could be 0b00, G could be 0b01, C could be 0b10, and T could be 0b11. With some serious bit-manipulating math, where you have to shift stuff around for comparisons, you may be able to get more speed. Again, a benchmark would need to be done to be sure. This one likely holds some promise, but at the expense of some serious development time. My guess? It's not worth it. For the amount of money your employer would be paying you for that amount of time, it's probably cheaper to buy a faster CPU and/or more RAM, to run the original algorithm. And then you'll have a faster computer, too :-) (Of course, if someone gives you an already-tested solution to your issue using this algorithm, especially if it comes with unit tests, then TAKE IT.)
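The 2-bit packing idea can be sketched with Perl's vec. This is a rough illustration only, not a working matcher: it assumes an A/G/C/T-only alphabet (the N in the real data would need a third bit or an escape), and substring tests on packed strings only line up at matching 2-bit offsets, which is exactly the "serious bit-manipulating math" mentioned above.

```perl
use strict;
use warnings;

# Rough sketch of the 2-bit packing idea. Assumes A/G/C/T only;
# the N in the real alphabet would need a 3rd bit or an escape.
my %code = ( A => 0b00, G => 0b01, C => 0b10, T => 0b11 );

sub pack2bit {
    my ($seq) = @_;
    my $packed = '';
    # vec() writes 2-bit fields, so four bases fit in one byte
    vec( $packed, $_, 2 ) = $code{ substr $seq, $_, 1 }
        for 0 .. length($seq) - 1;
    return $packed;
}

# A 300-character sequence shrinks from 300 bytes to 75
print length( pack2bit( 'ACGT' x 75 ) ), "\n";
```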

      Thank you! I will definitely try this and hope that it is Fast Enough so as to avoid the Serious Bit-manipulating Math!
Re: list of unique strings, also eliminating matching substrings
by AnomalousMonk (Monsignor) on May 21, 2011 at 04:05 UTC

    For eliminating 'duplicates' (as I understand you to define them) within each 'set' of sequences (i.e., each file), maybe something like:

    >perl -wMstrict -le
    "use List::MoreUtils qw(uniq);
    ;;
    my @seqs = qw(AGCT AGGT GG AGCT CTAG);
    ;;
    my $seen = '';
    my $delim = ':';
    ;;
    my @no_dups =
        grep { ($seen !~ m{$_}xms) && ($seen .= $delim . $_) }
        uniq @seqs;
    print qq{'$_'} for @no_dups;
    "
    'AGCT'
    'AGGT'
    'CTAG'

    Update: Using index might be slightly faster than using a regex in the preceding code
        grep { (index($seen, $_) < 0) && ($seen .= $delim . $_) }
    but I wouldn't count on it. When in doubt, Benchmark.
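    That comparison might look like the following sketch using the core Benchmark module. The data here is invented purely for illustration: a delimiter-joined $seen string of random 300-mers and a needle that is almost surely absent.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Invented data: a $seen accumulator of 200 random 300-mers
# and a needle that is (almost surely) not present.
my @bases  = qw(A C G T);
my $seen   = join ':',
             map { join '', map { $bases[ rand @bases ] } 1 .. 300 } 1 .. 200;
my $needle = 'ACGTACGTACGT';

# Compare a plain index() scan against an equivalent literal-match regex
cmpthese( -1, {
    index => sub { my $found = index( $seen, $needle ) >= 0 },
    regex => sub { my $found = $seen =~ m{\Q$needle\E} },
} );
```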

    Sheepish Update: The approach given in the initial reply does not work (insofar as I understand the requirement). This can be confirmed with the test set
        qw(AG GC CT AGCT AGGT GG AGCT CTAG)
    instead of the one given originally: although AG GC CT are substrings of subsequent sequences, they are not eliminated.

    However, I have another approach that is, I believe, more satisfactory. It attempts to do substring elimination entirely within the regex engine. The order of the input sequence array is maintained.

    Belated Update: After futzing with this problem a bit more, I have finally settled on an approach using index to scan for and eliminate substrings after eliminating identical sequences of equal length with uniq (see List::MoreUtils). The primary motivation behind the regex approach of my sheepish update was to gain some experience with the new Special Backtracking Control Verbs of 5.10+ and to introduce myself to (*COMMIT). However, it seems to me that index is likely to be much more efficient, although I have made no attempt at any benchmarking.

    In any event, here is my final cut.
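    A sketch of the approach described above (uniq to drop exact duplicates, then an index scan from longest to shortest); the sort order and loop shape here are guesses for illustration, not AnomalousMonk's actual code:

```perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);

# uniq removes exact duplicates; then, scanning longest-first,
# a string is kept only if no already-kept string contains it.
my @seqs = qw(AG GC CT AGCT AGGT GG AGCT CTAG);

my @kept;
for my $seq ( sort { length($b) <=> length($a) } uniq @seqs ) {
    push @kept, $seq unless grep { index( $_, $seq ) >= 0 } @kept;
}
print "$_\n" for @kept;
```

With the test set from the sheepish update above, this keeps only AGCT, AGGT, and CTAG.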

Re: list of unique strings, also eliminating matching substrings
by GrandFather (Cardinal) on May 21, 2011 at 06:02 UTC

    When you say "about 300 characters long", what is the actual range? Are there any constraints on where a substring may match a larger string? Can there be exact matches within a set of strings and if so should duplicates be removed?

    Update: length question already answered I see.

    Update: and the key question I didn't ask: how many strings of the original 100,000 do you expect you might end up with after duplicates and substrings are removed?

    True laziness is hard work

      I don't think there are any constraints on where a substring may match a larger string.

      There can be exact matches. We are removing those using the unique function.

      For the current set of sequences (assuming the program is working correctly), we go from 206,737 sequences to 166,089. I am not sure how many are removed because they are exact matches and how many because they match a substring of a larger sequence. I will add that check.

      I think the program is working correctly, but it takes a long time, longer than it seems it should (~5 hours on a pretty fast computer). Getting the unique sequences is fast, but we have an additional step where we go back, retrieve the sequence name, and then do the substring comparison. It is these two steps that are so slow.

Re: list of unique strings, also eliminating matching substrings
by jaredor (Deacon) on May 21, 2011 at 07:58 UTC

    Here's something that uses some of the best ideas of wind (proceed by decreasing length) and AnomalousMonk (try using index as a first pass at matching).

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my @strings = qw(AGCT AGGT GG AGCT);

    my %uniques;
    $uniques{shift @strings}++ while @strings;

    my @slots;
    while (my $i = each %uniques) {
        push @{$slots[length $i]}, $i;
        delete $uniques{$i};
    }

    my $master = join (':', @{pop @slots});
    while (@slots) {
        my @nomatch = grep {index ($master, $_) < 0} @{pop @slots or []};
        $master .= ':' . join (':', @nomatch) if @nomatch;
    }

    # answer
    $master =~ s/:/\n/g;
    print $master;

    BrowserUK's questions are excellent. They affected my comment by reminding me to set expectations: My suggested solution will require a machine with enough memory to hold the entire data set. However, I've tried to keep the memory usage not too much more than that.

    If I were more of a regexp whiz, I'd try to come up with some way of reducing matching up to a ':' boundary, but I'm not. Besides, I'm a recovering FORTRAN programmer, so the index command is the programming equivalent of comfort food for me.

    I used a slightly more robust test set than what is listed here, but I emphasize "slightly". YMMV on real data with more corner cases....

    Edit: "across" replaced by "up to" in the penultimate paragraph above. If you want to match a 399 character string within a 400 character string, you only need to check matches starting with the first two characters of the 400 character string, but the master string concatenation of all reference strings defeats any such understanding index may have. The hope is that one index on a long string is faster than N index calls on smaller strings (but I'm too lazy to check this today :-) It is very tempting to try to compile the master string into a savvy regexp (with the "o" flag) anew with each iteration of the last while loop and I'd be interested in seeing any such solution.

Re: list of unique strings, also eliminating matching substrings
by roboticus (Canon) on May 21, 2011 at 19:14 UTC

    lindsay_grey:

    I've been plunking around for about 4 hours on this one (it's an interesting problem!). I first built a test data generator to generate some datasets.

    My primary datasets are 100, 200, 500, 1000, 2000, 5000 and 10000 strings each, where the strings are between 15 and 25 characters long. I generated them like:

    $ for J in {1,2,5}0{0,00,000}; do echo $J; perl gen_random_string.pl $J ACGTN 15 25 >t.$J; done

    I next created a trivial brute-force solver:

    The brute force solver told me that all my datasets contained only unique strings. So I created some datasets with plenty of duplicates:

    $ cat t.100 t.100 t.100 > t.300
    $ cat t.1000 t.1000 t.1000 > t.3000
    $ cat t.10000 t.10000 t.10000 > t.30000
    $ cat t.100 t.300 > t.400
    $ cat t.1000 t.3000 > t.4000
    $ cat t.10000 t.30000 > t.40000

    I've been monkeying with some different bits, but my best two (so far) give me the times:

    num      brute
    strings  force    Robo1   Robo2
    -------  -------  ------  ------
        100     .125    .125    .110
        200     .234    .172    .125
        300     .202    .141    .110
        400     .234    .156    .110
        500    1.030    .187    .125
       1000    3.916    .265    .188
       2000   15.288    .390    .265
       3000   11.435    .546    .328
       4000   15.319    .656    .422
       5000   93.600    .858    .546
      10000  377.412   1.638   1.029
      20000            3.151   1.981
      30000            4.493   2.621
      40000            5.866   3.417
      50000                    4.929

    I then created a few datasets with strings between 200 and 300 characters to see how my better one did:

     # str    Robo2  Notes
    ------  -------  --------------
      1000    0.687  unique
      2000    1.264  1000 unique
     10000    6.412  unique
     20000   11.887  10000 unique
    100000   65.224  unique
    200000  126.190  100000 unique

    I'll wait a little while before posting my solution, as I don't want to spoil things for people still working on it right now.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Hmmm ... I thought there would be more activity on this thread. No-one seems to be actively working on it, so here's the code I used to get my timings.

      #!/usr/bin/perl
      #
      # multi-string-match.pl <FName>
      #
      # Grind through a set of strings, and keep only the ones that don't contain
      # any of the others as a substring. FName is a file containing a list of
      # strings, and if null, we'll use our test data.
      #
      # Inspired by perlmonks node 906020, and the Knuth-Morris-Pratt algorithm.
      #
      use strict;
      use warnings;
      use feature ':5.10';

      # function is 10.67 chars wide, so need to round up, or we can't find partials
      # (previous state will linger, so we can't find 'em!)
      my $hashwidth = 11;

      # our alphabet
      my %xlat = (A=>1, C=>2, G=>3, T=>4, N=>0);

      my @unique;
      my @candidates;
      my %MatchKeys;

      my $fname = shift;
      open my $FH, '<', $fname or die;
      @candidates = <$FH>;
      @candidates = grep { /^[ACGTN]+$/ }   # delete the comments
                    map { s/^\s+//; s/\s+$//; $_ }
                    @candidates;

      my $start = time;

      @candidates = sort { length($a) <=> length($b) || $a cmp $b } @candidates;

      my (@keypath, $t);   #, @chars, @keypath);
      my $cnt_dup = 0;

      CANDIDATE: while ($t = shift @candidates) {
          my $h = 0;
          my $keywidth = 0;
          @keypath = ();
          my $rMatchKeys = \%MatchKeys;
          my $fl_partial = -1;
          my $l = length($t);
          while ($keywidth < $l) {
              $h = hash(substr($t, $keywidth, 1), $h);
              ++$keywidth;
              if ($keywidth % $hashwidth == 0) {
                  push @keypath, $h;
              }
              if ($fl_partial < 0) {
                  # No current partial match
                  if (exists $MatchKeys{$h}) {
                      $rMatchKeys = $$rMatchKeys{$h};
                      $fl_partial = $keywidth;
                  }
              }
              else {
                  if ( ($keywidth - $fl_partial) % $hashwidth == 0 ) {
                      $rMatchKeys = exists($$rMatchKeys{$h}) ? $$rMatchKeys{$h} : undef;
                  }
                  elsif (exists($$rMatchKeys{REM}) and exists($$rMatchKeys{REM}{$h})) {
                      ++$cnt_dup;
                      next CANDIDATE;
                  }
              }
          }
          my $ar = [ $h, $keywidth % $hashwidth ];

          ### Add the path to %MatchKeys
          $rMatchKeys = \%MatchKeys;
          while (my $r = shift @keypath) {
              $$rMatchKeys{$r} = { } if !exists $$rMatchKeys{$r};
              $rMatchKeys = $$rMatchKeys{$r};
          }
          $$rMatchKeys{REM} = { } if !exists $$rMatchKeys{REM};
          if (exists($$rMatchKeys{REM}{$$ar[0]})
              and $$ar[1] == $$rMatchKeys{REM}{$$ar[0]}) {
              ++$cnt_dup;
              next CANDIDATE;
          }
          $$rMatchKeys{REM}{$$ar[0]} = $$ar[1];
          push @unique, $t;
      }
      my $end = time - $start;

      print scalar(@unique), " unique items\n";
      print "$cnt_dup rejected.\n";
      print "$end seconds.\n";

      sub hash {
          my ($curchar, $prevhash) = @_;
          $prevhash = ($prevhash * 8 + $xlat{$curchar}) & 0xffffffff;
      }

      ...roboticus


        I used this generator to create a 10000 string file where the first 5000 string are just randomly generated and the other 5000 are random substring extracted from the first 5000. Thus, you'd expect at most 5000 unique strings with a very slight possibility of there being fewer:

        #! perl -slw
        use strict;

        sub rndStr{ join'', @_[ map{ rand $#_ } 1 .. shift ] }

        our $N //= 10e3;
        my $halfN = $N >> 1;

        my @data;
        $#data = $N;

        $data[ $_ ] = rndStr( 200 + int( rand 200 ), 'A', 'C', 'G', 'T', 'N' )
            for 0 .. $halfN;

        $data[ $_ + $halfN ] = substr(
            $data[ $_ ], 10, 10 + int( rand( length( $data[ $_ ] ) - 20 ) )
        ) for 0 .. $halfN;

        print for @data;

        __END__
        C:\test> 906020-gen -N=10e3 > 906020.10e3

        When I run your code on this file it misses some dups:

        C:\test>906020-robo 906020.10e3
        5551 unique items
        4450 rejected.
        5 seconds.

        I gave your code a try. There are still substrings among the output, but I could not figure out why; it seems there are bugs in it. Nice code, though!
Re: list of unique strings, also eliminating matching substrings
by sundialsvc4 (Monsignor) on May 23, 2011 at 12:30 UTC

    Since memory size is a ruling constraint here, whereas disk space is not, one approach might involve first “exploding” the input data into its component parts, writing these to a disk file, sorting that file by the “component part” field, and then reading it back in ... thus taking advantage of the fact that (1) disk-based sorts are very efficient, and (2) in a known-sorted data set, all occurrences of any particular key are always adjacent.

    For instance, given any set of 100,000 strings, you first explode all of these into a file containing, probably, a few million tuples of (string, component). Then you sort this file into a spill-file by (component, string) using an external disk sort, then you read the spill file. Thus, using a trivial algorithm that only needs to compare “this” record with “the previous one,” you can winnow the wheat (strings which contain components that occur more than once, in strings that are not identical) from the chaff (strings that do not meet this criterion), in a single sequential pass. At first blush, it seems to me that the algorithm thus described might be a solution to your problem ... and it will work equally well for arbitrarily large datasets.

      Utter garbage!

      Since memory-size is a ruling constraint here,

      100,000 strings of max. 400 characters gives 40MB.

      Even with the overhead of an array with 64-bit pointers, the total memory requirement is 44.25MB (max).

      Even my 233MHz/128MB Thinkpad 770 from 1997 could have handled that.

      thus taking advantage of the fact that (1) disk-based sorts are very efficient

      No! They are not!

      Not when compared to memory based sorts.

      And given that the cheapest commodity PC you can buy can trivially handle sorting 44.25MB in memory in the blink of an eye (0.404149055480957 seconds on my machine), there is absolutely no point whatsoever in writing the stuff to disk in order to sort it.

      Just writing it to disk (cache) takes almost exactly as long (0.361000061035156 seconds). And that's before you've loaded up another process, to read it back to memory, sort it, write it back to disk and then read it back in.

      I just up voted one of your answers (re:COW) and then read this garbage. Why do you post this? It's like your brain is caught in a time warp.



Node Type: perlquestion [id://906020]
Approved by Tanktalus
Front-paged by planetscape