<?xml version="1.0" encoding="windows-1252"?>
<node id="906085" title="Re: list of unique strings, also eliminating matching substrings" created="2011-05-21 15:14:18" updated="2011-05-21 15:14:18">
<type id="11">
note</type>
<author id="533863">
roboticus</author>
<data>
<field name="doctext">
&lt;p&gt;[lindsay_grey]:&lt;/p&gt;
&lt;p&gt;I've been plunking around for about 4 hours on this one (it's an interesting problem!).  I first built a test data generator to generate some datasets.&lt;/p&gt;
&lt;readmore&gt;
&lt;c&gt;
#!/usr/bin/perl
#
#       gen_random_string.pl &lt;CNT&gt; &lt;SYMS&gt; &lt;minlen&gt; &lt;maxlen&gt;
#
#       Generate &lt;CND&gt; random strings between &lt;minlen&gt; and &lt;maxlen&gt; characters
#       using only the characters in &lt;SYMS&gt;.
#
use strict;
use warnings;

my $cnt = shift;
my $SYMS = shift;
my $min = shift;
my $max = shift or die usage('Missing args!');

die "CNT ('$cnt') must be numeric!\n" if $cnt =~ /[^0-9]/;
die "MIN ('$min') must be numeric!\n" if $min =~ /[^0-9]/;
die "MAX ('$max') must be numeric!\n" if $max =~ /[^0-9]/;
die "SYMS must be longer than 0 characters!\n" if length($SYMS) &lt; 1;
if ($min&gt;$max) {
        print "Min must be &lt;= Max, swapping!\n";
        my $t=$min; $min=$max; $max=$t;
}

my @syms = split //,$SYMS;
#print join(", ", @syms),"\n";

while ($cnt) {
        my $len=$min + int(($max-$min)*rand);
        my $t='';
        $t .= $syms[int(rand(@syms))] for 1 .. $len;
        print "$t\n";
        --$cnt;
}
&lt;/c&gt;
&lt;/readmore&gt;
&lt;p&gt;My primary datasets are 100, 200, 500, 1000, 2000, 5000 and 10000 strings each, where the strings are between 15 and 25 characters long.  I generated them like:&lt;/p&gt;
&lt;c&gt;
$ for J in {1,2,5}0{0,00,000}; do echo $J; perl gen_random_string.pl $J ACGTN 15 25 &gt;t.$J; done
&lt;/c&gt;
&lt;p&gt;I next created a trivial brute-force solver:&lt;/p&gt;
&lt;readmore&gt;
&lt;c&gt;
#!/usr/bin/perl
#
#       multi-string-match_brute_force.pl &lt;FName&gt;
#
use strict;
use warnings;
use feature ':5.10';

my $fname = shift;
open my $FH, '&lt;', $fname or die;
my @candidates = &lt;$FH&gt;;

@candidates =
        grep { /^[ACGTN]+$/ } # delete the comments
        map { s/^\s+//; s/\s+$//; $_ }
        @candidates;

my $start = time;
@candidates = sort { length($a) &lt;=&gt; length($b) || $a cmp $b } @candidates;
my @unique;
my $cnt_dup=0;
OUTER:
while (my $t = shift @candidates) {
        for my $u (@unique) {
                if ($t =~ /$u/) {
                        ++$cnt_dup;
                        next OUTER;
                }
        }
        push @unique, $t;
}
my $end = time - $start;

print scalar(@unique), " unique items.\n";
print "$cnt_dup rejected.\n";
print "$end seconds\n";
&lt;/c&gt;
&lt;/readmore&gt;
&lt;p&gt;The brute force solver told me that all my datasets contained only unique strings.  So I created some datasets with plenty of duplicates:&lt;/p&gt;
&lt;c&gt;
$ cat t.100 t.100 t.100 &gt; t.300
$ cat t.1000 t.1000 t.1000 &gt; t.3000
$ cat t.10000 t.10000 t.10000 &gt; t.30000
$ cat t.100 t.300 &gt; t.400
$ cat t.1000 t.3000 &gt; t.4000
$ cat t.10000 t.30000 &gt; t.40000
&lt;/c&gt;
&lt;p&gt;I've been monkeying with some different bits, but my best two (so far) give me the times:&lt;/p&gt;
&lt;c&gt;
num      brute
strings  force    Robo1  Robo2
-------  -------  -----  -----
    100     .125   .125   .110
    200     .234   .172   .125
    300     .202   .141   .110
    400     .234   .156   .110
    500    1.030   .187   .125
   1000    3.916   .265   .188
   2000   15.288   .390   .265
   3000   11.435   .546   .328
   4000   15.319   .656   .422
   5000   93.600   .858   .546
  10000  377.412  1.638  1.029
  20000           3.151  1.981
  30000           4.493  2.621
  40000           5.866  3.417
  50000                  4.929
&lt;/c&gt;
&lt;p&gt;I then created a few datasets with strings between 200 and 300 characters to see how my better one did:&lt;/p&gt;
&lt;c&gt;
# str     Robo2  Notes
------   ------  --------------
  1000    0.687  unique
  2000    1.264  1000 unique
 10000    6.412  unique
 20000   11.887  10000 unique
100000   65.224  unique
200000  126.190  100000 unique
&lt;/c&gt;
&lt;p&gt;I'll wait a little while before posting my solution, as I don't want to spoil things for people still working on it right now.&lt;/p&gt;
&lt;p&gt;...[roboticus]&lt;/p&gt;
&lt;p&gt;&lt;i&gt;When your only tool is a hammer, all problems look like your thumb.&lt;/i&gt;&lt;/p&gt;</field>
<field name="root_node">
906020</field>
<field name="parent_node">
906020</field>
</data>
</node>
