http://www.perlmonks.org?node_id=907255


in reply to Re^4: list of unique strings, also eliminating matching substrings
in thread list of unique strings, also eliminating matching substrings

Try this on that same dataset and let me know how you get on. On my generated test data it takes ~30 minutes for 200,000 strings, but how realistic that dataset is ...

Use thisScript.pl inFile > outFile:

#! perl -slw use strict; use Inline C => Config => BUILD_NOISY => 1; use Inline C => <<'END_C', NAME => '_906020', CLEAN_AFTER_BUILD => 0; #include <string.h> int longCmp( SV *needle, SV *haystack, SV *offset ) { STRLEN ln, lh, o = SvIV( offset ); char *n = SvPV( needle, ln ), *h = SvPV( haystack, lh ); char *nl = n + ln - 1; int diff = lh - ln; int flag = 0, i; h += o; lh -= o; diff -= o; if( diff <= 0 ) return 0; for( i = 0; i < diff; ++i ) { if( ! h[ i + ln - 1 ] ) { i += ln; continue; } if( h[ i ] != *n || h[ i+ ln-1 ] != *nl ) continue; if( strncmp( h+i, n, ln ) ) continue; return i; } return 0; } END_C use Time::HiRes qw[ time ]; sub uniq{ my %x; @x{@_} = (); keys %x } my $start = time; my @uniq = uniq <>; chomp @uniq; @uniq = sort{ length $a <=> length $b } @uniq; my $all = join chr(0), @uniq; my $p = 0; for my $x ( @uniq ) { $p += 1+ length $x; next if longCmp( $x, $all, $p ); print $x; } printf STDERR "Took %.3f\n", time() - $start; __END__

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^6: list of unique strings, also eliminating matching substrings
by lindsay_grey (Novice) on May 30, 2011 at 20:07 UTC

    I seem to be having a problem with Inline. Maybe I just need to get one of the "apple-darwin" files it is looking for. I will keep trying. Thanks for posting your code.

    Starting "make" Stage i686-apple-darwin10-gcc-4.2.1: Module: No such file or directory i686-apple-darwin10-gcc-4.2.1: no input files i686-apple-darwin10-gcc-4.2.1: Module: No such file or directory i686-apple-darwin10-gcc-4.2.1: no input files powerpc-apple-darwin10-gcc-4.2.1: Module: No such file or directory powerpc-apple-darwin10-gcc-4.2.1: no input files lipo: can't figure out the architecture type of: /var/folders/Jx/Jx+cO +TNTFTSKERHsDjE+nU+++TI/-Tmp-//cch73GJV.out make: *** [_906020.o] Error 1 A problem was encountered while attempting to compile and install your + Inline C code. The command that failed was: make

      That looks like you do not have Inline::C installed correctly, but I can't help you with that. If it is the case...ie. if the Inline::C installation tests are failing, then you shoudl post a new thread about that to get help.

      In the interim, you can try this pure perl version which is only half as fast as the inline version, but that should still be 7 times faster than your current solution. Let me know how you get on.

      #! perl -slw use strict; use Time::HiRes qw[ time ]; $|++; sub uniq{ my %x; @x{@_} = (); keys %x } my $start = time; my @uniq = uniq <>; chomp @uniq; @uniq = sort{ length $a <=> length $b } @uniq; my $all = join chr(0), @uniq; my $p = 0; for my $x ( @uniq ) { $p += 1+ length $x; next if 1+ index $all, $x, $p; ## COrrected per LanX below. print $x; } printf STDERR "Took %.3f\n", time() - $start;

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        just noticed that index returns -1 for a missing match.

        you say this worked?

            next if index $all, $x, $p;

        did you manipulate $[ somewhere???

        Cheers Rolf

        I had a similar idea but with some modifications:

        1. starting with the longest string and continuing in descending order

        2. then only appending the non-embeddable strings to $all

        like this $all is in average significantly shorter and the tests with index should be faster.

        I'm also wondering if the reallocation of new memory when appending to $all could be avoided by starting with a maximal length string and then shortening $all again.

        Maybe uniq() from List::MoreUtils is faster or could be completely avoided (after sorting identical strings always appear in a sequence)

        All of this highly depends on the nature of the unknown data and should only be tested with identical sets...

        Cheers Rolf