Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Re^3: list of unique strings, also eliminating matching substrings

by BrowserUk (Pope)
on May 21, 2011 at 05:53 UTC ( #906044=note: print w/replies, xml ) Need Help??


in reply to Re^2: list of unique strings, also eliminating matching substrings
in thread list of unique strings, also eliminating matching substrings

Presumably you've code the obvious two loops method and it is taking too long. Could you supply a timing for one of your datasets?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
  • Comment on Re^3: list of unique strings, also eliminating matching substrings

Replies are listed 'Best First'.
Re^4: list of unique strings, also eliminating matching substrings
by lindsay_grey (Novice) on May 29, 2011 at 22:43 UTC

    7 hours! for about 200,000 starting strings.

      Try this on that same dataset and let me know how you get on. On my generated test data it takes ~30 minutes for 200,000 strings, but how realistic that dataset is ...

      Use thisScript.pl inFile > outFile:

      #! perl -slw use strict; use Inline C => Config => BUILD_NOISY => 1; use Inline C => <<'END_C', NAME => '_906020', CLEAN_AFTER_BUILD => 0; #include <string.h> int longCmp( SV *needle, SV *haystack, SV *offset ) { STRLEN ln, lh, o = SvIV( offset ); char *n = SvPV( needle, ln ), *h = SvPV( haystack, lh ); char *nl = n + ln - 1; int diff = lh - ln; int flag = 0, i; h += o; lh -= o; diff -= o; if( diff <= 0 ) return 0; for( i = 0; i < diff; ++i ) { if( ! h[ i + ln - 1 ] ) { i += ln; continue; } if( h[ i ] != *n || h[ i+ ln-1 ] != *nl ) continue; if( strncmp( h+i, n, ln ) ) continue; return i; } return 0; } END_C use Time::HiRes qw[ time ]; sub uniq{ my %x; @x{@_} = (); keys %x } my $start = time; my @uniq = uniq <>; chomp @uniq; @uniq = sort{ length $a <=> length $b } @uniq; my $all = join chr(0), @uniq; my $p = 0; for my $x ( @uniq ) { $p += 1+ length $x; next if longCmp( $x, $all, $p ); print $x; } printf STDERR "Took %.3f\n", time() - $start; __END__

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        I seem to be having a problem with Inline. Maybe I just need to get one of the "apple-darwin" files it is looking for. I will keep trying. Thanks for posting your code.

        Starting "make" Stage i686-apple-darwin10-gcc-4.2.1: Module: No such file or directory i686-apple-darwin10-gcc-4.2.1: no input files i686-apple-darwin10-gcc-4.2.1: Module: No such file or directory i686-apple-darwin10-gcc-4.2.1: no input files powerpc-apple-darwin10-gcc-4.2.1: Module: No such file or directory powerpc-apple-darwin10-gcc-4.2.1: no input files lipo: can't figure out the architecture type of: /var/folders/Jx/Jx+cO +TNTFTSKERHsDjE+nU+++TI/-Tmp-//cch73GJV.out make: *** [_906020.o] Error 1 A problem was encountered while attempting to compile and install your + Inline C code. The command that failed was: make
Re^4: list of unique strings, also eliminating matching substrings
by lindsay_grey (Novice) on May 29, 2011 at 22:51 UTC
    7 hours! starting with about 200,00 strings. Thanks!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://906044]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (5)
As of 2021-06-14 06:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    What does the "s" stand for in "perls"? (Whence perls)












    Results (60 votes). Check out past polls.

    Notices?