Re^6: list of unique strings, also eliminating matching substrings

by lindsay_grey (Novice)
on May 30, 2011 at 20:07 UTC


in reply to Re^5: list of unique strings, also eliminating matching substrings
in thread list of unique strings, also eliminating matching substrings

I seem to be having a problem with Inline. Maybe I just need to get one of the "apple-darwin" files it is looking for. I will keep trying. Thanks for posting your code.

Starting "make" Stage i686-apple-darwin10-gcc-4.2.1: Module: No such file or directory i686-apple-darwin10-gcc-4.2.1: no input files i686-apple-darwin10-gcc-4.2.1: Module: No such file or directory i686-apple-darwin10-gcc-4.2.1: no input files powerpc-apple-darwin10-gcc-4.2.1: Module: No such file or directory powerpc-apple-darwin10-gcc-4.2.1: no input files lipo: can't figure out the architecture type of: /var/folders/Jx/Jx+cO +TNTFTSKERHsDjE+nU+++TI/-Tmp-//cch73GJV.out make: *** [_906020.o] Error 1 A problem was encountered while attempting to compile and install your + Inline C code. The command that failed was: make

Replies are listed 'Best First'.
Re^7: list of unique strings, also eliminating matching substrings
by BrowserUk (Patriarch) on May 30, 2011 at 20:24 UTC

    That looks like you do not have Inline::C installed correctly, but I can't help you with that. If that is the case, i.e. if the Inline::C installation tests are failing, then you should post a new thread about it to get help.

    In the interim, you can try this pure Perl version, which is only half as fast as the Inline version but should still be 7 times faster than your current solution. Let me know how you get on.

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    $|++;

    sub uniq{ my %x; @x{@_} = (); keys %x }

    my $start = time;

    my @uniq = uniq <>;
    chomp @uniq;
    @uniq = sort{ length $a <=> length $b } @uniq;

    my $all = join chr(0), @uniq;

    my $p = 0;
    for my $x ( @uniq ) {
        $p += 1 + length $x;
        next if 1 + index $all, $x, $p;    ## Corrected per LanX below.
        print $x;
    }

    printf STDERR "Took %.3f\n", time() - $start;
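    (Presumably, though the post does not spell it out, the script is run with the input file as a command-line argument, e.g. perl dedup.pl strings.txt > unique.txt, with "dedup.pl" only a placeholder name; it reads the strings via <>, prints the surviving strings to STDOUT and reports the elapsed time on STDERR.)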

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I had a similar idea but with some modifications:

      1. starting with the longest string and continuing in descending order

      2. then only appending the non-embeddable strings to $all

      This way $all is on average significantly shorter, and the tests with index should be faster.

      I'm also wondering if the reallocation of new memory when appending to $all could be avoided by starting with a maximal length string and then shortening $all again.

      Maybe uniq() from List::MoreUtils is faster, or the de-duplication could be avoided completely (after sorting, identical strings always appear in a sequence).

      All of this depends heavily on the nature of the unknown data and should only be tested with identical sets... (a rough sketch of points 1 and 2 follows below).
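
      A rough, untested sketch of this idea (the sub name is just a placeholder for illustration):

          # Hypothetical longest-first variant (illustration only, not code from
          # the thread): sort descending by length and only append strings that
          # are not already embedded in the growing $all.
          use strict;
          use warnings;

          sub longest_first_uniq {
              my %seen;
              my @strings = grep { !$seen{$_}++ } @_;                  # de-dupe
              @strings = sort { length $b <=> length $a } @strings;    # longest first

              my $all = '';
              my @keep;
              for my $x ( @strings ) {
                  next if index( $all, $x ) >= 0;   # embedded in an already kept string
                  push @keep, $x;
                  $all .= chr(0) . $x;              # only non-embedded strings grow $all;
              }                                     # the NUL separator stops matches
              return @keep;                         # spanning two kept strings
          }

          # 'abc', 'bcd' and 'bc' are all substrings of 'abcd', so only 'abcd' survives:
          print "$_\n" for longest_first_uniq( qw( abc bcd abcd bc ) );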

      Cheers Rolf

        1. 1. starting with the longest string and continuing in descending order

          I don't get the idea of putting the longest first?

          The idea of putting the shortest first is that you can use the third parameter to index to skip over the shorter strings once you've checked them. Longer strings can never be contained by the shorter ones, and starting the search part way into the string is much cheaper than trimming the shorter ones off the end. (See the small example after this list.)

        2. 2. then only appending the non-embeddable strings to $all

          I do not know what you mean by "non-embeddable" in this context?

        3. I'm also wondering if the reallocation of new memory when appending to $all could be avoided by starting with a maximal length string and then shortening $all again.

          If you mean counting the space required for $all, allocating to that final size and then copying the elements into the string--rather than building it up by appending each element in turn--that is exactly what join does.

        4. Maybe uniq() from List::MoreUtils is faster

          Not in my tests. Mine usually works out ~15% faster.

        5. or could be completely avoided (after sorting identical strings always appear in a sequence)

          That would mean sorting the duplicates. Sorting is O(N log N); de-duping with a hash is O(N). And after the sorting, you'd still need to make a complete pass with grep to remove the dups before joining.
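
        To illustrate point 1, a small self-contained example of how the third argument to index lets the search skip the strings already processed (the data here is made up):

            # Shortest-first + index offset: $p always points just past the
            # current string's own copy in $all, so index never rescans it.
            use strict;
            use warnings;

            my @sorted = ( 'ab', 'cde', 'abcdef' );    # already sorted shortest first
            my $all    = join chr(0), @sorted;         # "ab\0cde\0abcdef"

            my $p = 0;
            for my $x ( @sorted ) {
                $p += 1 + length $x;                   # skip past $x and its separator
                if ( 1 + index $all, $x, $p ) {        # found again later => embedded
                    print "$x is contained in a longer string\n";
                }
                else {
                    print "$x is unique\n";            # only 'abcdef' reaches here
                }
            }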


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      just noticed that index returns -1 for a missing match.

      you say this worked?

          next if index $all, $x, $p;

      did you manipulate $[ somewhere???

      Cheers Rolf

        You're right. I edited rather than c&p :( and forgot my usual 1+.

        next if 1+index $all, $x, $p;

        Code above amended.
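
        (A tiny demonstration of why the 1+ matters: index returns -1 when the substring is absent, and -1 is true in boolean context, so adding 1 maps "not found" to 0, i.e. false, and any genuine hit to a true value.)

            use strict;
            use warnings;

            for my $needle ( 'bc', 'zzz' ) {
                my $pos = index 'abc', $needle;    # 1 for 'bc', -1 for 'zzz'
                print "$needle: ", ( 1 + $pos ? 'found, skip it' : 'not found, keep it' ), "\n";
            }
            # prints:
            #   bc: found, skip it
            #   zzz: not found, keep it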


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
