Re^8: list of unique strings, also eliminating matching substrings

by LanX (Saint)
on Jun 02, 2011 at 15:05 UTC


in reply to Re^7: list of unique strings, also eliminating matching substrings
in thread list of unique strings, also eliminating matching substrings

I had a similar idea but with some modifications:

1. starting with the longest string and continuing in descending order

2. then only appending the non-embeddable strings to $all

This way $all is on average significantly shorter, and the tests with index should be faster.
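For illustration, a minimal untested sketch of what I mean (the names @strings/@keep are placeholders; strings assumed chomped, with no embedded newlines):

    my @sorted = sort { length $b <=> length $a } @strings;  # longest first
    my $all    = shift @sorted;       # the longest can't be embedded anywhere
    my @keep   = ($all);
    for my $s (@sorted) {
        next if index($all, $s) >= 0; # embedded in a kept string: drop it
        push @keep, $s;
        $all .= "\n" . $s;            # only survivors grow $all
    }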

I'm also wondering if the reallocation of memory when appending to $all could be avoided by starting with a maximal-length string and then shortening $all again at the end.
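One way that might work (again an untested sketch, reusing the hypothetical @sorted from above): preallocate $all at its worst-case size, fill it in place with four-arg substr, and truncate once at the end.

    my $cap = 0;
    $cap += 1 + length $_ for @sorted;   # worst case: every string survives
    my $all = "\0" x $cap;               # one allocation up front
    my $pos = 0;
    for my $s (@sorted) {
        next if index($all, $s) >= 0;          # NUL padding can't false-match
        substr($all, $pos, length $s, $s);     # write in place, no realloc
        $pos += length $s;
        substr($all, $pos++, 1, "\n");         # separator
    }
    substr($all, $pos) = '';             # shrink to the used portion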

Maybe uniq() from List::MoreUtils is faster, or it could be avoided completely (after sorting, identical strings always appear in sequence).
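On that last point: a secondary lexical sort key both keeps the descending-length order and makes duplicates adjacent, so the de-dup becomes one neighbour comparison per string (untested sketch):

    my @sorted = sort { length $b <=> length $a or $a cmp $b } @strings;
    my @uniq;
    for my $s (@sorted) {
        push @uniq, $s if !@uniq or $s ne $uniq[-1];
    }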

All of this depends heavily on the nature of the unknown data, so the variants should only be benchmarked against identical data sets...

Cheers Rolf

Re^9: list of unique strings, also eliminating matching substrings
by BrowserUk (Patriarch) on Jun 03, 2011 at 05:37 UTC

    1. starting with the longest string and continuing in descending order

      I don't get the idea of putting the longest first?

      The idea of putting the shortest first is that you can use the third parameter to index to skip over the shorter strings you've already checked. Longer strings can never be contained by the shorter ones, and starting the search part way into the string is much cheaper than trimming the shorter ones off the end. (A sketch of this follows after the list.)

    2. then only appending the non-embeddable strings to $all

      I do not know what you mean by "non-embeddable" in this context?

    3. I'm also wondering if the reallocation of new memory when appending to $all could be avoided by starting with a maximal length string and then shortening $all again.

      If you mean counting the space required for $all, allocating to that final size and then copying the elements into the string--rather than building it up by appending each element in turn--that is exactly what join does. (A contrast sketch follows after the list.)

    4. Maybe uniq() from List::MoreUtils is faster

      Not in my tests. Mine usually works out ~15% faster.

    5. or could be completely avoided (after sorting identical strings always appear in a sequence)

      That would mean sorting the duplicates. Sorting is O(N log N); de-duping with a hash is O(1) per item, O(N) overall. And after the sorting, you'd still need to make a complete pass with grep to remove the dups before joining. (The one-pass hash de-dup is sketched after the list.)
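      For point 1, a minimal reconstruction of the shortest-first scan (illustrative only, not the actual 906020 code; @strings is a placeholder, assumed unique and newline-free): join everything shortest-first, track where each string starts, and let index's third argument begin the search just past the string itself.

          my @sorted = sort { length $a <=> length $b } @strings; # shortest first
          my $all    = join "\n", @sorted;   # one big haystack, built once
          my @keep;
          my $pos = 0;                       # offset of the current string in $all
          for my $s (@sorted) {
              my $next = $pos + length($s) + 1;  # skip past $s and its "\n"
              push @keep, $s if index($all, $s, $next) < 0;  # not embedded later
              $pos = $next;
          }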
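      And point 3 in code form (hypothetical @keep): the append loop reallocates the accumulator as it grows; join produces the same string with a single allocation.

          my $grown = '';
          $grown .= "\n" . $_ for @keep;      # may realloc/copy $grown many times
          my $joined = join "\n", '', @keep;  # same result, one allocation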
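      Finally, the one-pass hash de-dup contrasted with sorting in point 5 (the same idea as the uniq() in the code further down):

          sub uniq { my %seen; grep { !$seen{$_}++ } @_ }  # O(1) test per item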


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      hmm ... actually I didn't want to invest time coding for this xy-nonsense.

      But maybe it's easier to show code instead of explaining the theory.

      (UPDATE-06-03 13:37 GMT: fixed bug in initialization and missing uniqifying)

      --- Init ...
      ... completed: 2000 snippets from 200 to 400 of 6000 long DNA
      Check_all     took 1.206  Filtered: 287   Ratio: 6.96864111498258
      Check_longest took 0.436  Filtered: 287   Ratio: 6.96864111498258
      Faster: 2.76641019482413
      --- Init ...
      ... completed: 2000 snippets from 200 to 400 of 60000 long DNA
      Check_all     took 2.990  Filtered: 924   Ratio: 2.16450216450216
      Check_longest took 1.908  Filtered: 924   Ratio: 2.16450216450216
      Faster: 1.56725918185693
      --- Init ...
      ... completed: 2000 snippets from 200 to 400 of 600000 long DNA
      Check_all     took 4.462  Filtered: 1785  Ratio: 1.12044817927171
      Check_longest took 4.058  Filtered: 1785  Ratio: 1.12044817927171
      Faster: 1.09963952407683
      --- Init ...
      ... completed: 2000 snippets from 200 to 400 of 6000000 long DNA
      Check_all     took 4.601  Filtered: 1977  Ratio: 1.01163378856854
      Check_longest took 4.651  Filtered: 1977  Ratio: 1.01163378856854
      Faster: 0.989378646160973

      Cheers Rolf

        hmm ... actually I didn't want to invest time coding for this xy-nonsense.

        Sorry, but no one forced you to. You asked questions, I did my best to answer them. That's all.

        FWIW: Converted to a form that allows it to be used in a realistic and repeatable scenario:

        #! perl -slw
        use strict;
        use Time::HiRes qw[ time ];

        $|++;

        # de-dup via hash slice; survivor order is irrelevant here
        sub uniq{ my %x; @x{@_} = (); keys %x }

        my $start = time;

        my @uniq = uniq <>;
        chomp @uniq;

        # longest first, so no kept string can be contained in a later one
        @uniq = sort{ length $b <=> length $a } @uniq;

        # seed the accumulator with the longest string
        # (the seed itself is never printed -- likely the out-by-one noted below)
        my $longest = shift @uniq;

        for my $x ( @uniq ) {
            next if 1+ index $longest, $x;   # found in a kept string: drop it
            print $x;
            $longest .= "\n" . $x;
        }

        printf STDERR "Took %.3f\n", time() - $start;

        Whilst you're ~10% quicker on low numbers of strings:

        c:\test>906020 906020.10e3 > 906020.filtered
        Took 48.854

        c:\test>wc -l 906020.filtered
        5000 906020.filtered

        c:\test>906020-lanx 906020.10e3 > lanx.filtered
        Took 43.122

        c:\test>wc -l lanx.filtered
        4999 lanx.filtered

        c:\test>906020 906020.10e3 > 906020.filtered    (inline version)
        Took 21.744

        c:\test>wc -l 906020.filtered
        5000 906020.filtered

        As your own timings show, as the number of strings increases, the cost of constantly reallocating your accumulator string in order to append the new one starts to dominate. I suspect that by the time you get to the OP's 200,000 strings you're going to be considerably slower. (You also have an out-by-one error somewhere, but that is probably easily fixed.)


