Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Re^7: Words in Words

by choroba (Abbot)
on Oct 01, 2011 at 13:43 UTC ( #929033=note: print w/ replies, xml ) Need Help??


in reply to Re^6: Words in Words (Updated)
in thread Words in Words

I slightly modified my script:

#!/usr/bin/perl use feature 'say'; use warnings; use strict; my $file = 'words.txt'; open my $IN, '<', $file or die "$!"; my %words; while (my $word = <$IN>) { chomp $word; undef $words{$word}; } my %reported; for my $word (keys %words) { my $length = length $word; for my $pos (0 .. $length - 1) { my $skip_itself = ! $pos; for my $len (1 .. $length - $pos - $skip_itself) { my $subword = substr($word, $pos, $len); next if exists $reported{$subword}; next if $word eq $subword . q{s} or $word eq $subword . q{'s}; if (exists $words{$subword}) { say "$subword"; undef $reported{$subword}; } } } }
I used english.0 from this archive as words.txt: http://downloads.sourceforge.net/wordlist/ispell-enwl-3.1.20.zip. Your script took 58s, whilest mine only 6s (on Pentium 4, 2.8 GHz). The results were different, though: your output contains the word indistinguishableness that mine does not; my list contained 911 more words than yours (e.g. you, wraps or tribe's).


Comment on Re^7: Words in Words
Select or Download Code
Re^8: Words in Words
by BrowserUk (Pope) on Oct 01, 2011 at 14:55 UTC
    your output contains the word indistinguishableness

    Does your word list contain the "word": indistinguishablenessess?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      The "word" is in words.txt. But a strange thing - on the other machine I'm working with now, the word is not in the output of your script.

        The last version of my script with the sort by length optimisation is broken. It starts each search into the big string in slightly the wrong places. Hence why it doesn't find as many words as it should. It could be fixed, but it will never be as quick as your approach.

        More intriguing is if your list contains 'indistinguishablenessess', & 'indistinguishableness', why doesn't yours output the latter? I didn't expend any time on it, but I couldn't see how it could miss.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re^8: Words in Words
by BrowserUk (Pope) on Oct 01, 2011 at 19:01 UTC

    Congratulations! You have the hands down winner as far as I can see.

    Hash lookup beats searching every time, but the vision to invert the logic so the lookup is possible, is quite brilliant IMO.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^8: Words in Words
by sarchasm (Acolyte) on Oct 02, 2011 at 20:28 UTC

    WOW!

    I ran one of the other scripts and it took just under 24 hours to complete and I didn't get the answer I was expecting.

    Your script ran in 40 seconds and gave me exactly what I was looking for!

    Would you be willing to explain how this works? I get the declaration of the hash, the while loop to load the file (not sure what "undef $words{$word};" does) but the rest is pure magic!

    Thank you so much for putting together this solution...I am truely blown away.

    I tried to do this using T-SQL when I first encountered the problem but that was taking forever. Then I "tried" to use PERL but had way too many questions to get it to do what I needed. Your solution is awesome!

    Thanks again!
      OK. In the while loop, I create a hash whose keys are the words from the list (there is no value, that's why the undef).

      Then, I go through the words one by one. For each word, I try all the positions and all possible lengths of its subwords (and I skip the maximal length at position 0, because that would breake the rule #2). For each subword, I do nothing if it has already been printed out (each word should be reported just once). I do nothing if rules #3 or #4 apply. Otherwise, I check whether the subword is itself on the list of words. If it is, I report it and book it as such. And that's it.

      The basic idea was this: Comparing each word to all other words would take ages. There would be many comparisons of words that are totally incompatible. How can I reduce the number of comparisons? I do not need all the words, I only need those that are possible for the given word.

      As I read the code know, I think it might be optimized a bit further. Instead of caching the reported subwords, you can cache the tested ones (i.e. move the undef three lines up, before the "if"). %reported should be renamed to %checked then.

        Got it!

        The only piece in the code I am wondering about is where you are checking for rules 3 and 4 and use "$subword . q{s}..."

        I know what it is doing but where did the "q" come from?

        Thank you!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://929033]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (10)
As of 2014-09-02 12:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (22 votes), past polls