PerlMonks  

Re^5: Words in Words

by sarchasm (Acolyte)
on Sep 30, 2011 at 23:18 UTC


in reply to Re^4: Words in Words
in thread Words in Words

It looks like both solutions will work!

One thing I just realized from your post about sorting is that you only need to look at words that are longer than the current word (which you are sort of doing). This means that as the program runs, it actually becomes faster at finding the results.

I ran each program for 1 minute and BrowserUk's code produced 320 records. Lotus1's code produced 150. Even though your code appears to run slower, I imagine performance will improve the longer the process runs, because it will have fewer records to look through each time.
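A minimal sketch of the "only look at longer words" idea, not either poster's actual code. The sub name `contained_words` and the tiny word list are made up for illustration, and the plural/possessive filtering from the thread is left out. After sorting by length, each word is only compared against the entries that follow it, so the inner loop shrinks as the outer loop advances.

```perl
use strict;
use warnings;

# Hypothetical helper: return the words that occur inside some strictly
# longer word. Sorting by length first means each word only has to be
# compared against the entries that come after it in the sorted list.
sub contained_words {
    my @words = sort { length($a) <=> length($b) } @_;
    my @found;
    for my $i ( 0 .. $#words ) {
        my $short = $words[$i];
        for my $j ( $i + 1 .. $#words ) {
            my $long = $words[$j];
            # A word of the same length cannot contain $short as a proper part.
            next if length($long) == length($short);
            if ( index( $long, $short ) >= 0 ) {
                push @found, $short;
                last;    # one containing word is enough
            }
        }
    }
    return @found;
}

# Illustrative input only.
print "$_\n" for contained_words(qw(cat concatenate dog ten));
```

This is still a pairwise scan, so it only trims the work rather than changing how it scales.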

I will let the programs run over the weekend to see what I get.

Thank you all for your help. I learned a lot from your examples and suggestions!


Re^6: Words in Words
by BrowserUk (Pope) on Oct 01, 2011 at 00:40 UTC

    Another tweak should improve performance again:

    #! perl -slw
    use strict;

    my @words = do{ local @ARGV = 'words.txt'; <> };
    chomp @words;

    my $all = join ' ', @words;

    my $start = time;
    for my $i ( @words ) {
        while( $all =~ m[ ([^ ]*$i[^ ]*) ]g ) {
            my $j = $1;
            next if $j eq $i or $j eq "${i}s" or $j eq "${i}'s";
            print "$j contains $i";
            last; ## Added
        }
    }
    printf STDERR "Took %d seconds for %d words\n", time() - $start, scalar @words;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^6: Words in Words (Updated)
by BrowserUk (Pope) on Oct 01, 2011 at 10:54 UTC

    Update: Evidently this is a step too far as it produces the wrong results. It could (probably) be fixed, but it will never beat choroba's solution below.

    My final offering. Combining Lotus1's sort by length with my big-string approach and this really flies, beating my previous best by an order of magnitude:

    Ignore!

    #! perl -slw
    use strict;

    my @words = sort{ length($a) <=> length($b) } do{ local @ARGV = 'words.txt'; <> };
    chomp @words;

    my $start = time;
    my $all = join ' ', @words;
    study $all;

    my @offsets;
    for my $l ( 1 .. 20 ) {
        push @offsets, $all =~ m[ ([^ ]{$l}) ] ? $-[0] : $offsets[-1];
    }

    for my $i ( @words ) {
        while( substr( $all, $offsets[ length( $i ) +1 ] ) =~ m[ ([^ ]*$i[^ ]*) ]g ) {
            my $j = $1;
            next if $j eq $i or $j eq "${i}s" or $j eq "${i}'s";
            print $i;
            last;
        }
    }
    printf STDERR "Took %d seconds for %d words\n", time() - $start, scalar @words;

      I slightly modified my script:
      #!/usr/bin/perl
      use feature 'say';
      use warnings;
      use strict;

      my $file = 'words.txt';
      open my $IN, '<', $file or die "$!";

      # Load every word as a hash key; the values are never used.
      my %words;
      while (my $word = <$IN>) {
          chomp $word;
          undef $words{$word};
      }

      my %reported;
      for my $word (keys %words) {
          my $length = length $word;
          for my $pos (0 .. $length - 1) {
              # At position 0, stop one character short so the whole word is never tested.
              my $skip_itself = ! $pos;
              for my $len (1 .. $length - $pos - $skip_itself) {
                  my $subword = substr($word, $pos, $len);
                  next if exists $reported{$subword};
                  # Ignore plain plurals and possessives of the subword.
                  next if $word eq $subword . q{s} or $word eq $subword . q{'s};
                  if (exists $words{$subword}) {
                      say "$subword";
                      undef $reported{$subword};
                  }
              }
          }
      }
      I used english.0 from this archive as words.txt: http://downloads.sourceforge.net/wordlist/ispell-enwl-3.1.20.zip. Your script took 58s, whilst mine took only 6s (on a Pentium 4, 2.8 GHz). The results were different, though: your output contains the word indistinguishableness, which mine does not; my list contained 911 more words than yours (e.g. you, wraps or tribe's).
        your output contains the word indistinguishableness

        Does your word list contain the "word": indistinguishablenessess?



        Congratulations! You have the hands down winner as far as I can see.

        Hash lookup beats searching every time, but the vision to invert the logic so the lookup is possible, is quite brilliant IMO.
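A rough sketch of that inversion, under stated assumptions: the sub name `subwords_of` and the four-word dictionary are hypothetical, and the plural/possessive filter from the posted script is omitted. Instead of searching the whole list for words that contain a given word, it cuts one word into its substrings and asks the hash whether each piece is a known word.

```perl
use strict;
use warnings;

# Hypothetical helper: list every proper substring of $word that is
# itself a key in the dictionary hash (the hash values are ignored).
sub subwords_of {
    my ( $word, $dict ) = @_;
    my %seen;
    my $length = length $word;
    for my $pos ( 0 .. $length - 1 ) {
        for my $len ( 1 .. $length - $pos ) {
            next if $len == $length;    # skip the word itself
            my $sub = substr $word, $pos, $len;
            $seen{$sub} = 1 if exists $dict->{$sub};
        }
    }
    return sort keys %seen;
}

# Illustrative dictionary only; a hash slice creates the keys with undef values.
my %dict;
@dict{ qw(cat ten at concatenate) } = ();

print join( ' ', subwords_of( 'concatenate', \%dict ) ), "\n";    # prints "at cat ten"
```

The number of substrings depends only on the length of the word, not on the size of the word list, which is why the lookup version does so much less work on a large dictionary.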



        WOW!

        I ran one of the other scripts and it took just under 24 hours to complete and I didn't get the answer I was expecting.

        Your script ran in 40 seconds and gave me exactly what I was looking for!

        Would you be willing to explain how this works? I get the declaration of the hash, the while loop to load the file (not sure what "undef $words{$word};" does) but the rest is pure magic!

        Thank you so much for putting together this solution...I am truly blown away.

        I tried to do this using T-SQL when I first encountered the problem but that was taking forever. Then I "tried" to use Perl but had way too many questions to get it to do what I needed. Your solution is awesome!

        Thanks again!
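On the `undef $words{$word};` question above, a small sketch (the sub name `key_state` and the sample keys are made up for illustration): applying `undef` to a hash element creates the key with an undefined value. The script only ever asks `exists`, so the key alone is enough and no value needs to be stored.

```perl
use strict;
use warnings;

# Hypothetical helper: describe the state of a key in a hash.
sub key_state {
    my ( $hash, $key ) = @_;
    return 'missing'           unless exists $hash->{$key};
    return 'present, no value' unless defined $hash->{$key};
    return 'present, defined';
}

my %words;
undef $words{cat};    # creates the key 'cat' with an undefined value

print "cat: ", key_state( \%words, 'cat' ), "\n";    # cat: present, no value
print "dog: ", key_state( \%words, 'dog' ), "\n";    # dog: missing
```

The `%reported` hash in the script is used the same way, as a set of words that have already been printed.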
