
Words in Words

by sarchasm (Acolyte)
on Sep 30, 2011 at 17:51 UTC
sarchasm has asked for the wisdom of the Perl Monks concerning the following question:

Good day! I'm trying to write a script that takes a list of words and checks to see if each word exists within another word in the list. The specifics are as follows:

1) The word list contains 640,000 entries.

2) A word cannot match itself (i.e., "a" cannot match "a").

3) A word cannot match its own plural, even if that makes a different word (i.e., "a" cannot match "as").

4) A word cannot match itself with an apostrophe-s ("'s") (i.e., "a" cannot match "a's", but "a" can match "aa's").

Coming from a mainframe background, I am trying to use loops, but the performance is horrible. From what I've read, hashing seems to be the way to go, but I think I am still implementing this as a loop and getting very poor performance (100 records in 20 seconds).

Here is what I've tried so far:

use warnings;
use strict;

#define constants
my $datapath = "F:\\wordsinwords\\";
my $wordfile = $datapath . "wordlist.txt";

#define variables
my $outrecs = 0;
my $word;

open LOG, ">" . $datapath . "wordsinwords_LOG.txt" or die $!;
select LOG;
$| = 1;

#read the wordlist file into an array
open(WORDFILE, $wordfile) or die $!;
chomp(my @words = (<WORDFILE>));
close(WORDFILE);

#main
#coerce the array into a hash
my %hash = map { $_ => 1 } @words;

#search for matches
#I have no idea how to put a single regex together that could meet
#all of the criteria, so I was going to run this multiple times to
#target specific criteria until I found all permutations.
foreach $word (keys %hash) {
    $outrecs++ if /.+$word/ ~~ %hash;
}

#===============================================
# This was an attempt at using an array
# but it was also very slow
#===============================================
#foreach $word (@words) {
#    $outrecs++ if (my ($found) = grep(/.+?$word/, @words));
#}

#close files & write out completion log
print LOG "Created output file with: " . $outrecs . " records.\n";
close LOG;
Any suggestions would be greatly appreciated. Thank you!

Replies are listed 'Best First'.
Re: Words in Words
by BrowserUk (Pope) on Sep 30, 2011 at 18:57 UTC

    Try this. I project that it should complete your 410 billion comparisons in a little under 10 hours.

    The main attempt at efficiency here is to invoke the regex once in global mode (/g) for each word, against a single large string containing all the words, and have it return all the matches. It then filters the matches against your specific exclusions.

    #! perl -slw
    use strict;

    my @words = do{ local @ARGV = 'words.txt'; <> };
    chomp @words;
    my $all = join ' ', @words;

    my $start = time;
    my $n = 0;
    for my $i ( @words ) {
        for my $j ( $all =~ m[ ([^ ]*$i[^ ]*) ]g ) {
            next if $j eq $i or $j eq "${i}s" or $j eq "${i}'s";
            # print "$j contains $i";
        }
    }
    printf STDERR "Took %d seconds for %d words\n", time() - $start, scalar @words;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Your approach does n^2 comparisons with n = (number of words in the list). I took a different approach of sorting the list and then comparing each word only with the words that follow it in the array. This cuts the comparisons in half when n gets large, as in this case. I'm curious how grep with the regex will stand up against your approach.

      As for the extra overhead of sorting the word list, my guess is that the sorts at the beginning pay off as n gets large. What do you think? I have been wanting to try out benchmarking for some time after seeing you use it so much. This could be a good opportunity.

      use warnings;
      use strict;

      open LOG, ">", 'wordsinwords_LOG.txt' or die $!;

      #Sort alphabetically first so identical words end up together.
      chomp(my @words = sort { length($a) <=> length($b) } sort <DATA>);
      @words = map { lc } @words;

      for (my $i = 0; $i < $#words; $i++) {
          my $word = $words[$i];
          next if $word eq $words[$i+1];
          my @matched = grep { /$word/ and $_ ne "${word}s" and $_ ne "${word}'s" } @words[$i+1 .. $#words];
          print LOG "$word => @matched\n" if @matched;
      }

      __DATA__
      at
      Ate
      crate
      tree
      Trip
      tread
      read
      ad
      at
      ads
      crate's

      Log file contains:

      ad => read tread
      at => ate crate crate's
      ate => crate crate's
      read => tread

      This is interesting. SCOWL (Spell Checker Oriented Word Lists) has 652,000 words plus a range of smaller lists.

        Sorry, but only comparing words with those that come after it alphabetically doesn't work.

        For example, the word 'call' appears in all these words that precede it alphabetically:

        abiogenically abiotically academically acerbically achromatically acoustically acrobatically acronymically acrostically actinically adiabatically adrenergically aerobically aerodynamically aeronautically aesthetically agonistically agronomically alchemically alcoholically algebraically algorithmically allegorically allopatrically allosterically allotypically alogically alphabetically altruistically amitotically anacoluthically anaerobically anagogically analogically analytically anaphorically anarchically anatomically anchoritically anecdotically anemically anesthetically angelically animatronically anisotropically anodically antibiotically antically antidromically antigenically antiseptically antithetically aoristically apathetically aperiodically aphetically aphoristically apically apocalyptically apodictically apolitically apologetically apomictically apoplectically aposematically apotropaically aquatically archaically arctically arithmetically aromatically arthritically artistically ascetically aseptically asthmatically astrologically astronautically astronomically astrophysically asymmetrically asymptotically asyndetically atavistically atheistically athletically atmospherically atomically atomistically atypically authentically autistically autocratically autographically automatically autonomically autotrophically axenically axiologically axiomatically ballistically barbarically barometrically basically bathetically bathymetrically beatifically biblically biochemically biogenetically biographically biologically biomechanically birdcall birdcalls bolometrically bombastically botanically buccally bucolically caecally call

        And the word 'the' appears in 1300 words that precede it in my 178,000 word dictionary.


        SCOWL is the word list that I am using! I just combined the 256 files and pulled a distinct list of words.

        I am starting to look at trying the code you provided, but I need to figure out where you are defining what is loaded into <DATA>.

        I will let you know what I find

        Thank you.
Re: Words in Words
by thezip (Vicar) on Sep 30, 2011 at 18:06 UTC

    While I'm not an expert on tries, there seems to be a lot of potential there for your stated problem. Give it a look.

    Here's a trie implementation in Perl.
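    For the curious, here is a bare-bones hash-of-hashes trie sketch (just an illustration of the data structure, not the linked implementation; insert and exact lookup only):

    ```perl
    use strict;
    use warnings;

    # Minimal trie: each node is a hash keyed by character, with a
    # special '' key marking the end of a complete word.
    my %trie;

    sub trie_insert {
        my ($word) = @_;
        my $node = \%trie;
        for my $ch (split //, $word) {
            $node = $node->{$ch} //= {};
        }
        $node->{''} = 1;    # mark a complete word
    }

    sub trie_has {
        my ($word) = @_;
        my $node = \%trie;
        for my $ch (split //, $word) {
            return 0 unless $node = $node->{$ch};
        }
        return exists $node->{''};
    }

    trie_insert($_) for qw(at ate crate);
    print trie_has('ate') ? "yes\n" : "no\n";    # yes
    print trie_has('cr')  ? "yes\n" : "no\n";    # no: 'cr' is only a prefix
    ```

    A real solution for this problem would want a suffix-tree variant, since the task is substring search rather than prefix search; this only shows the shape of the structure.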

    What can be asserted without proof can be dismissed without proof. - Christopher Hitchens
      yes searching in Suffix_tree is O(1), but constructing those trees implies a overhead.

      Since in our case all searched strings are also known in advance, I'm not sure that this is really the fastest approach.

      At least the construction process will already identify embedded strings, since they are leaves of the suffix tree.

      Cheers Rolf

Re: Words in Words
by ww (Bishop) on Sep 30, 2011 at 19:29 UTC

    Re Line 23, you MAY speed things up a little by reading $wordfile directly into a hash. Searching Google for +Perl +'file to hash' produces some relevant results in prior discussions here in the Monastery. Strike the site specification and you'll find some other sources (albeit, of unknown reliability).
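    A minimal sketch of that (an in-memory filehandle stands in for $wordfile here, one word per line):

    ```perl
    use strict;
    use warnings;

    # Build the hash while reading the file, instead of
    # file -> array -> hash in two passes.
    my $sample = "at\nate\ncrate\n";    # stand-in for the real word file
    open my $fh, '<', \$sample or die $!;
    my %words;
    while (my $line = <$fh>) {
        chomp $line;
        $words{$line} = 1;
    }
    close $fh;
    print scalar(keys %words), "\n";    # 3
    ```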

    Even with a word list of the size you're using, that runtime seems far too high. Is the box you're writing this on reasonably current (fast?), and what version of Perl are you using?

    Counting the first para above, you now have three reasonable alternatives (update: and one very reasonable question about the precision of your spec) ... so the rest of this node will focus on some nits.

    Line 4 seems to reflect a view that $datapath and $wordfile are constants. While you're using them within the scope of the conventional meaning of "constant," they aren't constants in the sense of allowing the compiler to optimize.

    Lines 9 and 10 declare global variables... not, IMO, a major problem here, but in more complex programs it's wise to declare variables in such a way as to minimize their scope (for more on this, try perldoc -q scope at your CLI).

    Your comment at Line 26 reflects what is actually ( IMO, usually [...additional qualifiers may be required] ) a better (if not "best") practice. The more complex you make your regex, the more opportunities you have to obscure a logic problem and/or to create something that sets the regex engine thrashing.

      I'm using Perl 5.14.1 on Windows 2008R2 with an i7 processor and 16 GB of RAM. It should be plenty fast if the correct solution is applied. I will look into the nits (which I appreciate you calling out).
Re: Words in Words
by choroba (Chancellor) on Sep 30, 2011 at 19:53 UTC
    Did I understand points 1) and 2) correctly? This script finishes a list of almost 640,000 entries in less than a minute (50 sec); adding conditions 3) and 4) should be really easy.
    #!/usr/bin/perl
    use feature 'say';
    use warnings;
    use strict;

    my $file = '/etc/dictionaries-common/words';
    open my $IN, '<', $file or die "$!";
    my %words;
    while (my $word = <$IN>) {
        chomp $word;
        undef $words{$word};
    }

    for my $word (keys %words) {
        my $length = length $word;
        my %found;    # report each occurrence just once
        for my $pos (0 .. $length - 1) {
            my $skip_itself = ! $pos;
            for my $len (1 .. $length - $pos - $skip_itself) {
                my $subword = substr($word, $pos, $len);
                next if exists $found{$subword};
                if (exists $words{$subword}) {
                    say "$subword in $word";
                    undef $found{$subword};
                }
            }
        }
    }
    Update: if I omit the "just once" condition, the script finishes in 40 secs on my Mac Mini.

      I may not have clarified the specs as well as I had hoped.

      The results should be a distinct list of the words in the list that can be found contained within another word in the list taking into consideration the exclusions I listed.

      When I run this code, it returns words that are not contained in the wordlist. They are just pieces of the word.

      I will say that it is pretty fast though and I may be able to use this technique to get what I am looking for.

      Thank you.
        When I run this code, it returns words that are not contained in the wordlist. They are just pieces of the word.
        Have you changed the path to the input file? The code should only print words from the list!
Re: Words in Words
by jdporter (Canon) on Sep 30, 2011 at 18:59 UTC
    checks to see if each word exists within another word in the list

    Impossible; such a list would be circularly containing! Therefore, the answer is always false. O(1)

    Or did you mean, checks each word to see whether it exists within another word in the list?

      Yes. I do mean the program should, as you stated, "checks each word to see whether it exists within another word in the list"
Re: Words in Words
by chrestomanci (Priest) on Sep 30, 2011 at 20:23 UTC

    Instead of using a regular expression to check if one word is a substring of another, why not use index instead?

    It is a simple function that returns the offset of the shorter string within the longer one if it is a substring, or -1 if it is not. It's the kind of thing you expect to find in the string libraries of programming languages without regular expression support, but so rarely used by Perl programmers that most of us have forgotten the function exists.
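    A sketch of how that would slot into the exclusion rules from the original post (the word list here is just a toy sample):

    ```perl
    use strict;
    use warnings;

    # index($haystack, $needle) returns the position of $needle
    # inside $haystack, or -1 if it does not occur.
    my @words = qw(at ate crate tread);
    my $count = 0;
    for my $needle (@words) {
        for my $haystack (@words) {
            # the OP's three exclusions
            next if $haystack eq $needle
                 or $haystack eq "${needle}s"
                 or $haystack eq "${needle}'s";
            $count++ if index($haystack, $needle) >= 0;
        }
    }
    print "$count containments\n";    # 3: at-in-ate, at-in-crate, ate-in-crate
    ```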

      is index faster than regexp for fixed text token? makes it look like index doesn't help much, but I only took a quick look at that long discussion. My guess is the comparison with -1 after every index eats up any savings in search time. Plus the regex looks simpler. (update: simpler to Perl programmers, anyway ;)
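      One way to settle it is the core Benchmark module; a sketch of the comparison (timings will vary by machine, and a real test should use the full word list):

      ```perl
      use strict;
      use warnings;
      use Benchmark qw(cmpthese);

      my @words    = qw(at ate crate tread read ad ads);
      my $haystack = join ' ', @words;
      my $needle   = 'at';

      # Run each candidate for about one CPU second and compare rates.
      cmpthese(-1, {
          'index' => sub { my $pos   = index($haystack, $needle); },
          'regex' => sub { my ($hit) = $haystack =~ /\Q$needle\E/; },
      });
      ```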

      index is normally slower than a simple regex, because regexes are optimized by pre-calculating jump tables ... (I forgot the name of the algorithm ...).

      Cheers Rolf

      UPDATE: see Boyer–Moore_string_search_algorithm

      UPDATE2: choroba found it, too! :)

        Something like Boyer-Moore?
Re: Words in Words
by aartist (Scribe) on Oct 01, 2011 at 01:04 UTC
    What is the purpose of this exercise? Maybe it is an XY problem and your needs can be fulfilled in a better way. Even if you find a way to do what you have asked, what are you going to do with the results? If we know the before and after, we may be able to direct you to a better answer.

      This was simply a challenge put out by a friend.

      I tried to make this work in a couple of the languages I use most often (T-SQL, COBOL, and C#), but I was using a looping method that just was not very efficient. I tried to write something in Perl because I've used it to split files and for very minor file manipulation, but again my approach was loop-based. One of my friends suggested using a hash, but my implementation of the hash was to find results using grep and then loop through.

      Finally I realized that I couldn't come up with an appropriate solution and sought the wisdom of the experts and it has been a tremendous learning experience.

      As for the use, this was simply an exercise to get the correct answer with the most efficient code. I was able to get the correct answer but my runtimes were longer than a day. One of my friends said he solved it with hashing and his program ran in 5 minutes.

      I had the answer but wanted to see the most efficient way of solving this problem.

      Thank you for your participation!

Re: Words in Words
by Limbic~Region (Chancellor) on Oct 01, 2011 at 23:29 UTC
Re: Words in Words
by LanX (Chancellor) on Oct 01, 2011 at 13:25 UTC

    > 3) A word cannot match itself as a plural even if it makes a different word (ie: "a" cannot match "as")

    Please define "plural"!

    In English, "children" is the plural of "child", and "mice" of "mouse".

    And IMHO there are examples where WORDs is not the plural of WORD.

    (UPDATE: Like "is" and "I")

    Cheers Rolf

      I should have provided a simple example because as you pointed out nothing is simple with English.

      What I meant to say is:

      "Word1" cannot equal "Word1"
      "Word1" cannot equal "Word1" + "s"
      "Word1" cannot equal "Word1" + "'s"

      That's it!
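      Stated that way, the rules fit in one small predicate; a sketch (the sub name is mine):

      ```perl
      use strict;
      use warnings;

      # True if $inner counts as "contained in" $outer under the rules:
      # it occurs inside $outer, and $outer is not $inner itself,
      # $inner . "s", or $inner . "'s".
      sub counts_as_match {
          my ($inner, $outer) = @_;
          return 0 if $outer eq $inner
                   or $outer eq "${inner}s"
                   or $outer eq "${inner}'s";
          return index($outer, $inner) >= 0;
      }

      print counts_as_match('a', "a's")  ? 1 : 0;        # 0: excluded
      print counts_as_match('a', 'as')   ? 1 : 0;        # 0: excluded
      print counts_as_match('a', "aa's") ? 1 : 0, "\n";  # 1: allowed
      ```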
Re: Words in Words
by perlaintdead (Scribe) on Sep 30, 2011 at 19:29 UTC

    Do some research on regular expressions.