Re: Words in Words
by BrowserUk (Patriarch) on Sep 30, 2011 at 18:57 UTC
|
Try this. I project that it should complete your 410 billion comparisons in a little under 10 hours.
The main attempt at efficiency here is to invoke the regex once in global mode (/g) for each word, against a single large string containing all the words and have it return all the matches. It then filters just the matching ones for your specific exclusions.
#! perl -slw
use strict;
my @words = do{ local @ARGV = 'words.txt'; <> };
chomp @words;
my $all = join ' ', @words;
my $start = time;
my $n = 0;
for my $i ( @words ) {
for my $j ( $all =~ m[ ([^ ]*$i[^ ]*) ]g ) {
next
if $j eq $i
or $j eq "${i}s"
or $j eq "${i}'s";
# print "$j contains $i";
}
}
printf STDERR "Took %d seconds for %d words\n",
time() - $start, scalar @words;
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [Watch: Dir/Any] [d/l] |
|
Your approach does n^2 comparisons with n=(number of words in list). I took a different aproach of sorting the list then comparing each word only with the following words in the array. This cuts the comparisons by half when n gets large as in this case. I'm curious how grep with the regex will stand up against your approach.
As for the extra overhead for sorting the word list my guess is the extra sorts at the beginning pay off as n gets large. What do you think? I have been wanting to try out benchmarking for some time after seeing you use it so much. This could be a good opportunity.
use warnings;
use strict;
open LOG, ">", 'wordsinwords_LOG.txt' or die $!;
#Sort alphabetically first so identical words end up together.
chomp (my @words = sort { length($a) <=> length($b) } sort <DATA>);
@words = map {lc} @words;
for (my $i=0; $i<$#words; $i++) {
my $word = $words[$i];
next if $word eq $words[$i+1];
my @matched = grep {/$word/ and $_ ne "${word}s" and $_ ne "${word
+}'s" } @words[$i+1..$#words];
print LOG "$word => @matched\n" if @matched;
}
__DATA__
at
Ate
crate
tree
Trip
tread
read
ad
at
ads
crate's
Log file contains:
ad => read tread
at => ate crate crate's
ate => crate crate's
read => tread
This is interesting. http://wordlist.sourceforge.net/ SCOWL (Spell Checker Oriented Word Lists) has 652,000 words plus a range of smaller lists. | [reply] [Watch: Dir/Any] [d/l] [select] |
|
Sorry, but only comparing words with those that come after it alphabetically doesn't work.
For example, the work 'call' appears in all these words that precede it alphabetically:
abiogenically
abiotically
academically
acerbically
achromatically
acoustically
acrobatically
acronymically
acrostically
actinically
adiabatically
adrenergically
aerobically
aerodynamically
aeronautically
aesthetically
agonistically
agronomically
alchemically
alcoholically
algebraically
algorithmically
allegorically
allopatrically
allosterically
allotypically
alogically
alphabetically
altruistically
amitotically
anacoluthically
anaerobically
anagogically
analogically
analytically
anaphorically
anarchically
anatomically
anchoritically
anecdotically
anemically
anesthetically
angelically
animatronically
anisotropically
anodically
antibiotically
antically
antidromically
antigenically
antiseptically
antithetically
aoristically
apathetically
aperiodically
aphetically
aphoristically
apically
apocalyptically
apodictically
apolitically
apologetically
apomictically
apoplectically
aposematically
apotropaically
aquatically
archaically
arctically
arithmetically
aromatically
arthritically
artistically
ascetically
aseptically
asthmatically
astrologically
astronautically
astronomically
astrophysically
asymmetrically
asymptotically
asyndetically
atavistically
atheistically
athletically
atmospherically
atomically
atomistically
atypically
authentically
autistically
autocratically
autographically
automatically
autonomically
autotrophically
axenically
axiologically
axiomatically
ballistically
barbarically
barometrically
basically
bathetically
bathymetrically
beatifically
biblically
biochemically
biogenetically
biographically
biologically
biomechanically
birdcall
birdcalls
bolometrically
bombastically
botanically
buccally
bucolically
caecally
call
And the word 'the' appears in 1300 words that precede it in my 178,000 word dictionary.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [Watch: Dir/Any] [d/l] |
|
|
|
|
|
|
SCOWL is the wordlist that I am using!. I just combined the 256 files and pulled a distinct list of words
I am starting to look at trying the code you provided but need to figure out where you are defining what is loaded into <DATA>
I will let you know what I find
Thank you.
| [reply] [Watch: Dir/Any] |
Re: Words in Words
by thezip (Vicar) on Sep 30, 2011 at 18:06 UTC
|
While I'm not an expert on tries, there seems to be a lot of potential there for your stated problem. Give it a look.
Here's a trie implementation in Perl.
What can be asserted without proof can be dismissed without proof. - Christopher Hitchens
| [reply] [Watch: Dir/Any] |
|
yes searching in Suffix_tree is O(1), but constructing those trees implies a overhead.
Since in our case all searched strings are also known in advance, I'm not sure that this is really the fastest approach.
At least the construction process will already identify embedded strings, since they are leaves of the suffix tree.
| [reply] [Watch: Dir/Any] |
Re: Words in Words
by ww (Archbishop) on Sep 30, 2011 at 19:29 UTC
|
Re Line 23, you MAY speed things up a little by reading $wordfile directly into a hash. Searching Google for site:PerlMonks.org +Perl +'file to hash' produces some relevant results in prior discussions here in the Monastery. Strike the site specification and you'll find some other sources (albeit, of unknown reliability).
Even with a wordlist of the size you're using, that runtime seems far to the high-side. Is the box upon which you're writing this reasonably current (fast?) and what version of Perl are you using?
Counting the first para above, you now have three reasonable alternatives (update: and one very reasonable question about the precision of your spec </update>) ... so the rest of this node will focus on some nits.
Line 4 seems to reflect a view that $datapath and $wordfile are constants. While you're using them within the scope of the conventional meaning of "constant," they aren't in the sense of allowing the compiler to optomize.
Lines 9 and 10 declare global variables... not, IMO, as major problem here, but in more complex programs, it's wise to declare them in such a way as to minimize their scope (for more on this, try perldoc -q scope at your CLI).
Your comment at Line 26 reflect what is actually ( IMO, usually [...additional qualifiers may be required] ) a better (if not "best") practice. The more complex you make your regex, the more opportunities you'll have to obscure a logic problem and/or to create something that sets the regex engine thrashing.
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
I'm using Perl 5.14.1 on Windows 2008R2 i7 processor with 16 gigs of RAM. It should be plenty fast if the correct solution is applied.
I will look into the nits (which I appreciate you calling out).
| [reply] [Watch: Dir/Any] |
Re: Words in Words
by choroba (Cardinal) on Sep 30, 2011 at 19:53 UTC
|
Did I understand points 1) and 2) correctly? This script finishes a list of almost 640,000 entries in less then a minute (50sec), adding conditions 3) and 4) should be really easy.
#!/usr/bin/perl
use feature 'say';
use warnings;
use strict;
my $file = '/etc/dictionaries-common/words';
open my $IN, '<', $file or die "$!";
my %words;
while (my $word = <$IN>) {
chomp $word;
undef $words{$word};
}
for my $word (keys %words) {
my $length = length $word;
my %found; # report each occurence just once
for my $pos (0 .. $length - 1) {
my $skip_itself = ! $pos;
for my $len (1 .. $length - $pos - $skip_itself) {
my $subword = substr($word, $pos, $len);
next if exists $found{$subword};
if (exists $words{$subword}) {
say "$subword in $word";
undef $found{$subword};
}
}
}
}
Update: if I ommit the "just once" condition, the script finishes in 40 secs on my Mac Mini. | [reply] [Watch: Dir/Any] [d/l] |
|
I may not have clarified the specs as well as I had hoped.
The results should be a distinct list of the words in the list that can be found contained within another word in the list taking into consideration the exclusions I listed.
When I run this code, it returns words that are not contained in the wordlist. They are just pieces of the word.
I will say that it is pretty fast though and I may be able to use this technique to get what I am looking for.
Thank you.
| [reply] [Watch: Dir/Any] |
|
When I run this code, it returns words that are not contained in the wordlist. They are just pieces of the word.
Have you changed the path to the input file? The code should only print words from the list!
| [reply] [Watch: Dir/Any] |
|
|
Re: Words in Words
by jdporter (Chancellor) on Sep 30, 2011 at 18:59 UTC
|
checks to see if each word exists within another word in the list
Impossible; such a list would be circularly containing! Therefore, the answer is always false. O(1)
Or did you mean, checks each word to see whether it exists within another word in the list?
| [reply] [Watch: Dir/Any] |
|
Yes. I do mean the program should, as you stated, "checks each word to see whether it exists within another word in the list"
| [reply] [Watch: Dir/Any] |
Re: Words in Words
by chrestomanci (Priest) on Sep 30, 2011 at 20:23 UTC
|
Instead of using regular expresson to check if one word is a substring of another, why not use index instead.
It is a simple function that returns the offset of the shorter string within the longer one if it is a substring, or -1 it it is not. The kind of thing you expect to find in the string libaries of programing languages without regular expresson support, but so rarely used by perl programmers that most of us have forgotten the function exists.
| [reply] [Watch: Dir/Any] |
|
| [reply] [Watch: Dir/Any] |
|
| [reply] [Watch: Dir/Any] |
|
Something like Boyer-Moore?
| [reply] [Watch: Dir/Any] |
Re: Words in Words
by Limbic~Region (Chancellor) on Oct 01, 2011 at 23:29 UTC
|
| [reply] [Watch: Dir/Any] |
Re: Words in Words
by aartist (Pilgrim) on Oct 01, 2011 at 01:04 UTC
|
What is the purpose of this exercise? May be it is an XY problem and your needs can be fulfilled in a better way.
Even if you find a way to do what you have asked, what you are going to do with the results? If we know before and after, we may direct you to a better version of answer.
| [reply] [Watch: Dir/Any] |
|
This was simply a challenge put out by a friend.
I tried to make this work in a couple of the languages I use most often (T-SQL, COBOL, and C#) but I was using a looping method that just was not very efficient. I tried to write something in PERL because I've used it to split files and for very minor file manipulation but again my approach was looping based. One of my friends suggested using a hash but my implementation of the hash was to find results using grep and then loop through.
Finally I realized that I couldn't come up with an appropriate solution and sought the wisdom of the experts and it has been a tremendous learning experience.
As for the use, this was simply an exercise to get the correct answer with the most efficient code. I was able to get the correct answer but my runtimes were longer than a day. One of my friends said he solved it with hashing and his program ran in 5 minutes.
I had the answer but wanted to see what the most efficient way of solving this problem.
Thank you for your participation!
| [reply] [Watch: Dir/Any] |
Re: Words in Words
by LanX (Saint) on Oct 01, 2011 at 13:25 UTC
|
Hi
> 3) A word cannot match itself as a plural even if it makes a different word (ie: "a" cannot match "as")
Please define "plural"!
In English "children" is the plural of "child", "mice" of "mouse".
And IMHO there are examples where WORDs is not the plural of WORD.
(UPDATE: Like "is" and "I")
| [reply] [Watch: Dir/Any] |
|
"Word1" cannot equal "Word1"
"Word1" cannot equal "Word1" + "s"
"Word1" cannot equal "Word1" + "'s"
That's it!
| [reply] [Watch: Dir/Any] [d/l] |
Re: Words in Words
by perlaintdead (Scribe) on Sep 30, 2011 at 19:29 UTC
|
| [reply] [Watch: Dir/Any] |