Re: Finding dictionary words in a string.

The scoring you describe seems like the more interesting part, and could guide the choice of a search method, but there's not enough detail in the OP figure it out.

In particular, there needs to be some sort of trade-off between the two scoring criteria: "strings that are composed entirely of words get a better score" is somewhat at odds with "strings with the fewest number of words will get a higher score". Are you talking about solutions that involve just non-overlapping words? If so, then you would need to compare various possible parses of the input string. How would you score a string like "labsolve"? Does "lab, solve" score better or worse than "absolve, one letter left unused"?

In either case, it might be helpful to have your dictionary stored as an array of hashes; the index into the array is the length of words in that hash. (The hashes themselves could be sub-indexed by first letter, perhaps, or first two or three letters, as suggested by the AM in the first reply.) This way, you know before starting on the input string what the longest substring is that you need to match against, and for each substring that you test, you only need to compare it to a limited number of dictionary entries.

Comment on Re: Finding dictionary words in a string.

Replies are listed 'Best First'.
Re: Re: Finding dictionary words in a string. by ehdonhon (Curate) on Mar 13, 2004 at 18:53 UTC
What I meant was that the entire string "Hello" gets a better score than "HelloHowAreYou" because there are fewer words. But either of those get a better score than "oijwfHellowoifef", for example. There needs to be some tradeoff, which I'm not quite sure how to judge at the moment. For example "ThisIsAStringThatGoesOnAndOnAndOnForever" and "perlxmonks" would probably have a nearly equivalent score because the first one doesn't have any junk characters, but has a high word count while the second one has only a few junk characters, but a low word count	[reply]
Re: Re: Re: Finding dictionary words in a string. by tachyon (Chancellor) on Mar 13, 2004 at 19:57 UTC
There is not really a trade off required. Given the task which is to look for 'good' urls that match as closely as possible 1 or more words you want to do something like (pseurocode) # get the domain part dropping www. and passing back the # domain and tld (ie .com .net) or sld.tld (ie co.uk, com.au ) my ($domain, $tld) = get_domain( $url ); # chop domain into all possible substrings say 3-16 chars long, retrun + ary ref # there are very few valid well known words > 16 chars, virtually none + > 24 chars my $tokens = tokenize( $domain ); # get the possible words ordered by length(word) DESC ie longest first # use a hash lookup or a RDBMS with a dynamicly generated SQL IN claus +e my $words = get_real_words_from_dict( $tokens ) # substitute out the words, as we remove longest first # we aviod finding substrings like 'be' in 'beetle' my $residual = $domain; my @words = (); for my $word( @$words ) { # we may have duplicates of same word push @words, $word while $residual =~ s/\Q$word\E/; } # remove '-' from residual so 'come-here' will be two words, no residu +al $residual =~ s/-//g; # work out % residual $residual = 100*$residual/$domain; # So now we have our data # @words 0..N is number of words found # $residual is the %residual on that domain name # $tld is the domain name # say we inserted into a Db table like: CREATE TABLE urls ( url CHAR(75), words INT, residual FLOAT, tld CHAR(10), ) "INSERT INTO urls (url,words, residual, tld) VALUES(?,?,?,?)", $url, scalar @words, $residual, $tld You can now generate reports. Essentially you want something like: SELECT url FROM urls WHERE words >= 1 ORDER BY words ASC, residulal ASC GROUP BY tld [download] This does not apply limits or add bias for say a pref for .com domain names. It will output urls with single word, lowest->highest residual first, then two words etc. Given what you want if the residual is > 10-20% you can probably just ignore those URLs and not insert them. cheers tachyon	[reply] [d/l]


Perl-Sensitive Sunglasses
	PerlMonks