Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Re: Finding dictionary words in a string.

by graff (Chancellor)
on Mar 13, 2004 at 18:40 UTC ( [id://336394]=note: print w/replies, xml ) Need Help??


in reply to Finding dictionary words in a string.

The scoring you describe seems like the more interesting part, and could guide the choice of a search method, but there's not enough detail in the OP figure it out.

In particular, there needs to be some sort of trade-off between the two scoring criteria: "strings that are composed entirely of words get a better score" is somewhat at odds with "strings with the fewest number of words will get a higher score". Are you talking about solutions that involve just non-overlapping words? If so, then you would need to compare various possible parses of the input string. How would you score a string like "labsolve"? Does "lab, solve" score better or worse than "absolve, one letter left unused"?

In either case, it might be helpful to have your dictionary stored as an array of hashes; the index into the array is the length of words in that hash. (The hashes themselves could be sub-indexed by first letter, perhaps, or first two or three letters, as suggested by the AM in the first reply.) This way, you know before starting on the input string what the longest substring is that you need to match against, and for each substring that you test, you only need to compare it to a limited number of dictionary entries.

  • Comment on Re: Finding dictionary words in a string.

Replies are listed 'Best First'.
Re: Re: Finding dictionary words in a string.
by ehdonhon (Curate) on Mar 13, 2004 at 18:53 UTC

    What I meant was that the entire string "Hello" gets a better score than "HelloHowAreYou" because there are fewer words. But either of those get a better score than "oijwfHellowoifef", for example.

    There needs to be some tradeoff, which I'm not quite sure how to judge at the moment. For example "ThisIsAStringThatGoesOnAndOnAndOnForever" and "perlxmonks" would probably have a nearly equivalent score because the first one doesn't have any junk characters, but has a high word count while the second one has only a few junk characters, but a low word count

      There is not really a trade off required. Given the task which is to look for 'good' urls that match as closely as possible 1 or more words you want to do something like (pseurocode)

      # get the domain part dropping www. and passing back the # domain and tld (ie .com .net) or sld.tld (ie co.uk, com.au ) my ($domain, $tld) = get_domain( $url ); # chop domain into all possible substrings say 3-16 chars long, retrun + ary ref # there are very few valid well known words > 16 chars, virtually none + > 24 chars my $tokens = tokenize( $domain ); # get the possible words ordered by length(word) DESC ie longest first # use a hash lookup or a RDBMS with a dynamicly generated SQL IN claus +e my $words = get_real_words_from_dict( $tokens ) # substitute out the words, as we remove longest first # we aviod finding substrings like 'be' in 'beetle' my $residual = $domain; my @words = (); for my $word( @$words ) { # we may have duplicates of same word push @words, $word while $residual =~ s/\Q$word\E/; } # remove '-' from residual so 'come-here' will be two words, no residu +al $residual =~ s/-//g; # work out % residual $residual = 100*$residual/$domain; # So now we have our data # @words 0..N is number of words found # $residual is the %residual on that domain name # $tld is the domain name # say we inserted into a Db table like: CREATE TABLE urls ( url CHAR(75), words INT, residual FLOAT, tld CHAR(10), ) "INSERT INTO urls (url,words, residual, tld) VALUES(?,?,?,?)", $url, scalar @words, $residual, $tld You can now generate reports. Essentially you want something like: SELECT url FROM urls WHERE words >= 1 ORDER BY words ASC, residulal ASC GROUP BY tld

      This does not apply limits or add bias for say a pref for .com domain names. It will output urls with single word, lowest->highest residual first, then two words etc. Given what you want if the residual is > 10-20% you can probably just ignore those URLs and not insert them.

      cheers

      tachyon

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://336394]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (5)
As of 2024-04-19 21:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found