PerlMonks  

Re: Re: Finding dictionary words in a string.

by ehdonhon (Curate)
on Mar 13, 2004 at 18:53 UTC (#336396)


in reply to Re: Finding dictionary words in a string.
in thread Finding dictionary words in a string.

What I meant was that the entire string "Hello" gets a better score than "HelloHowAreYou" because it contains fewer words. But either of those gets a better score than "oijwfHellowoifef", for example.

There needs to be some tradeoff, which I'm not quite sure how to judge at the moment. For example, "ThisIsAStringThatGoesOnAndOnAndOnForever" and "perlxmonks" would probably have nearly equivalent scores: the first has no junk characters but a high word count, while the second has only a few junk characters but a low word count.


Re: Re: Re: Finding dictionary words in a string.
by tachyon (Chancellor) on Mar 13, 2004 at 19:57 UTC

    There is not really a trade-off required. Given the task, which is to look for 'good' URLs that match one or more words as closely as possible, you want to do something like this (pseudocode):

        # Get the domain part, dropping "www." and passing back the domain
        # and tld (i.e. .com, .net) or sld.tld (i.e. co.uk, com.au)
        my ($domain, $tld) = get_domain( $url );

        # Chop domain into all possible substrings, say 3-16 chars long; return array ref.
        # There are very few valid well-known words > 16 chars, virtually none > 24 chars
        my $tokens = tokenize( $domain );

        # Get the possible words ordered by length(word) DESC, i.e. longest first.
        # Use a hash lookup or an RDBMS with a dynamically generated SQL IN clause
        my $words = get_real_words_from_dict( $tokens );

        # Substitute out the words; as we remove longest first
        # we avoid finding substrings like 'be' in 'beetle'
        my $residual = $domain;
        my @words = ();
        for my $word ( @$words ) {
            # we may have duplicates of the same word
            push @words, $word while $residual =~ s/\Q$word\E//;
        }

        # Remove '-' from residual so 'come-here' will be two words, no residual
        $residual =~ s/-//g;

        # Work out % residual (leftover chars as a share of the domain length)
        $residual = 100 * length($residual) / length($domain);

        # So now we have our data:
        #   scalar @words  0..N, the number of words found
        #   $residual      the % residual on that domain name
        #   $tld           the top-level domain

        # Say we inserted into a DB table like:
        CREATE TABLE urls (
            url      CHAR(75),
            words    INT,
            residual FLOAT,
            tld      CHAR(10)
        )

        "INSERT INTO urls (url, words, residual, tld) VALUES (?,?,?,?)",
        $url, scalar @words, $residual, $tld

    You can now generate reports. Essentially you want something like:

        SELECT url FROM urls
        WHERE words >= 1
        ORDER BY words ASC, residual ASC, tld ASC
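The pseudocode above leaves tokenize and get_real_words_from_dict undefined. A minimal runnable sketch of those two steps plus the residual calculation might look like the following. The %DICT hash is a hypothetical stand-in for a real word list or database, and split_domain is a name made up here for illustration; neither comes from the original post.

```perl
use strict;
use warnings;

# Hypothetical stand-in dictionary; a real run would load a word list
# or query a database instead.
my %DICT = map { $_ => 1 } qw(hello how are you perl monks come here);

# Return every distinct substring of $domain between $min and $max chars.
sub tokenize {
    my ($domain, $min, $max) = @_;
    $min //= 3;
    $max //= 16;
    my %seen;
    my $n = length $domain;
    for my $start (0 .. $n - $min) {
        for my $len ($min .. $max) {
            last if $start + $len > $n;
            $seen{ substr $domain, $start, $len } = 1;
        }
    }
    return [ keys %seen ];
}

# Keep only tokens found in the dictionary, longest first
# (ties broken alphabetically so the order is repeatable).
sub get_real_words_from_dict {
    my ($tokens) = @_;
    return [
        sort { length($b) <=> length($a) or $a cmp $b }
        grep { $DICT{$_} } @$tokens
    ];
}

# Strip dictionary words out of $domain and report what is left over.
sub split_domain {
    my ($domain) = @_;
    $domain = lc $domain;
    my $words    = get_real_words_from_dict( tokenize($domain) );
    my $residual = $domain;
    my @found;
    for my $word (@$words) {
        # a word may occur more than once
        push @found, $word while $residual =~ s/\Q$word\E//;
    }
    $residual =~ s/-//g;    # hyphens are separators, not junk
    my $pct = length($domain) ? 100 * length($residual) / length($domain) : 0;
    return ( \@found, $pct );
}

my ($found, $pct) = split_domain('perlxmonks');
printf "%d words, %.0f%% residual\n", scalar @$found, $pct;
```

With this toy dictionary, "perlxmonks" yields the words "monks" and "perl" with a single leftover character, which is the low-residual, low-word-count case discussed earlier in the thread.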

    This does not apply limits or add bias for, say, a preference for .com domain names. It will output URLs with a single word first, ordered from lowest to highest residual, then two words, etc. Given what you want, if the residual is > 10-20% you can probably just ignore those URLs and not insert them.
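The same ordering and cutoff can be tried without a database. This is only a sketch: the rows are made-up sample data shaped like the urls table, rank_urls is a hypothetical helper name, and the 20% cutoff is just the upper end of the range suggested above.

```perl
use strict;
use warnings;

# Hypothetical rows, shaped like the urls table: words found and % residual.
my @rows = (
    { url => 'hello.com',            words => 1, residual => 0     },
    { url => 'hellohowareyou.com',   words => 4, residual => 0     },
    { url => 'perlxmonks.org',       words => 2, residual => 10    },
    { url => 'oijwfhellowoifef.net', words => 1, residual => 68.75 },
);

# Drop rows with no words or too much junk, then rank: fewest words first,
# lowest residual first within the same word count.
sub rank_urls {
    my ($rows, $max_residual) = @_;
    return
        map  { $_->{url} }
        sort { $a->{words} <=> $b->{words} or $a->{residual} <=> $b->{residual} }
        grep { $_->{words} >= 1 and $_->{residual} <= $max_residual }
        @$rows;
}

print "$_\n" for rank_urls( \@rows, 20 );
```

Filtering before insertion, as suggested, would give the same result as filtering at report time; doing it here just keeps the threshold easy to adjust.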

    cheers

    tachyon
