PerlMonks  

Re: Re: Re: Junk NOT words

by bart (Canon)
on Nov 01, 2002 at 12:54 UTC ( [id://209694]=note )


in reply to Re: Re: Junk NOT words
in thread Junk NOT words

OK, OK, I concur! So there are a few sequences of consonants with no vowel that can be considered a word. However, there aren't many. The whole mechanism can remain the same, except that being left with a string of consonants between words doesn't immediately mean failure. Now you'll have to do an additional check to see whether such a string consists of a sequence of these exceptions. I think they are such a small minority that, for speed, it must be worth extracting them all from a dictionary and storing them separately in a data file, before you even start.
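A minimal sketch of that extra check, in Python rather than Perl purely for illustration. Everything here is an assumption and not code from the thread: the names `vowelless_words`, `build_exception_matcher` and `is_exception_run` are hypothetical, the tiny word list is made up, and 'y' is treated as a consonant. The exceptions are pulled out of the dictionary once, up front, and a leftover consonant run passes only if it is made up entirely of those exceptions.

```python
import re

VOWELS = set("aeiou")  # assumption: 'y' counts as a consonant here


def vowelless_words(words):
    """Return the dictionary entries that contain no vowel at all."""
    return {
        w.lower()
        for w in words
        if w.isalpha() and not (set(w.lower()) & VOWELS)
    }


def build_exception_matcher(exceptions):
    """Compile a regex matching one or more exceptions back to back.

    Returns None when there are no exceptions to match.
    """
    if not exceptions:
        return None
    alternation = "|".join(re.escape(w) for w in sorted(exceptions))
    return re.compile(f"(?:{alternation})+")


def is_exception_run(run, matcher):
    """True if the whole consonant run is a sequence of exception words."""
    return matcher is not None and matcher.fullmatch(run.lower()) is not None


# Made-up word list; in practice this would be read from the dictionary file
# once and the exceptions stored separately.
word_list = ["cat", "hmm", "nth", "dog", "tv", "brr"]
exceptions = vowelless_words(word_list)
matcher = build_exception_matcher(exceptions)
```

With this word list, `is_exception_run("tvhmm", matcher)` succeeds because the run splits into "tv" and "hmm", while `is_exception_run("nthx", matcher)` fails because the trailing "x" is not an exception.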

Am I alone in feeling that the whole search system, as I proposed it, is very similar to how a regex engine may try to match a pattern, in a "penny machine"? Pick a candidate, try every possibility with it in turn, backtrack...
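To make the comparison concrete, here is a rough Python sketch of that pick-try-backtrack loop applied to splitting a string into dictionary words. It is an illustration under assumptions, not the code from the thread: `segment` is a hypothetical name, the word set is invented, and trying the longest candidate first is just one possible ordering.

```python
def segment(text, words, start=0):
    """Return one split of text[start:] into dictionary words, or None."""
    if start == len(text):
        return []  # consumed everything: success
    # Pick a candidate word starting at `start`, longest first.
    for end in range(len(text), start, -1):
        candidate = text[start:end]
        if candidate in words:
            rest = segment(text, words, end)
            if rest is not None:
                return [candidate] + rest
            # Dead end further on: backtrack and try a shorter candidate.
    return None  # no candidate works from this position


words = {"the", "then", "no", "thing"}
result = segment("thenothing", words)
```

Here the search first commits to "then", finds that "othing" cannot be matched, backs up, and settles on "the", "no", "thing". Without memoization the worst case is exponential, much like a badly behaved regex.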

Replies are listed 'Best First'.
Re: Re: Re: Re: Junk NOT words
by BrowserUk (Patriarch) on Nov 01, 2002 at 13:24 UTC

    Sorry Bart. I read your original post in isolation from the full thread and hadn't realised that I was repeating what others had already said.

    If you've seen my attempt at this at Re: Junk NOT words you'll have seen that my word list manages to match just about anything with one or two characters as a word. I decided to go through the 1 & 2 char entries by hand and remove those that were nonsensical, but discovered to my surprise that many more of them are valid in some contexts than you might suppose.

    For instance, 'x' - Outside of math or computing this doesn't seem like a valid word, but I ran across two uses in a scan of the correspondence that I have sent and received. The first in the phrase "X marks the spot", the second in an email from my sister signed "x. jj".

    In other notes this became "xx. jj" and "xxx. jj". I guess I'm more loveable at some times than others. 'jj' are her first 2 initials BTW, so that meant that had to stay. Ah! 'BTW', there's another one. And so it went on. I found it extremely difficult to remove any of the single chars or many of the digraphs, as I could, without much effort, find (or think of) legitimate cases where they could crop up in 'normal' correspondence.

    I wasn't jumping on the bandwagon with this, just reflecting my own, somewhat surprising discovery.


    Nah! You're thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy
