PerlMonks  

Re: Re: Junk NOT words

by BrowserUk (Patriarch)
on Nov 01, 2002 at 03:33 UTC ( [id://209630]=note )


in reply to Re: Junk NOT words
in thread Junk NOT words

There are also a whole host of places in English stripped of punctuation where "words" containing no vowels would crop up, e.g. Mr, Mrs, Dr, Jnr, etc.


Nah! You're thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy

Replies are listed 'Best First'.
Re: Re: Re: Junk NOT words
by bart (Canon) on Nov 01, 2002 at 12:54 UTC
    OK, OK, I concur! So there are a few sequences of consonants with no vowel that can be considered a word. However, there aren't many. The whole mechanism can remain the same, except that being left with a string of consonants between words doesn't immediately mean failure. Now you'll have to do an additional check to see whether such a string consists of a sequence of these exceptions. I think they are such a small minority that, for speed, it must be worth extracting them all from a dictionary and storing them separately in a data file before you even start.
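The exception check described above might look roughly like the following. This is only a sketch, written in Python for brevity, and it rests on assumptions not stated in the thread: the names `vowelless_words` and `is_exception_sequence` are hypothetical, the dictionary is taken to be a plain list of words, and 'y' is counted as a vowel.

```python
VOWELS = set("aeiouy")  # assumption: 'y' counts as a vowel here


def vowelless_words(words):
    """Collect the dictionary entries that contain no vowel at all."""
    return {w.lower() for w in words if w and not (set(w.lower()) & VOWELS)}


def is_exception_sequence(run, exceptions):
    """Return True if `run` splits entirely into entries of `exceptions`."""
    run = run.lower()
    # reachable[i] is True when run[:i] can be split into exceptions.
    reachable = [True] + [False] * len(run)
    for end in range(1, len(run) + 1):
        for word in exceptions:
            start = end - len(word)
            if start >= 0 and reachable[start] and run[start:end] == word:
                reachable[end] = True
                break
    return reachable[len(run)]


# Build the exception list once, up front, as suggested above.
dictionary = ["Mr", "Mrs", "Dr", "Jnr", "cat", "the"]
exceptions = vowelless_words(dictionary)
print(is_exception_sequence("mrsdr", exceptions))
print(is_exception_sequence("mrx", exceptions))
```

Because the exception list is tiny, trying every exception at every position stays cheap; in practice the set would be written to a data file once and loaded at startup.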

    Am I alone in feeling that the whole search system I proposed is very similar to how a regex may try to match a pattern, in a "penny machine"? Pick a candidate, try every possibility with it in turn, backtrack...
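The "pick a candidate, try it, backtrack" idea can be illustrated with a small recursive word splitter. This is a hedged sketch rather than the code from the thread: the function name `segment` and the toy word set are made up for illustration, and it tries the longest candidate first, much as a greedy regex quantifier would.

```python
def segment(text, words):
    """Return one split of `text` into entries of `words`, or None."""
    if not text:
        return []  # nothing left to match: success
    # Try the longest prefix first, then progressively shorter ones.
    for end in range(len(text), 0, -1):
        candidate = text[:end]
        if candidate in words:
            rest = segment(text[end:], words)
            if rest is not None:
                return [candidate] + rest
            # This candidate led to a dead end: backtrack to a shorter one.
    return None


words = {"them", "the", "men"}
# "them" is tried first, leaves "en" unmatched, so the search backs off to "the".
print(segment("themen", words))
```

Like a backtracking regex engine, this can take exponential time on unlucky input; caching the results for each remaining suffix would be the usual remedy.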

      Sorry Bart. I read your original post in isolation from the full thread and hadn't realised that I was repeating what others had already said.

      If you've seen my attempt at this at Re: Junk NOT words, you'll have seen that my word list manages to match just about anything with one or two characters as a word. I decided to go through the 1 & 2 char entries by hand and remove those that were nonsensical, but discovered to my surprise that many more of them are valid in some contexts than you might suppose.

      For instance, 'x' - Outside of math or computing this doesn't seem like a valid word, but I ran across two uses in a scan of the correspondence that I have sent and received. The first was in the phrase "X marks the spot", the second in an email from my sister signed "x. jj".

      In other notes this became "xx. jj" and "xxx. jj". I guess I'm more lovable at some times than at others. 'jj' are her first 2 initials BTW, so that meant that had to stay. Ah! 'BTW', there's another one. And so it went on. I found it extremely difficult to remove any of the single chars or many of the digraphs, as I could, without much effort, find (or think of) legitimate cases where they could crop up in 'normal' correspondence.

      I wasn't jumping on the bandwagon with this, just reflecting my own, somewhat surprising discovery.


