Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

creating crystal ball 2

by grizzley (Chaplain)
on May 19, 2008 at 19:56 UTC ( [id://687457]=perlmeditation: print w/replies, xml ) Need Help??

First part

Every time I have a while and go back to the problem, I am not able to think very much forward. Maybe there is nothing more to think about? So let's add few words and finish the topic.

In the first part I wrote "All letters 's' at the beginning of word are marked bold. Two matches. Ok, we take first one always.". It definitely has to be changed. What options do we have here?

  1. Penalty for not being used / bonus for being used - mostly used ones in first place
    Pointed already by roboticus. Obvious. No question about it. If something has been used already - it is good and will be suggested to be reused. Don't think about pieces of toilet paper here...
  2. longest look-behind on the first place
    Obvious, too. If there is look-behind match 9-chars-long, the place is definitely better candidate than the 1-char-long one.
  3. common part from all proposed strings
    Finding common part even of few strings can be very difficult and I do not expect this to return any useful result.
  4. common part from some of proposed strings
    however, is promising one. How to decide which subset should be chosen? Best 10 strings chosen with help of 1st or 2nd method? Maybe, like in statistic research, take all proposals and remove e.g. 30% edge cases? What are edge cases? (String::Approx and Text::Levenshtein comes to mind, but as the thing is to predict exact string, I don't know if it will be of any use) Or maybe sort alphabetically all strings and return the biggest group starting with the same letter?
  5. closest to the current position
    This idea assumes that the paragraph/chapter is about something (e.g. 'unemployment') and similar words are near. See how definition of unemployment looks in wikipedia:
    Unemployment is the state in which a person is without work, available to work, and is currently seeking work. The unemployment rate is used in economic studies and economic indexes such as the United States' Conference Board's Index of Leading Indicators.
    Check the words: 'Unemployment', 'work', 'economic'.

And Yet Another Idea to consider during implementation: mark with digits 10 best candidates in text with possibility to use via Ctrl+<digit>, Ctrl+<left/right arrow> adds/removes parts of text (words), Enter accepts.

I was reading about LZW, Huffmann, Markov Chains, PPM, PAQ and other compression techniques suggested by you, guys. I don't exactly know how to use it to complement text. These algorithms operate on chars instead of words and this has made me think about following: suggesting one char is of no use. But what about suggesting 2 more chars every time user presses predicted char? Normal text at the beginning is what's already typed, bold is suggested(selection), and the rest light grayed to show what computer "has in mind".

unemployment
unemployment
unemployment
unemployment
unemployment
unemployment

Almost 50% of typing saved... Or better, 2 chars at the beginning, 3 in the next step, etc.

unemployment
unemployment
unemployment
unemployment
unemployment

Well, so much for theory. I plan to start implementing it in a 2-3 weeks as I have now other important things to do, so please be patient :)

Replies are listed 'Best First'.
Re: creating crystal ball 2
by blokhead (Monsignor) on May 19, 2008 at 22:04 UTC
    I googled for "text prediction" and found soothsayer (homepage) as one of the first hits. Sounds like a programming library that does exactly what you're looking for.

    Another thing that comes to mind is Dasher, although its novelty is not the text-prediction but its accessible interface. However, its text prediction seems good, it can be trained, and it is fast in real-time. It probably uses some hidden Markov mechanism for prediction. Since it is open source, the prediction mechanism can probably be factored out (if it isn't already) and adapted to your interface needs.

    Even if you are determined to do things yourself (yikes), you can at least learn a lot by seeing how the others do it.

    blokhead

      Pity, everything is already written... Yeah, of course I'll try this at home. If it will work fine, it may effectively discourage me to write my own piece of Tk code. Trying Dasher online was really interesting, especially connected with speech. Little slow, but interesting.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://687457]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (5)
As of 2024-03-19 08:46 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found