Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

creating crystal ball

by grizzley (Chaplain)
on Apr 21, 2008 at 12:37 UTC ( [id://681906]=perlmeditation: print w/replies, xml ) Need Help??

Quite often I have to type some translations with specific topic and specific vocabulary. As a professional programmer, I am very lazy and would like the computer to type it for me :) I started thinking about some textbox with autocompletion option, which will learn from what I am typing, find same parts in already typed text and offer hints. My native language is Polish, where declination and conjugation complicates the problem a lot, but I wanted start in english just to see how much work will take even simpliest solution.

What is the goal? Having text of n characters I would like to type only k letters, where k aims to 0. Without any dictionary at the beginning. Ok, there can be one: I open some file with other translation and on that basis the application initializes its dictionary/structures/autocompletion engine.

Well, what workers could I borrow from Perl and employ?

1. Word autocompletion. Application stores every typed word in some struct and lists matching words while I start typing another one. It's fast and simple to implement. And saves me much work, but only for long words.

2. Context search. There are regexps with negative, positive assertions. Hey, why don't do contextual search and try to guess next word without typing any char? It would be also really helpful for phrases like "on the top of" or "why don't you go to the". But this topic is really showing almost infinite possibilities. Actually I don't really need 3rd option...

Maybe some example text:

An entry is a widget that displays a one-line text string and allows
that string to be edited using methods described below, which are
typically bound to keystrokes and mouse actions. When first created,
an entry's s

's' at the end is the beginning of 'string' word. All letters 's' at the beginning of word are marked bold. Two matches. Ok, we take first one always. And now, time for the first really important question:

How much to autocomplete?

  • To the end of the text?
  • To the first period or coma?
  • First x chars? (rounding, so the last word is not cut)

The problem is that autocompletion cannot offer unneeded chars at the end, because work saved on typing, will be spent on deleting those extra chars/words. Maybe just 2 modes of autocompletion? Which are: it shows e.g. 5-word long phrase and with 'enter' I accept it all, and with 'tab' - only first word? Let's save this idea for later, can be useful :)

What then? After accepting only part of autocompletion, what now? Should autocomplete create new list of possible hints, or do a

contextual search? How far?

Yes, positive look-behind assertion comes in mind, isn't it? :) I move it aside as I want to think more abstract right now, and after few seconds it shows up again and again :) Like mosquito :) Definitely it is good place for it. Go away, you little disturbing bastard, there will be time for you!

I must think once more, how much back should it look actually to do the right job? Should it look one word back? Two words? To the coma/period? Or maybe all those possibilities together? Where is the furthest limit? The beginning of the document... Well,

how much resources should I use?

Can I put just all possible variations ('it', 'it is', 'it is the', 'it is the string') to the cleverly constructed hash/array? Let's hope it won't slow down the thing awfully and won't eat all my RAM... But will it be really useful? Normally in the texts the same parts of text are not so long - few words at the most. Is there really a reason to look for 'let me finish this task for you and in this time you can do us both a tee'? I don't think so. And of course is it acceptable if the application takes 2 minutes to process all the text only to show 20 letters long super clever autocompletion? Definitely not, in 2 minutes I can write many letters, a lot more than 20. This must work fast to be helpful.

Emerging picture of the situation.

1000 words written, cursor at the end waiting for 1001st word. Let's operate on whole text. Now the simpliest regexp on the world looks for all word boundaries /\b/. Ca. 2000 matches. Application takes every matched position, looks behind (how far? 6 words?), compares with text just before cursor (how far? at least one word?), looks forward (how far? 15 letters?) and suggests something (how far? 15 letters too?). How many possibilities will there be?

Continue reading...

Replies are listed 'Best First'.
Re: creating crystal ball
by roboticus (Chancellor) on Apr 21, 2008 at 13:40 UTC
    grizzley:

    Sorry ... I couldn't wait for you to post the next chunk. It's an interesting question, and while reading it, I couldn't help thinking about LZW compression. I'm not sure if you're familiar with it or not, but I think you could adapt it to your problem, especially considering:

    What is the goal? Having text of n characters I would like to type only k letters, where k aims to 0. Without any dictionary at the beginning.
    ...snip...
    1. Word autocompletion. Application stores every typed word in some struct and lists matching words while I start typing another one.
    ...snip...
    2. Context search. There are regexps with negative, positive assertions. Hey, why don't do contextual search and try to guess next word without typing any char?

    Using an LZW-like table would allow you to construct your dictionary as you use it. So as you work on a project, it would continue to improve its autocompletion guesses. Since you're not worrying about compression/decompression, you can easily edit the table. If you add a timestamp to the nodes, you could even prefer the most-recently-used words in preference to the most common. (Thus, the variable '$food' in your current project could be preferred to '$foobar' that you used in your example programs yesterday.)

    As far as regexps are concerned, if you use the LZW-table idea, you wouldn't use regexes at all. You'd instead build a state machine where the last "n" keystrokes would be encoded into the current state, and you could examine the edge transitions from the current state and propose words/phrases based on usage.

    As far as how much to autocomplete: I'm thinking that with the LZW table, you'd default to autocomplete all the "common" edges until you hit a steep drop in probability. Then you could use gestures to alter the amount to autocomplete. Something like: space would autocomplete the current word, tab would autocomplete the current "best guess". (If you display more than one at a time, you could use the down arrow to discard the current best guess...) An shifted character could accept up to and including that character in the current best guess, and a normal unshifted character could be entered normally (causing your table to update probabilities, etc.).

    Regarding resource usage: Memory is cheap, so you might choose a (reasonably large) number of edges as a target, and if your table reaches that size, you might cull 5-10% of the edges--those with the lowest probability.

    Other ideas that come to mind:

    • Allow the user to save the current table as their "baseline" to be used to start all other documents. (Of course, you'd want to offer the ability to have several baselines, so each project might have a different one, or one for documentation and one for coding, etc.)
    • You could store the table in a reasonably understandable format so it could be edited by hand...
    • If you hook the keyboard such that you can capture the control, alt, shift etc. keys, you could easily add some extra command gestures.
    • You could have small packages of "aliases" that you could load on demand, like those in vim or bash by simply adding a set of edges with very high usage counts.
    • You could even use time as a gesture: It might propose a common phrase, but you don't accept it. After a delay, it could trim off the end word and wait a little more. So you could simply wait for it to trim it to size if you like. (I think that would drive me nuts, but others might like it.....)

    Hmmm.... now I've got it percolating in the back of my head. If you decide to start implementing it with this table-based idea, let me know, and I'll jot down my ideas for the data structure and such. Be sure to keep us apprised of this project, it sounds fun!

    Can I put just all possible variations ('it', 'it is', 'it is the', 'it is the string') to the cleverly constructed hash/array? Let's hope it won't slow down the thing awfully and won't eat all my RAM... But will it be really useful? Normally in the texts the same parts of text are not so long - few words at the most. Is there really a reason to look for 'let me finish this task for you and in this time you can do us both a tee'? I don't think so.

    One thing I like about the node/edge table-driven approach is that since you're culling low-probability edges, high probability strings can build to be arbitrarily long. So if you were tasked with typing your example sentence a hundred times as punishment, the sentence would quickly become the next predicted text.

    And of course is it acceptable if the application takes 2 minutes to process all the text only to show 20 letters long super clever autocompletion? Definitely not, in 2 minutes I can write many letters, a lot more than 20. This must work fast to be helpful.

    Yep ... that's the sticking point. I'm a touch-typist, and long delays would irritate me to death. But with computers getting faster and faster, you can throw more power at it each year. It's an interesting problem ... I'm looking forward to seeing what you come up with.

    ...roboticus
      ... and while reading it, I couldn't help thinking about LZW compression. I'm not sure if you're familiar with it or not, but I think you could adapt it to your problem

      Funnily, the original post was making me think about Prediction by Partial Matching compression!

      Sorry ... I couldn't wait for you to post the next chunk. It's an interesting question...

      Good, that you didn't wait. I plan to enter next chunk in a few days rather than in a few hours. You have almost written that for me, because in a natural way thinking about the problem goes in direction 'probabilities', 'mostly used', etc. Let me answer this part in another meditation.

      ...and while reading it, I couldn't help thinking about LZW compression. I'm not sure if you're familiar with it or not...

      Yes, I know about compression algorithms, maybe not in detail, but overall technology. It will help storing those all possibilities, no doubt. Made me think about another approach: offer common 2-, 3-letter long hints, e.g. 'thr', 'in', 'st'. But it won't work. Either I'll end learning to stenotype (and learning new alphabet/set of symbols) or spending time accepting offered hints. I mean, what is quicker: to choose from 'is', 'if', 'in' or just to write letter 'i' and letter 's'?

      Something like: space would autocomplete the current word, tab would autocomplete the current "best guess".

      Space is not a good thing to accept hint while typing text full of spaces. Rather Tab and Enter. But it's detail and matter of personal preference. One can make shortcuts configurable.

      Regarding resource usage...

      I think it is not yet time for bothering about resources. You are right, memory is cheap, CPUs fast. A propos, I 'discovered' what efficiency is needed for final solution: it depends on user's typing speed. If user types 10 CPM(chars per minute), algorithm should run 10 times per second, for me it will be ca. 40 CPM, if there is any 100CPM magician willing to use it, it would be nice to have it working with frequency 100/s. Tough constraint...

      If you decide to start implementing it with this table-based idea, let me know, and I'll jot down my ideas for the data structure and such. Be sure to keep us apprised of this project, it sounds fun!

      You can be sure I won't keep it hidden :) I want to split the fun to following parts:

      • limit the problem, so that it fits in my head :) Much fun. It's almost finished.
      • implement some solution(s). More fun.
      • optimize it for speed! So much fun! Is anyone here who doesn't like this step? :)
      • golf it! Don't have to mention how much fun it is :)
      ('Project'? He called this a 'project'... I though I'll end up with one file, maybe 50 lines long?...)

      P.S. How do you, people, do it?!? I mean, writing nodes so quick! It takes time to preview and format correctly and you give almost immediate (and so long!) response! Admit, you had this answer prepared for a few weeks and just waited for proper question... :)

        grizzley:
        P.S. How do you, people, do it?!? I mean, writing nodes so quick! It takes time to preview and format correctly and you give almost immediate (and so long!) response! Admit, you had this answer prepared for a few weeks and just waited for proper question... :)

        Well ... you almost answered it yourself earlier...

        if there is any 100CPM magician willing to use it, it would be nice to have it working with frequency 100/s. Tough constraint...

        I first touched a computer (terminal) in high school, and immediately realized that I'd be using them for the rest of my life......so I took typing classes the very next semester. The large number of pretty girls in the class had no bearing on my decision! ;^) It was perhaps the most useful course I took.

        One thing about being a relatively fast typist, though...I can't *stand* it when an IDE tries to help me type or predict what I'm going to do next. There's a "flow-state" you're in when you touch-type, and making decisions on choosing text fragments keeps pulling you out of the flow, so it ultimately slows me down. So even if you're successful in this project, I still won't be able to use it... ;^P

        ...roboticus

      Your point about processing speed is right. However, even if your computer is very fast, you have to take into account the speed the user can think. If the autocompletion is not predictable, and the user keeps having to watch the screen for whether the completion is right, you lose lots of speed.

      Also, compression reminds me to that christmas carol.

        You are right when one person reads text from a piece of paper and writes. It will help only in case: "Hey, let's split the work: read it loud, and I'll write. This way we will finish faster." (like in the joke: why policemen always go two by two? Because always one of them can read, and the other one can write)
Re: creating crystal ball
by blahblahblah (Priest) on Apr 22, 2008 at 04:28 UTC
    You might also check out some of the tools here: Computer-assisted translation. There are some open-source ones, so you could peak into how they're implemented.
Re: creating crystal ball
by starbolin (Hermit) on Apr 24, 2008 at 21:51 UTC

    My first reaction paralleled roboticus when he said:

    One thing about being a relatively fast typist, though...I can't *stand* it when an IDE tries to help me type or predict what I'm going to do next. There's a "flow-state" you're in when you touch-type, and making decisions on choosing text fragments keeps pulling you out of the flow, so it ultimately slows me down. So even if you're successful in this project, I still won't be able to use it... ;^P
    But I recently have been entering code ( or attempting to enter code ) on my PDA and I find myself using Bash alias a lot, creating many small template files, and using vi abbreviations. The auto-completion tool that was provided was useless for Perl code as it lacks any kind of context sensitivity and is too heavily weighted towards short tokens; so I think your idea has merit in "alternate interface" applications.


    s//----->\t/;$~="JAPH";s//\r<$~~/;{s|~$~-|-~$~|||s |-$~~|$~~-|||s,<$~~,<~$~,,s,~$~>,$~~>,, $|=1,select$,,$,,$,,1e-1;print;redo}

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://681906]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (6)
As of 2024-03-19 14:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found