in reply to Byte counts and Seek function
You're in for a world of pain if you try to mix byte counts with UTF-8, because a single character (code point) may be encoded as more than one byte, and what looks like one glyph on screen may be made up of several code points. seek doesn't take variable-width encodings into account. It only counts bytes.
(I don't know what your utf8 function does, so I can't comment on what your call to encode does.)
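To see the mismatch concretely, here is a minimal sketch. The sample string is made up, and it uses encode from the core Encode module, which may or may not be what your own code calls:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Made-up sample: five characters, one of which (U+00EF) needs two bytes in UTF-8.
my $text  = "na\x{EF}ve";
my $bytes = encode('UTF-8', $text);

my $char_count = length $text;     # length of the character string
my $byte_count = length $bytes;    # length of the encoded bytes, the unit seek works in

print "characters: $char_count, bytes: $byte_count\n";
# prints "characters: 5, bytes: 6"
```

Any offset computed from the character length drifts by one byte for each such character, and by more for characters that need three or four bytes.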
Seems to me that it would be easier to call tell when you read in a sentence and keep that position around, rather than try to reconstruct it from the data you've read (and decoded, possibly normalized, et cetera).
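Something along these lines, as a rough sketch rather than a drop-in fix. I'm assuming one sentence per line and a UTF-8 file; the temp file and its contents are invented for the demo. It reads raw bytes and decodes each line by hand, so the offsets from tell are plain byte positions that seek can use directly:

```perl
use strict;
use warnings;
use Encode qw(decode);
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

# Invented sample data: three sentences, written out as UTF-8.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw:encoding(UTF-8)';
print {$tmp_fh} "caf\x{E9} au lait.\n", "na\x{EF}ve r\x{E9}sum\x{E9}.\n", "plain ascii.\n";
close $tmp_fh or die "close: $!";

# Read bytes, not characters, so tell and seek agree on the unit.
open my $fh, '<:raw', $path or die "open $path: $!";

my @offsets;    # byte offset at which each sentence starts
while (1) {
    my $pos = tell $fh;    # remember the position before reading
    my $raw = <$fh>;
    last unless defined $raw;
    push @offsets, $pos;
    my $sentence = decode('UTF-8', $raw);    # decode (normalize, etc.) afterwards
}

# Later: jump straight back to the second sentence using the saved offset.
seek $fh, $offsets[1], SEEK_SET or die "seek: $!";
my $again = decode('UTF-8', scalar <$fh>);
chomp $again;
close $fh;

binmode STDOUT, ':encoding(UTF-8)';
print "second sentence starts at byte $offsets[1]: $again\n";
```

The point is that the offsets are never derived from the decoded text, so it doesn't matter how many bytes each character took or what you did to the string after reading it.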