|Perl: the Markov chain saw|
incremental reading of utf8 input handlesby Tanktalus (Canon)
|on Jul 06, 2012 at 17:24 UTC||Need Help??|
Tanktalus has asked for the
wisdom of the Perl Monks concerning the following question:
The joys of I18N. (That's internationalisation - I, 18 characters, N - for those who have been fortunate enough not to have to worry about it.) My code will be running in an arbitrary language (we support 9 or 10, including Japanese, which I'll use for my example here). When we run subprocesses, we will sometimes run them in that language, and sometimes in LANG=C, depending on whether the output might be displayed to the user. In cases where the output may be shown to the user, will will then take the text, embed it in our own text (using Template toolkit), and then send it to the user (via Log::Log4perl sending it to both STDOUT and a log file). Most of this is working.
However, one issue still remains. Reading from those subprocesses. If the output is small enough not to flood any buffer, setting the input stream layer to :utf8 seems to work. However, if one of the many buffers involved (remote libC, remote kernel, remote sshd, remote TCP stack, switch, local TCP stack, local kernel, local ssh, local libC, AnyEvent's sysread) manages to split a UTF-8 character, there is the concern that the utf8 layer will not handle this, and the decoding could go awry when I try to interpret it and store it. The key part here is "interpret" - if I don't care about the interim output, it's all fine as I could theoretically just suck it all in to a single scalar and decode it at the end. However, sometimes I need to watch for key words, phrases, tokens, whatever (and, yes, this is more problematic if the text is translated, but that's not part of this question), and report back to the user some sort of progress. So I need to utf8-decode this text incrementally.
So, the question is: is there any way to be sure I have valid UTF8 characters, and to decode them, without losing partial characters? Something like:
The major part of that is if that regex is right, or, if not, what would be the appropriate way to determine this?
Most of the time, I want this done, I'll have to signal when I'm using Storable in the subprocess and thus don't want any translation done, but that's a SMOP. And, if it makes things any easier, I think that line-oriented is sufficient. So, as nice as a more-general solution would be, if I understand utf8 right, the "\n" character should only show up in the byte stream if it's an actual "\n" character, and not as a secondary (or later) byte in a utf-8 character. So, would looking for whole lines be sufficient (s/^([^\n]*\n)//) to ensure I don't break up any utf-8 characters, saving the rest of the bytes for the next time through the loop? The check here would be that if the last line doesn't have a terminator, I still have to handle it, but, again, that's a SMOP.
I'm somewhat new to the whole utf8 thing, and I'm still the go-to expert on the team :-S so I want to get it right :-)