incremental reading of utf8 input handles

Tanktalus has asked for the wisdom of the Perl Monks concerning the following question:

The joys of I18N. (That's internationalisation - I, 18 characters, N - for those who have been fortunate enough not to have to worry about it.) My code will be running in an arbitrary language (we support 9 or 10, including Japanese, which I'll use for my example here). When we run subprocesses, we will sometimes run them in that language, and sometimes in LANG=C, depending on whether the output might be displayed to the user. In cases where the output may be shown to the user, we will then take the text, embed it in our own text (using Template Toolkit), and then send it to the user (via Log::Log4perl sending it to both STDOUT and a log file). Most of this is working.

However, one issue still remains. Reading from those subprocesses. If the output is small enough not to flood any buffer, setting the input stream layer to :utf8 seems to work. However, if one of the many buffers involved (remote libC, remote kernel, remote sshd, remote TCP stack, switch, local TCP stack, local kernel, local ssh, local libC, AnyEvent's sysread) manages to split a UTF-8 character, there is the concern that the utf8 layer will not handle this, and the decoding could go awry when I try to interpret it and store it. The key part here is "interpret" - if I don't care about the interim output, it's all fine as I could theoretically just suck it all in to a single scalar and decode it at the end. However, sometimes I need to watch for key words, phrases, tokens, whatever (and, yes, this is more problematic if the text is translated, but that's not part of this question), and report back to the user some sort of progress. So I need to utf8-decode this text incrementally.

So, the question is: is there any way to be sure I have valid UTF8 characters, and to decode them, without losing partial characters? Something like:

$buf =~ s/^((?:[\x00-\x7f]+|[\xc0-0xff][\x80-\xbf]+)*)//;
my $newtext = $1;
utf8::decode($newtext);
$text_so_far .= $newtext;

The major part of that is if that regex is right, or, if not, what would be the appropriate way to determine this?

Most of the time I want this done; I'll have to signal when I'm using Storable in the subprocess and thus don't want any translation done, but that's a SMOP. And, if it makes things any easier, I think that line-oriented is sufficient. So, as nice as a more general solution would be, if I understand utf8 right, the "\n" byte should only show up in the byte stream if it's an actual "\n" character, and not as a secondary (or later) byte in a utf-8 character. So, would looking for whole lines be sufficient (s/^([^\n]*\n)//) to ensure I don't break up any utf-8 characters, saving the rest of the bytes for the next time through the loop? The catch here would be that if the last line doesn't have a terminator, I still have to handle it, but, again, that's a SMOP.
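The line-oriented idea can be sketched as follows. It relies on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so a 0x0A byte is always a real newline. The helper name take_lines and the sample text are made up for illustration; this is a minimal sketch, not a tested production routine.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: append raw bytes to a pending buffer and return the
# complete lines, decoded. Bytes after the last "\n" stay in $$pending_ref.
sub take_lines {
    my ($pending_ref, $chunk) = @_;
    $$pending_ref .= $chunk;
    my @lines;
    while ($$pending_ref =~ s/\A([^\n]*\n)//) {
        my $line = $1;    # copy, because decode() with a CHECK value consumes its source
        # FB_CROAK makes malformed input die instead of passing silently.
        push @lines, decode('UTF-8', $line, Encode::FB_CROAK);
    }
    return @lines;
}

# Made-up sample output: two lines of Japanese text as UTF-8 bytes.
my $bytes   = encode('UTF-8', "\x{9032}\x{6357}: 50%\n\x{5b8c}\x{4e86}\n");
my $pending = '';
my @got;
# Feed the stream one byte at a time to simulate the worst possible splitting.
push @got, take_lines(\$pending, $_) for split //, $bytes;
```

Whatever is left in $pending at end of stream is the unterminated last line, which still has to be decoded separately.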

I'm somewhat new to the whole utf8 thing, and I'm still the go-to expert on the team :-S so I want to get it right :-)

Thanks,


Replies are listed 'Best First'.
Re: incremental reading of utf8 input handles by moritz (Cardinal) on Jul 06, 2012 at 18:09 UTC
"However, if one of the many buffers involved (remote libC, remote kernel, remote sshd, remote TCP stack, switch, local TCP stack, local kernel, local ssh, local libC, AnyEvent's sysread) manages to split a UTF-8 character, there is the concern that the utf8 layer will not handle this"

Do I read "concern that the utf8 layer will not handle this" correctly as "you are worried, but haven't observed the problem so far"? I for one would not be concerned unless the problem really occurred, and would trust perl's IO layer.

In fact I've made a very simple test for this situation:

$ perl -MEncode=encode_utf8 -wE '$| = 1; my $buf = encode_utf8 chr(0xe5); print substr($buf, 0, 1); sleep 1; say substr($buf, 1)' | perl -CS -pe 1
å

This splits the å into two bytes, writes the first, sleeps a second, and then writes the second byte plus a newline. The perl process reading from the pipe decodes the input as UTF-8 (that's what the -CS does), and prints it to STDOUT again. Works fine.

$buf =~ s/^((?:[\x00-\x7f]+|[\xc0-0xff][\x80-\xbf]+)*)//;
my $newtext = $1;
utf8::decode($newtext);
$text_so_far .= $newtext;

The regex doesn't look right to me. If you have a character that is encoded as three or more bytes, the [\xc0-0xff][\x80-\xbf]+ part could match only the first two bytes, and you wouldn't detect if the third was missing.
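One way to address the objection about three-byte (and longer) characters is to inspect only the tail of the buffer: a lead byte announces how many continuation bytes must follow, so a tail with too few of them is an incomplete character. The sketch below assumes that approach; split_complete is a hypothetical name, and the pattern checks structure only (it does not reject overlong forms or surrogates, so the decode step should still validate).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: split a byte buffer into (complete prefix, incomplete tail).
sub split_complete {
    my ($bytes) = @_;
    # A lead byte followed by fewer continuation bytes than it announces,
    # sitting at the very end of the buffer.
    if ($bytes =~ /( [\xc2-\xdf]
                   | [\xe0-\xef][\x80-\xbf]?
                   | [\xf0-\xf4][\x80-\xbf]{0,2} ) \z/x) {
        my $tail = $1;
        return (substr($bytes, 0, length($bytes) - length($tail)), $tail);
    }
    return ($bytes, '');
}

# Made-up sample: "a" plus one three-byte character, cut after the second byte of it.
my $bytes = encode('UTF-8', "a\x{65e5}");             # 61 E6 97 A5
my ($head, $tail) = split_complete(substr($bytes, 0, 3));
my $text = decode('UTF-8', $head, Encode::FB_CROAK);  # decode only the safe prefix
```

The tail is then prepended to the next chunk before calling split_complete again.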
Re: incremental reading of utf8 input handles by BrowserUk (Patriarch) on Jul 06, 2012 at 18:04 UTC
Warning: I'm almost certainly far more ignorant of matters unicode than you!

But my innate reaction to reading your post is: don't use variable-byte characters.

That is, if you are in full control of the text you are manipulating: use UTF-32 for all your text. It can represent everything, and it is easy to determine whether you have complete characters.

And even if some of the text comes from other sources, you will have to know what encoding it is in when you get it, and there should be nothing stopping you from re-encoding it to UTF-32 whenever it is output from your processes.

Just a thought into the melting pot.
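To illustrate why fixed-width text makes the completeness check easy, here is a minimal sketch. It assumes the producer really emits UTF-32LE without a BOM, which is an assumption about the subprocess rather than anything stated in the thread; take_utf32le is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode only the whole 4-byte units of a UTF-32LE buffer
# and return (decoded text, leftover bytes).
sub take_utf32le {
    my ($bytes) = @_;
    my $whole = length($bytes) - length($bytes) % 4;   # largest multiple of 4
    my $head  = substr($bytes, 0, $whole);
    return (decode('UTF-32LE', $head), substr($bytes, $whole));
}

# Made-up sample: two characters (8 bytes), of which only 6 have arrived so far.
my $bytes = encode('UTF-32LE', "\x{65e5}\x{672c}");
my ($text, $rest) = take_utf32le(substr($bytes, 0, 6));
```

The price is four bytes per character on the wire and the need to control the encoding of every producer.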
Re: incremental reading of utf8 input handles by Khen1950fx (Canon) on Jul 07, 2012 at 00:33 UTC
Here's a test that I tried. The first two tests fail as they should. The last 2 tests succeed.

#!/usr/bin/perl -l

BEGIN {
    $| = 1;
    $^W = 1;
    $ENV{'TEST_VERBOSE'} = 1;
}

use strict;
use warnings;
use Test::utf8;
use Test::More tests => 4;
use Encode qw/:all/;

my $invalid = "\x{e9}";
Encode::_utf8_on($invalid);
ok(is_valid_string($invalid));

my ($buffer, $string) = ('', '');
while (read $invalid, $buffer, 256, length $string) {
    $invalid .= decode( 'utf-8-strict', $buffer, Encode::FB_QUIET );
}
Encode::_utf8_on($string);
ok(is_valid_string($string));

The $buffer should hold any partial incrementation.
Re^2: incremental reading of utf8 input handles by Tanktalus (Canon) on Jul 09, 2012 at 17:06 UTC
This looks very interesting. Thanks.

Unfortunately, your test doesn't seem to work here. I added a diag "[", explain($string), "]"; to the end, and I get no output (i.e., an empty []). Also, I tried adding a diag '.'; inside the while loop to see how many times it loops, and nothing came out.

You're also not reading from $invalid: you need to open my $fh, '<', \$invalid; and then you can read from $fh. But though it now reads one time, the length of the output still seems to be zero.

I'll see if I can adapt this test to actually have valid utf8 after multiple reads and see what comes of it. Somewhere to start from anyway :-)
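A working adaptation along those lines might look like the sketch below. It follows the pattern the Encode documentation gives for FB_QUIET: decode as much as possible and leave the unprocessed partial character in the buffer, with read appending at offset length $buffer. The function name decode_in_chunks, the sample text, and the deliberately tiny chunk size are illustrative assumptions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: read UTF-8 bytes from an in-memory handle in small
# chunks and decode incrementally. Returns (decoded text, undecoded leftover).
sub decode_in_chunks {
    my ($bytes, $chunk_size) = @_;
    open my $fh, '<:raw', \$bytes or die "open: $!";
    my ($buffer, $string) = ('', '');
    # Append each chunk after whatever partial character is still in $buffer.
    while (read $fh, $buffer, $chunk_size, length $buffer) {
        # FB_QUIET stops quietly at the first undecodable byte and leaves
        # the unprocessed remainder in $buffer.
        $string .= decode('UTF-8', $buffer, Encode::FB_QUIET);
    }
    close $fh;
    return ($string, $buffer);
}

# Made-up sample: three 3-byte characters read 2 bytes at a time, so every
# character is split across reads.
my $bytes = encode('UTF-8', "\x{65e5}\x{672c}\x{8a9e}\n");
my ($text, $leftover) = decode_in_chunks($bytes, 2);
```

One caveat: FB_QUIET also stops at genuinely malformed bytes, not only at a truncated character, so a buffer that keeps growing without producing text is a sign of bad input rather than of a split.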
Re: incremental reading of utf8 input handles by The Code Captain (Initiate) on Jul 09, 2012 at 18:13 UTC
I don't think this is a problem - depending on which version of perl you are using, and provided that you are consistently using UTF8 in all code. (You don't have to use the same language but you do have to use the same character set.)

From perlunicode:

"Beginning with version 5.6, Perl uses logically-wide characters to represent strings internally. Starting in Perl 5.14, Perl-level operations work with characters rather than bytes within the scope of a use feature 'unicode_strings' (or equivalently use 5.012 or higher). (This is not true if bytes have been explicitly requested by use bytes, nor necessarily true for interactions with the platform's operating system.)"

Whenever I have used UTF8 I have not had a problem with buffers splitting, because perl itself knows that the buffer holds characters, and how many bytes are required to represent the character. Just make sure that you are consistently using the UTF8 character set.

