
Re: incremental reading of utf8 input handles

by moritz (Cardinal)
on Jul 06, 2012 at 18:09 UTC (#980344)

in reply to incremental reading of utf8 input handles

However, if one of the many buffers involved (remote libC, remote kernel, remote sshd, remote TCP stack, switch, local TCP stack, local kernel, local ssh, local libC, AnyEvent's sysread) manages to split a UTF-8 character, there is the concern that the utf8 layer will not handle this

Do I read "the concern that the utf8 layer will not handle this" correctly as "you are worried, but haven't observed the problem so far"?

I, for one, would not be concerned unless the problem had actually occurred, and would trust perl's IO layer.

In fact I've made a very simple test for this situation:

$ perl -MEncode=encode_utf8 -wE '$| = 1; my $buf = encode_utf8 chr(0xe5); print substr($buf, 0, 1); sleep 1; say substr($buf, 1)' | perl -CS -pe 1

This splits the character å (U+00E5) into its two UTF-8 bytes, writes the first, sleeps a second, and then writes the second byte plus a newline. The perl process reading from the pipe decodes the input as UTF-8 (that's what the -CS does), and prints it to STDOUT again. Works fine.
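The same experiment can be sketched as a single script, which may be easier to adapt than the shell one-liner. This is only an illustration: the pipe/fork setup and the variable names are my own, and it assumes that putting the :utf8 layer on the read handle is a fair stand-in for what -CS does to STDIN.

```perl
use strict;
use warnings;
use IO::Handle;
use Encode qw(encode_utf8);

pipe(my $reader, my $writer) or die "pipe: $!";
my $pid = fork();
die "fork: $!" unless defined $pid;

if ($pid == 0) {
    # Child: write the two bytes of U+00E5 one second apart.
    close $reader;
    binmode $writer;                 # raw bytes, no encoding layer
    $writer->autoflush(1);           # same purpose as $| = 1 in the one-liner
    my $bytes = encode_utf8(chr 0xe5);
    print {$writer} substr($bytes, 0, 1);
    sleep 1;
    print {$writer} substr($bytes, 1), "\n";
    close $writer;
    exit 0;
}

# Parent: read one line through a UTF-8 decoding layer.
close $writer;
binmode $reader, ':utf8';            # assumption: equivalent to -CS on STDIN
my $line = <$reader>;
waitpid $pid, 0;
chomp $line;

# Report code point and length instead of printing the character itself.
printf "read %d character(s), first is U+%04X\n", length $line, ord $line;
```

If the layer copes with the split, the line read back is a single character, not two stray bytes.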

$buf =~ s/^((?:[\x00-\x7f]+|[\xc0-0xff][\x80-\xbf]+)*)//; my $newtext = $1; utf8::decode($newtext); $text_so_far .= $newtext;

The regex doesn't look right to me. If you have a character that is encoded as three or more bytes, the [\xc0-0xff][\x80-\xbf]+ part could match only the first two bytes, and you wouldn't detect if the third was missing.
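To make that concrete, here is a minimal sketch of the failure and of one possible stricter pattern. It is not the original poster's code: take_complete_utf8 is a hypothetical helper, the loose pattern is reproduced with \xff assumed where the quoted code has 0xff, and the stricter pattern only checks that each lead byte is followed by the right number of continuation bytes (it does not reject overlong forms or surrogates, and the return value of utf8::decode is ignored, as in the quoted snippet).

```perl
use strict;
use warnings;

# Loose pattern, as quoted above (with \xff assumed for 0xff):
# a lead byte followed by ANY number of continuation bytes.
my $loose_buf = "\xe2\x82";          # first two of the three bytes of U+20AC
$loose_buf =~ s/^((?:[\x00-\x7f]+|[\xc0-\xff][\x80-\xbf]+)*)//;
my $loose_taken = $1;                # the incomplete character was consumed

# Hypothetical helper: remove only complete sequences from the front of the
# buffer and return them decoded; a trailing partial sequence stays behind.
sub take_complete_utf8 {
    my ($buf_ref) = @_;
    $$buf_ref =~ s/\A(
        (?: [\x00-\x7f]                    # 1 byte
          | [\xc2-\xdf][\x80-\xbf]         # 2 bytes
          | [\xe0-\xef][\x80-\xbf]{2}      # 3 bytes
          | [\xf0-\xf4][\x80-\xbf]{3}      # 4 bytes
        )*
    )//x;
    my $text = $1;
    utf8::decode($text);                   # bytes -> characters, in place
    return $text;
}

my $buf  = "a\xe2\x82";                    # "a" plus a truncated U+20AC
my $text = take_complete_utf8(\$buf);      # takes "a", leaves 2 bytes
my $left_after_first = length $buf;

$buf  .= "\xac";                           # the missing third byte arrives
$text .= take_complete_utf8(\$buf);        # now U+20AC can be decoded

printf "loose pattern consumed %d byte(s) of a partial character\n",
    length $loose_taken;
printf "strict pattern left %d byte(s) waiting, final text has %d character(s)\n",
    $left_after_first, length $text;
```

A buffer that starts with a byte that can never begin a valid sequence would stay stuck with this helper, so real code would need some error handling for that case.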
