Beefy Boxes and Bandwidth Generously Provided by pair Networks DiBona
more useful options
 
PerlMonks  

Re: incremental reading of utf8 input handles

by moritz (Cardinal)
on Jul 06, 2012 at 18:09 UTC ( #980344=note: print w/ replies, xml ) Need Help??


in reply to incremental reading of utf8 input handles

However, if one of the many buffers involved (remote libC, remote kernel, remote sshd, remote TCP stack, switch, local TCP stack, local kernel, local ssh, local libC, AnyEvent's sysread) manages to split a UTF-8 character, there is the concern that the utf8 layer will not handle this

Do I read "concernt that the utf8 layer will not handle this" correctly as "you are worried, but haven't observed the problem so far"?

I for one would not be concerned unless the problem really occured, and trust perl's IO layer.

In fact I've made a very simple test for this situation:

$ perl -MEncode=encode_utf8 -wE '$| = 1; my $buf = encode_utf8 chr(0xe +5); print substr($buf, 0, 1); sleep 1; say substr($buf, 1)' | perl -C +S -pe 1

This splits the into two bytes, writes the first, sleeps a second, and then writes the second byte plus a newline. The perl process reading from the pipe decodes the input as UTF-8 (that's what the -CS does), and prints it to STDOUT again. Works fine.

$buf =~ s/^((?:[\x00-\x7f]+|[\xc0-0xff][\x80-\xbf]+)*)//; my $newtext = $1; utf8::decode($newtext); $text_so_far .= $newtext;

The regex doesn't look right to me. If you have a character that is encoded as three or more bytes, the [\xc0-0xff][\x80-\xbf]+ part could match only the first two bytes, and you wouldn't detect if the third was missing.


Comment on Re: incremental reading of utf8 input handles
Select or Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://980344]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (14)
As of 2014-04-16 19:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (433 votes), past polls