PerlMonks  

IO::Handle Unicode and ungetc()

by coolmichael (Deacon)
on Jan 06, 2013 at 05:35 UTC (#1011831=perlquestion)
coolmichael has asked for the wisdom of the Perl Monks concerning the following question:

I think I've run into a problem with Unicode and IO::Handle. It's very likely I'm doing something wrong. I want to get and unget individual Unicode characters (not bytes) from an IO::Handle, but I'm getting a surprising error.

#!/usr/local/bin/perl
use 5.016;
use utf8;
use strict;
use warnings;

binmode(STDIN, ':encoding(utf-8)');
binmode(STDOUT, ':encoding(utf-8)');
binmode(STDERR, ':encoding(utf-8)');

my $string = qq[a Å];

my $fh = IO::File->new();
$fh->open(\$string, '<:encoding(UTF-8)');

say $fh->getc(); # a
say $fh->getc(); # SPACE
say $fh->getc(); # LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5)
$fh->ungetc(ord("Å"));
say $fh->getc(); # should be A RING again.

The error messages triggered by the ungetc() call are:

Malformed UTF-8 character (unexpected end of string) in say at unicode.pl line 21.
"\x{00c5}" does not map to utf8 at unicode.pl line 21.

But 00C5 is the correct hex code point for the character, so it should map to it.

I used a hex editor to make sure that the bytes for A-RING are correct for UTF-8.

This seems to be a problem for any two-byte character.

The final say outputs '\xC5' (literally four characters: backslash, x, C, 5).

And I've tested this by reading from files instead of scalar variables. The result is the same.

This is perl 5, version 16, subversion 2 (v5.16.2) built for darwin-2level

Edited to add: And the script is saved in UTF-8. That was the first thing I checked.

Re: IO::Handle Unicode and ungetc()
by quester (Vicar) on Jan 06, 2013 at 06:28 UTC

    You can work around this by adding a call to binmode just after the open,

    $fh->open(\$string, '<:encoding(UTF-8)'); binmode $fh;

    However, that only works if you do NOT use either of the ":utf8" or ":encoding(UTF-8)" options on binmode. I'm not that familiar with perl Unicode, but this seems vaguely bug-ish to me offhand.

    Update: tested on perl v5.14.3 built for x86_64-linux-thread-multi

Re: IO::Handle Unicode and ungetc()
by Anonymous Monk on Jan 06, 2013 at 07:53 UTC

      I don't think that's the same thing. This problem happens with on-disk files as well as in-memory filehandles.

      It looks like ord() is returning the correct Unicode code point, but ungetc() is interpreting it as a single byte instead of a code point. That seems like a different bug to me.
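
To make the code point versus byte distinction concrete, here is a small sketch (not from the original posts) that prints both views of the same character. It uses only ord() and Encode::encode, and writes the character as "\x{C5}" so the result does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = "\x{C5}";                 # LATIN CAPITAL LETTER A WITH RING ABOVE
my $cp    = ord($char);               # the Unicode code point, a single number
my $bytes = encode('UTF-8', $char);   # the UTF-8 encoding, a two-byte string

printf "code point:  U+%04X\n", $cp;
printf "UTF-8 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $bytes;
```

This prints U+00C5 for the code point and C3 85 for the bytes. If ungetc() pushes back the single byte 0xC5 rather than the sequence C3 85, that would be consistent with the "Malformed UTF-8 character" message in the question.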

Re: IO::Handle Unicode and ungetc()
by Leon Timmermans (Novice) on Jan 09, 2013 at 21:42 UTC
    It seems ungetc is completely Unicode-unaware. Looks like this can be fixed easily though.
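
Until ungetc() handles characters, one possible approach is to avoid it entirely and keep pushed-back characters in a Perl array. The sketch below is only an illustration: the CharReader package and its methods are hypothetical names invented here, not part of IO::Handle or any CPAN module. It assumes getc on a handle opened with :encoding(UTF-8) returns whole characters, as the question's own output suggests.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of IO::Handle): pushed-back characters
# live in a Perl array, so ungetc() on the handle is never called.
package CharReader;

sub new {
    my ($class, $fh) = @_;
    return bless { fh => $fh, pushback => [] }, $class;
}

# Return the next character: pushed-back ones first, then the handle.
sub getc {
    my ($self) = @_;
    return pop @{ $self->{pushback} } if @{ $self->{pushback} };
    my $fh = $self->{fh};
    return CORE::getc($fh);    # undef at end of file
}

# Push back a character (a string, not an ordinal).
sub ungetc {
    my ($self, $char) = @_;
    push @{ $self->{pushback} }, $char;
    return;
}

package main;

my $string = "a \xC3\x85";    # UTF-8 bytes for "a", SPACE, U+00C5
open my $fh, '<:encoding(UTF-8)', \$string or die "open: $!";

my $reader = CharReader->new($fh);
my $first  = $reader->getc;   # a
my $second = $reader->getc;   # SPACE
my $third  = $reader->getc;   # U+00C5
$reader->ungetc($third);
my $again  = $reader->getc;   # U+00C5 again, from the pushback array

printf "U+%04X\n", ord $again;
```

Because the pushback array is used as a stack, several characters can be pushed back and are returned in reverse order of the ungetc calls. The trade-off is that any code reading directly from the underlying handle will not see the pushed-back characters.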
