http://www.perlmonks.org?node_id=217934

jkahn has asked for the wisdom of the Perl Monks concerning the following question:

I've been trying to get UTF-8 encoded files to read in properly, and to be handled with character semantics after loading. It seems to me that the first two printouts should be the same, but instead the string loaded from the file while the utf8 pragma was in scope (line 2 of the output) reports the wrong length, or so it appears.
#!perl -w
use warnings;
use strict;

{
    use utf8;
    my $string = '&#601;'; # this is a schwa in UTF-8, darned handy in linguistics
    print length $string,"\t",$string, "\n";

    my $filestring = <DATA>;
    chomp $filestring;
    print length $filestring, "\t", $filestring, "\n";
    # seems like it should print "1" here... but it prints 2!
}

{
    my $string = '&#601;';
    print length $string,"\t",$string, "\n";

    my $filestring = <DATA>;
    chomp $filestring;
    print length $filestring, "\t", $filestring, "\n";
}
__DATA__
&#601;
&#601;
Note it wasn't funny ampersands in the data, but an actual utf-8 character (the upside down e, U+0259 LATIN SMALL LETTER SCHWA). (darn conversions!)

Here are the results (as pre):

1	ə
2	ə
2	ə
2	ə
It's the second line that really surprises me... shouldn't that be a '1'? The only apparent difference is that it was read off a filehandle. How can I "reset" that data to be utf8?

Here's my version of Perl (I used pre tags so that d/l code would work!):

C:\>perl -v

This is perl, v5.6.1 built for MSWin32-x86-multi-thread
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2001, Larry Wall

Binary build 633 provided by ActiveState Corp. http://www.ActiveState.com
Built 21:33:05 Jun 17 2002


Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'.  If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.
Anybody have any idea what's wrong here or why it gets the length wrong?
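
To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes Perl 5.8 or later with the core Encode module, which is not the 5.6.1 build shown above, and the variable names are purely illustrative. It hard-codes the two UTF-8 bytes of U+0259 to stand in for a line read from an undecoded filehandle, then decodes them into a character string:

```perl
use strict;
use warnings;
use Encode qw(decode);   # assumption: Perl 5.8+, where Encode ships with core

# The two raw UTF-8 bytes for U+0259 (schwa), standing in for what an
# undecoded filehandle such as <DATA> would hand back.
my $bytes = "\xC9\x99";

# Decode the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

print length($bytes), "\n";   # counts bytes: 2
print length($chars), "\n";   # counts characters: 1
```

If this reading of the problem is right, the utf8 pragma only affects literals in the source text, and data arriving from a filehandle stays as bytes until something decodes it. On 5.8 and later, putting an encoding layer on the handle before reading (for example `binmode(DATA, ':encoding(UTF-8)')`) should do the same job as the explicit decode.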