Setting UTF-8 mode on filehandle reads?

by jkahn (Friar)
on Dec 05, 2002 at 23:22 UTC (#217934, perlquestion)
jkahn has asked for the wisdom of the Perl Monks concerning the following question:

I've been trying to get utf-8 encoded files to read in properly, and to parse with character semantics after loading. It seems to me that the first two printouts should be the same, but instead the one loaded from the file while the utf8 pragma was in scope (line 2) is handling length wrong, or so it appears.
#!perl -w
use warnings;
use strict;

{
    use utf8;
    my $string = '&#601;'; # this is a schwa in UTF-8, darned handy in linguistics
    print length $string,"\t",$string, "\n";
    my $filestring = <DATA>;
    chomp $filestring;
    print length $filestring, "\t", $filestring, "\n";
    # seems like it should print "1" here... but it prints 2!
}
{
    my $string = '&#601;';
    print length $string,"\t",$string, "\n";
    my $filestring = <DATA>;
    chomp $filestring;
    print length $filestring, "\t", $filestring, "\n";
}
__DATA__
&#601;
&#601;
Note it wasn't funny ampersands in the data, but an actual utf-8 character (the upside down e, U+0259 LATIN SMALL LETTER SCHWA). (darn conversions!)

Here's the results (as pre):

1	ə
2	ə
2	ə
2	ə
It's the second line that really surprises me... shouldn't that be a '1'? The only apparent difference is that it was read off a filehandle. How can I "reset" that data to be utf8?

Here's my version of Perl (I used pre tags so that d/l code would work!):

C:\>perl -v

This is perl, v5.6.1 built for MSWin32-x86-multi-thread
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2001, Larry Wall

Binary build 633 provided by ActiveState Corp. http://www.ActiveState.com
Built 21:33:05 Jun 17 2002


Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'.  If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.
Anybody have any idea what's wrong here or why it gets the length wrong?

Re: Setting UTF-8 mode on filehandle reads?
by diotalevi (Canon) on Dec 06, 2002 at 01:03 UTC

    I spent a while reading Perl's Unicode documentation, and it looks like you need to apply Unicode attributes to the filehandle. My 5.6.1 manual only mentions it as being difficult and never actually documents how to make this work. Stepping up to the 5.8.0 documentation yields the following gems: binmode DATA, ':utf8' for already opened handles, and open(my $fh, '<:utf8', 'anything') for new files. open can be made to have Unicode semantics by default with use open ':utf8';. You can read this yourself in the perluniintro document. Perhaps someone here can fill in what the mystery 5.6.1 incantation is.
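
    A minimal sketch of the 5.8-style approach, assuming Perl 5.8 or later (it does not help on 5.6.1). To stay self-contained it reads from an in-memory filehandle rather than a real file; the variable names are illustrative only. The same bytes are read once without a layer and once through ':utf8', and the lengths are compared.

```perl
use strict;
use warnings;

# Raw UTF-8 bytes for U+0259 (LATIN SMALL LETTER SCHWA) plus a newline.
my $bytes = "\xC9\x99\n";

# No layer: the line comes back as a string of two bytes.
open(my $raw, '<', \$bytes) or die "open: $!";
my $raw_line = <$raw>;
chomp $raw_line;
close $raw;

# With the :utf8 layer: the same bytes are read as one character.
open(my $dec, '<:utf8', \$bytes) or die "open: $!";
my $dec_line = <$dec>;
chomp $dec_line;
close $dec;

my $raw_len = length $raw_line;
my $dec_len = length $dec_line;
print "without layer: $raw_len, with :utf8 layer: $dec_len\n";
```

    For an already opened handle such as DATA, the binmode form quoted above would take the place of the layer in open.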

    __SIG__ use B; printf "You are here %08x\n", unpack "L!", unpack "P4", pack "L!", B::svref_2object(sub{})->OUTSIDE;
Re: Setting UTF-8 mode on filehandle reads?
by grantm (Parson) on Dec 06, 2002 at 01:14 UTC

    If you were using Perl 5.8, I'd suggest pushing an encoding layer when you opened the file (or after with binmode). As you're not, I won't.

    Here's a quick script that reads a file line-by-line and uses pack to set the UTF-8 flag on each string read in. After that flag is set, character semantics work as expected for wide characters that were read in from the file:

    use utf8;
    use CGI::Carp qw(fatalsToBrowser);

    print "Content-type: text/html; charset=utf-8\n\n";

    open(FILE, "<", "/path/to/utf8/file.txt") || die "$!";
    print "<pre>\n";
    while(<FILE>) {
      chomp;
      $_ = set_utf($_);
      my $len = length($_);    # count of chars not bytes
      print "$_", ' ' x (72 - $len), "|\n";
    }
    print "</pre>\n";

    sub set_utf {
      return pack "U0a*", join '', @_;
    }

    I fashioned the script as a CGI script so that you can view the output in your browser, which understands UTF-8 characters (whereas your TTY might not). Given a UTF-8 text file with lines shorter than 72 characters, this should pad each line out to 72 characters with spaces and then append a '|'. If character semantics are not in force, length will count bytes rather than characters and the '|'s won't line up.
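
    For comparison, here is a sketch of the same bytes-versus-characters effect using the Encode module instead of the pack trick. This assumes Perl 5.8 or later, where Encode is a core module, so it is not a 5.6.1 answer; the sample string and the width of 10 are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "café" as raw UTF-8 bytes: the final letter takes two bytes.
my $bytes    = "caf\xC3\xA9";
my $byte_len = length $bytes;

# Decode the bytes into a character string.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;

# Pad using the character count so the trailing '|' lines up.
my $width  = 10;    # hypothetical column width
my $padded = $chars . (' ' x ($width - $char_len)) . '|';

binmode(STDOUT, ':utf8');    # encode characters back to UTF-8 on output
print "bytes: $byte_len, chars: $char_len\n";
print "$padded\n";
```

    Padding with the byte count instead would leave the line one column short, which is the misalignment described above.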

       open(FILE, "<", "/path/to/utf8/file.txt") || die "$!";

      Why not use Perl 5.8.0 and simply specify an I/O layer with the three-argument form of open?

      open(FILE, '<:utf8', '/path/to/utf8/file.txt') || die $!;
        Why not use Perl 5.8.0...

        um, because the original poster was looking for a 5.6.1 solution. (And if you actually read my reply you'd see that I suggested exactly what you propose in the first line!)

Re: Setting UTF-8 mode on filehandle reads?
by pg (Canon) on Dec 06, 2002 at 15:42 UTC
    Perl's current strategy on UTF-8 is to make it work with the least modification. perlunicode states clearly that Perl does not implement the Unicode standard from cover to cover. Another Perl document (I forget which one) says that Perl will remain like this until Unicode is inescapable. This is something we have to be aware of all the time.

    Perl's way of handling Unicode I/O is good evidence of this strategy. A layer called ':utf8' is inserted between your program and your file descriptors. I would expect this to be totally revised, when Perl 6 comes out, otherwise I feel some real worry here.

    To make this ':utf8' layer work, you have to add it explicitly. It is separate from the 'utf8' pragma, which does not affect your I/O at all.

    Examples for the ':utf8' layer itself:
        at open:     open(FILEHANDLE, "<:utf8", "abc.utf8.txt")
        after open:  binmode(STDOUT, ":utf8")
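
    The output direction can be sketched the same way, assuming Perl 5.8 or later. An in-memory filehandle stands in for a real file or STDOUT here so the written bytes can be inspected; the names are illustrative. The character is written with an escape, so the 'utf8' pragma is not needed at all, which underlines that the pragma and the layer are separate things.

```perl
use strict;
use warnings;

# One character, U+0259, written as an escape so this source stays ASCII.
my $schwa = "\x{0259}";

# Write it through the :utf8 layer into an in-memory buffer.
my $buffer = '';
open(my $out, '>:utf8', \$buffer) or die "open: $!";
print {$out} $schwa;
close $out;

my $char_len = length $schwa;     # characters in the string
my $byte_len = length $buffer;    # bytes the layer actually wrote
print "chars: $char_len, bytes written: $byte_len\n";
```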
      I would expect this to be totally revised, when Perl 6 comes out, otherwise I feel some real worry here.

      I'm not quite sure what you're getting at here. You will always need to tell Perl that you want it to use UTF-8 encoding when you read a specific file. Sure, in the future some of the region-specific encodings such as Latin-1 might lose popularity to Unicode. But if Perl assumed every file was a UTF-8 character stream, then Perl would no longer be able to read binary byte streams (or even UTF-16 encoded ones).

      The XML spec provides a way for a program to unambiguously determine the encoding of an XML document. In the absence of this type of in-band information in other text file formats, you will need to specify an encoding.

      As you point out, 5.8 provides the very powerful IO layer model for dealing with this and other problems. I don't expect IO layers to disappear in 6.0. And for people stuck with 5.6, pack hacks do provide a workaround.

      What is expected to change in the future is that Perl will assume your script itself is UTF-8 encoded. Assuming you use a UTF-8 aware editor, that will allow you to include non-ASCII characters in string literals simply by typing them. At the moment, if you want to do that, you have to say 'use utf8'. In the future that will be assumed (and to quote the docs, "'use utf8' will become a noop").

        "You will always need to tell Perl that you want it to use UTF-8 encoding" Unless that's the default. Of course, if you want to read a binary file, you can always ask for that, but there's no reason in this day and age that UTF8 encoding wouldn't be the default.
