
UTF-8 for Everything

by shawnhcorey (Pilgrim)
on Jul 06, 2013 at 12:48 UTC (#1042907=perlquestion)
shawnhcorey has asked for the wisdom of the Perl Monks concerning the following question:

What is the best way to get Perl to use UTF-8 for everything, except when I explicitly state otherwise? I was using use encoding qw( UTF-8 ); but in Perl 5.18 it's deprecated. I'm using this as a stop-gap:

use open qw( :encoding(utf8) );
binmode STDIN, qw{ :encoding(UTF-8) };
binmode STDOUT, qw{ :encoding(UTF-8) };
binmode STDERR, qw{ :encoding(UTF-8) };

Surely, there must be a more elegant way.

And a related question: In the HTML::TreeBuilder documentation, it says, "When you pass a filename to "parse_file", HTML::Parser opens it in binary mode, which means it's interpreted as Latin-1 (ISO-8859-1). If the file is in another encoding, like UTF-8 or UTF-16, this will not do the right thing."

What would be a good replacement for HTML::TreeBuilder, keeping in mind that not all HTML pages are XML compliant?

Re: UTF-8 for Everything
by Khen1950fx (Canon) on Jul 06, 2013 at 13:37 UTC
    Have you looked at utf8::all?
    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8::all;

      Sadly, not part of the standard modules. ☹

        But everything utf8::all does is achievable with core modules. Take a look at the utf8::all source code and replicate what it does.
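A minimal sketch of the core-only approach described above, assuming the script itself is saved as UTF-8 and that command-line arguments arrive UTF-8 encoded. It covers source literals, the default I/O layers, the standard handles, and @ARGV; it is an approximation of what utf8::all sets up, not a complete reproduction of it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use utf8;                               # the source code is UTF-8
use open qw( :encoding(UTF-8) :std );   # default layers for open(), plus STDIN/STDOUT/STDERR
use Encode qw( decode );

# Command-line arguments arrive as bytes; decode them into character strings.
# Assumption: the arguments are UTF-8 encoded.
@ARGV = map { decode( 'UTF-8', $_ ) } @ARGV;

print "Arguments: @ARGV\n";
```

Because both utf8 and open are lexically scoped, these lines have to appear in every file that needs them.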

        package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
Re: UTF-8 for Everything
by sundialsvc4 (Monsignor) on Jul 06, 2013 at 18:38 UTC

    Could you please elaborate, Toby? I’m a little surprised at the notice of use encoding being deprecated, and I don’t quite know what you mean about “is achievable with core modules.” It would be very helpful if this thread could end with, “this is the situation and this is what you should do now, for example ...” in a wee bit more cook-book detail. Thanks in advance.

Re: UTF-8 for Everything
by duelafn (Priest) on Jul 06, 2013 at 20:16 UTC

    Include the :std argument to your use open line to auto-binmode the STD*.

    Pass a filehandle to HTML::TreeBuilder's parse_file() instead of a file name.
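A rough sketch of both suggestions together, assuming HTML::TreeBuilder is installed from CPAN. The sample file and its content are hypothetical and only there so the example runs on its own; in practice you would open your own HTML file in the same way.

```perl
use strict;
use warnings;
use utf8;                               # this script is saved as UTF-8
use open qw( :encoding(UTF-8) :std );   # STDIN/STDOUT/STDERR and default open() layers
use File::Temp qw( tempfile );
use HTML::TreeBuilder;                  # CPAN module, not in core

# Write a small UTF-8 sample page (hypothetical content).
my ( $out, $file ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
binmode $out, ':encoding(UTF-8)';
print {$out} "<html><head><title>Café ☕</title></head><body><p>Hi</p></body></html>";
close $out;

# Open the file ourselves so Perl decodes the UTF-8 before the parser sees it.
open my $in, '<:encoding(UTF-8)', $file
    or die "Cannot open $file: $!";

my $tree = HTML::TreeBuilder->new;
$tree->parse_file($in);                 # a filehandle instead of a file name
close $in;

my $title = $tree->look_down( _tag => 'title' );
my $text  = $title ? $title->as_text : '';
print "$text\n";

$tree->delete;                          # release the tree's memory
```

The explicit '<:encoding(UTF-8)' layer is redundant with the use open line here, but spelling it out makes the intent visible at the call site.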

    Good Day,
        Dean

      Thanks, I missed that one.

Re: UTF-8 for Everything
by vsespb (Hermit) on Jul 06, 2013 at 21:40 UTC
Re: UTF-8 for Everything
by perl-diddler (Hermit) on Jul 07, 2013 at 02:28 UTC
    When HTML::Parser opens a file, one of the first things at the top of the file should be a header that specifies the encoding. HTML::Parser has to switch its encoding according to that header, or it would fail badly on a lot of the HTML out there. Switching decoding on the fly is a requirement of HTML parsers.
Re: UTF-8 for Everything
by Jim (Curate) on Jul 07, 2013 at 05:15 UTC
    I was using use encoding qw( UTF-8 ); but in Perl 5.18 it's deprecated.

    As vsespb said, do this instead:

    use utf8;

    This pragma tells Perl that your source code is encoded in UTF-8, which allows you to use non-ASCII characters (that is, characters outside the Basic Latin block of Unicode) inside your Perl script.
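A small illustration of what the pragma changes, assuming the script is saved as UTF-8 with a precomposed é (U+00E9). Under those assumptions the same literal should count as 4 characters with the pragma on and 5 bytes with it off.

```perl
use strict;
use warnings;

my ( $with, $without );
{
    use utf8;    # literals in this block are decoded as UTF-8 characters
    $with = length "café";
}
{
    no utf8;     # literals in this block are treated as raw bytes
    $without = length "café";
}

print "with utf8: $with, without utf8: $without\n";
```

Note that use utf8 only affects the source text itself; it does not set any I/O layers, which is why the use open line is still needed.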

    I'm using this as a stop-gap:
    use open qw( :encoding(utf8) );
    binmode STDIN, qw{ :encoding(UTF-8) };
    binmode STDOUT, qw{ :encoding(UTF-8) };
    binmode STDERR, qw{ :encoding(UTF-8) };

    As duelafn said, do this instead:

    use open qw( :encoding(UTF-8) :std );

    Carefully read Tom Christiansen's (tchrist) brilliant and exhaustive Stack Overflow post Go Thou and Do Likewise. Pay particular attention to the first section titled Simplest Rx:  7 Discrete Recommendations. These seven recommendations are essentially the answer to your question, "What is the best way to get Perl to use UTF-8 for everything?" Then read jrockway's excellent followup post.

    Tom's Stack Overflow post evolved into a presentation that he gave at OSCON 2011. The slides are here.

      I only regret that I have but one up-vote to lend to your posting, and especially the links. Thank you very much.
