PerlMonks  

Re: HTML::Entities and Unicode quotes

by Your Mother (Bishop)
on Aug 20, 2011 at 02:06 UTC (#921354)


in reply to HTML::Entities and Unicode quotes

This will, I hope, explain what’s going on:

use warnings;
use strict;
use Encode;
use HTML::Entities;

# Raw UTF-8 bytes for the two curly quotes, not yet decoded.
my $str = "\xe2\x80\x9cquotes\xe2\x80\x9d";
print "Is $str UTF-8? ", Encode::is_utf8($str) ? "Yes!\n" : "No...\n";

# Decode the bytes into characters; croak on malformed input.
$str = decode("UTF-8", $str, Encode::FB_CROAK);
binmode STDOUT, ":encoding(UTF-8)";
print "It's still $str... UTF-8 now? ", Encode::is_utf8($str) ? "Yes!\n" : "No...\n";

# The same text written directly as Unicode code points.
my $wide_chars = "\x{201C}quotes\x{201D}";
print "How about this version: $wide_chars? ", Encode::is_utf8($wide_chars) ? "Yes!\n" : "No...\n";

print "Entities: ", encode_entities($str), $/;

__END__
Is “quotes” UTF-8? No...
It's still “quotes”... UTF-8 now? Yes!
How about this version: “quotes”? Yes!
Entities: &ldquo;quotes&rdquo;

Update: changed $non_combining to $wide_chars as the name was misleading.
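
The practical difference can also be shown without looking at any internal flag. The following is a minimal sketch, not part of the original post, that passes the same text to encode_entities twice: once as undecoded bytes and once after decoding. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# The six quote bytes plus the word, exactly as they sit in the file.
my $bytes = "\xe2\x80\x9cquotes\xe2\x80\x9d";

# Undecoded: each byte is treated as a separate Latin-1 character.
my $from_bytes = encode_entities($bytes);

# Decoded: the byte sequences become U+201C and U+201D first.
my $chars      = decode('UTF-8', $bytes);
my $from_chars = encode_entities($chars);

print "bytes: $from_bytes\n";
print "chars: $from_chars\n";
```

Only the decoded string yields the named quote entities; the undecoded one is encoded byte by byte.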

Replies are listed 'Best First'.
Re^2: HTML::Entities and Unicode quotes
by ikegami (Pope) on Aug 20, 2011 at 06:22 UTC
    The internal storage format (returned by is_utf8) has absolutely nothing to do with this.
    use Encode;
    use HTML::Entities;

    my $str = "\xe2\x80\x9cquotes\xe2\x80\x9d";

    utf8::downgrade($str);
    print Encode::is_utf8($str) ? 1 : 0, " ", encode_entities($str), "\n";

    utf8::upgrade($str);
    print Encode::is_utf8($str) ? 1 : 0, " ", encode_entities($str), "\n";
    0 &acirc;&#128;&#156;quotes&acirc;&#128;&#157;
    1 &acirc;&#128;&#156;quotes&acirc;&#128;&#157;

      It was just to show what the natural state of the strings was assumed to be by perl. You artificially flipped the switch on/off—no decoding or encoding. You also showed code using the functions of utf8 which is probably a bad example to set. You know exactly what you’re doing but someone who doesn’t sees a top monk using it they think, oh, that must be a good idea, I’ll use upgrade and downgrade to “fix” my encodings too.

        You know exactly what you’re doing but someone who doesn’t sees a top monk using it they think, oh, that must be a good idea

        You're making my point for me. is_utf8 should never ever ever ever be used. So what are you doing using it, especially in front of someone who might think it's OK to use it?

        You artificially flipped the switch on/off—no decoding or encoding.

        No, I didn't. The UTF8 flag does not indicate any such thing.

        Not only are you showing a function you shouldn't be using, you're showing how to use it incorrectly.

        You also showed code using the functions of utf8 which is probably a bad example to set.

        There's nothing wrong with the utf8:: module. There's something wrong with the is_utf8 function, though. is_utf8 should never ever ever ever be used.

        On the other hand, there's absolutely no problem using upgrade and downgrade. First, they're not supposed to have any effect whatsoever. Second, they are required to work around bugs in Perl and XS modules.

        You're the one setting the bad example. I just had to use advanced functions to show that.

        they think, oh, that must be a good idea, I’ll use upgrade and downgrade to “fix” my encodings too.

        Good! You seem to be implying that's bad, but if upgrade or downgrade have an effect, they are the correct tool to use. They'll only make a difference when faced with a bug, and they're the only tool that will help in that situation.
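
The claim that upgrade and downgrade change only the internal storage can be checked directly. This is a small sketch under the assumption that the string contains only characters below 256; the variable names are illustrative, and utf8::is_utf8 appears here solely to show that the flag changes while the string does not.

```perl
use strict;
use warnings;

my $latin = "caf\xe9";     # four characters, all below 256
my $copy  = $latin;

utf8::upgrade($copy);      # switches the internal storage format only

# The string value itself is unchanged by the upgrade.
my $same_text   = ($copy eq $latin) ? 1 : 0;
my $same_length = (length($copy) == length($latin)) ? 1 : 0;

# Only the internal flag differs.
my $flag_before = utf8::is_utf8($latin) ? 1 : 0;
my $flag_after  = utf8::is_utf8($copy)  ? 1 : 0;

print "same text: $same_text, same length: $same_length, ",
      "flag: $flag_before -> $flag_after\n";
```

Both strings compare equal and have the same length, which is why the flag says nothing about whether a string has been decoded.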

Re^2: HTML::Entities and Unicode quotes
by tod222 (Pilgrim) on Aug 22, 2011 at 06:23 UTC
    Thank you for this excellent response. I found it quite illuminating.

    Before posting I'd spent about 30 minutes reading perlunifaq and searching here on Perlmonks without things getting much clearer. In fact, some of what I read here was a bit disconcerting; the complaints that Perl no longer 'just worked' seemed apropos.

    One source of my original confusion was that I had a file containing \xe2\x80\x9c and \xe2\x80\x9d sequences when examined using 'od -t x1 foo2' which would display correctly on Ubuntu with 'cat' in gterm. Since the Unicode table I linked showed that the sequences were valid representations of “ and ” I wondered why HTML::Entities wasn't handling it correctly, particularly when cat could.

    Thanks for pointing out Encode::is_utf8($str), as I'd been wondering if there was something like this.

    A couple of things are still puzzling me, though. One is, the \xe2\x80\x9d sequence is in an encoding. What's it called?

    The other is that I'd like for Perl to 'just work' to whatever extent possible. Is there something that can be set at the start of a script to have all Perl IO default to ":encoding(UTF-8)"?

      Thanks for pointing out Encode::is_utf8($str), as I'd been wondering if there was something like this.

      Ack! Please don't use that. It does NOT indicate whether something has been decoded or not. You have been misinformed.

      A couple of things are still puzzling me, though. One is, the \xe2\x80\x9d sequence is in an encoding. What's it called?

      It's the UTF-8 encoding of U+201D RIGHT DOUBLE QUOTATION MARK.
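
As a quick check of that answer, the sketch below encodes U+201D with Encode and prints the resulting bytes in hex. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char  = "\x{201D}";                 # RIGHT DOUBLE QUOTATION MARK
my $bytes = encode('UTF-8', $char);     # one character becomes three bytes

# Render each byte as two hex digits, like 'od -t x1' would.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $bytes;
print "$hex\n";

# Decoding the bytes gives back the original single character.
my $round_trip = decode('UTF-8', $bytes);
```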

      Is there something that can be set at the start of a script to have all Perl IO default to ":encoding(UTF-8)"?

      There is the open pragma. It's not perfect, but it'll do a lot. It can handle STDIN, STDOUT and STDERR, and it can set the default layers for open.
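
A minimal sketch of the open pragma in use follows. It is an illustration, not code from the thread: the temporary file stands in for the poster's file and is created with File::Temp only so the example is self-contained.

```perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));   # defaults for open() plus STDIN/STDOUT/STDERR
use File::Temp qw(tempfile);

# Hypothetical sample file holding the raw UTF-8 bytes from the thread.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;                      # write raw bytes, no encoding layer
print {$tmp_fh} "\xe2\x80\x9cquotes\xe2\x80\x9d";
close $tmp_fh or die "close failed: $!";

# No layer is given here, so the pragma supplies :encoding(UTF-8).
open my $in, '<', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;

print "Read ", length($line), " characters: $line\n";
```

The pragma is lexically scoped, so it only affects open calls written in the same file or block; handles opened inside other modules keep their own layers.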

        Ack! Please don't use that. It does NOT indicate whether something has been decoded or not. You have been misinformed.
        Yes, I saw that after I posted. Got distracted mid-post and when I got back didn't revisit the node to see the new replies.
        It's the UTF-8 encoding of U+201D RIGHT DOUBLE QUOTATION MARK.
        is_utf8 is even misleadingly named. I'd have called it "needs_utf8". But I won't use it.

        Regarding

        use open ':encoding(utf8)';
        in my earlier traversal of perlunifaq I saw
        Using :utf8 for input can sometimes result in security breaches, so please use :encoding(UTF-8) instead.
        in the answer to What is the difference between :encoding and :utf8? Is ':encoding(utf8)' the same as ':encoding(UTF-8)'?
        Excellent, thanks.
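
The distinction between the two spellings is visible through Encode itself. The sketch below, with illustrative variable names, asks Encode for the name behind each spelling and then shows the strict encoding rejecting a malformed byte sequence when FB_CROAK is requested.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# The two spellings resolve to two different encoding objects.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;

# A lead byte and one continuation byte, then plain ASCII: malformed UTF-8.
my $malformed = "\xe2\x80quotes";
my $ok = eval { decode('UTF-8', $malformed, Encode::FB_CROAK); 1 };
my $strict_rejected = $ok ? 0 : 1;

print "UTF-8 resolves to $strict_name, utf8 resolves to $lax_name\n";
print "malformed input rejected by strict decode: $strict_rejected\n";
```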
