
Re^2: HTML::Entities and Unicode quotes

by tod222 (Pilgrim)
on Aug 22, 2011 at 06:23 UTC


in reply to Re: HTML::Entities and Unicode quotes
in thread HTML::Entities and Unicode quotes

Thank you for this excellent response. I found it quite illuminating.

Before posting I'd spent about 30 minutes reading perlunifaq and searching here on Perlmonks without things getting much clearer. In fact, some of what I read here was a bit disconcerting; the complaints that Perl no longer 'just worked' seemed apropos.

One source of my original confusion was that I had a file which, when examined using 'od -t x1 foo2', contained \xe2\x80\x9c and \xe2\x80\x9d sequences, yet it displayed correctly on Ubuntu with 'cat' in gterm. Since the Unicode table I linked showed that the sequences were valid representations of “ and ”, I wondered why HTML::Entities wasn't handling them correctly, particularly when cat could.

Thanks for pointing out Encode::is_utf8($str), as I'd been wondering if there was something like this.

A couple of things are still puzzling me, though. One is, the \xe2\x80\x9d sequence is in an encoding. What's it called?

The other is that I'd like for Perl to 'just work' to whatever extent possible. Is there something that can be set at the start of a script to have all Perl IO default to ":encoding(UTF-8)"?

Re^3: HTML::Entities and Unicode quotes
by ikegami (Patriarch) on Aug 22, 2011 at 06:42 UTC

    Thanks for pointing out Encode::is_utf8($str), as I'd been wondering if there was something like this.

    Ack! Please don't use that. It does NOT indicate whether something has been decoded or not. You have been misinformed.

    A couple of things are still puzzling me, though. One is, the \xe2\x80\x9d sequence is in an encoding. What's it called?

    It's the UTF-8 encoding of U+201D RIGHT DOUBLE QUOTATION MARK.
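To make that concrete, here is a minimal sketch using the core Encode module. The variable names are only illustrative; the bytes are the ones from the od output mentioned above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three bytes seen in the od output.
my $bytes = "\xe2\x80\x9d";

# Decode the UTF-8 bytes into a string of Unicode characters.
my $chars = decode('UTF-8', $bytes);

printf "%d byte(s) decoded to %d character(s): U+%04X\n",
    length($bytes), length($chars), ord($chars);

# Encoding the character again gives back the original bytes.
my $round_trip = encode('UTF-8', $chars);
print join(' ', map { sprintf '%02x', ord } split //, $round_trip), "\n";
```

Modules such as HTML::Entities work on characters, so the file's bytes generally need to go through a decoding step like this one first.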

    Is there something that can be set at the start of a script to have all Perl IO default to ":encoding(UTF-8)"?

    There is the open pragma. It's not perfect, but it'll do a lot. It can handle STDIN, STDOUT and STDERR, and it can set the default layers for open.
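A minimal sketch of the pragma in use, assuming a scratch file created with File::Temp purely for the demonstration (the file and variable names are hypothetical):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Default layers for open() in this lexical scope; :std also applies
# them to STDIN, STDOUT and STDERR.
use open qw(:std :encoding(UTF-8));

# File::Temp is used only to obtain a scratch file name.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# No layer is given, so the pragma's default applies and the
# characters are encoded as UTF-8 on output.
open my $out, '>', $path or die "Cannot write $path: $!";
print {$out} "\x{201C}hi\x{201D}";
close $out or die "Cannot close $path: $!";

# An explicit layer overrides the default: read the raw bytes back.
open my $raw, '<:raw', $path or die "Cannot read $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Default layer again: the bytes are decoded back into characters.
open my $in, '<', $path or die "Cannot read $path: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "%d bytes on disk, %d characters in Perl\n",
    length($bytes), length($chars);
```

The pragma is lexically scoped, so it only affects open calls compiled within its scope; handles opened inside other modules are not covered.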

      Ack! Please don't use that. It does NOT indicate whether something has been decoded or not. You have been misinformed.

      Yes, I saw that after I posted. Got distracted mid-post and when I got back didn't revisit the node to see the new replies.

      It's the UTF-8 encoding of U+201D RIGHT DOUBLE QUOTATION MARK.

      is_utf8 is even misleadingly named. I'd have called it "needs_utf8". But I won't use it.

      Regarding

      use open ':encoding(utf8)';

      in my earlier traversal of perlunifaq I saw

      "Using :utf8 for input can sometimes result in security breaches, so please use :encoding(UTF-8) instead."

      in the answer to "What is the difference between :encoding and :utf8?" Is ':encoding(utf8)' the same as ':encoding(UTF-8)'?

        is_utf8 is even misleadingly named. I'd have called it "needs_utf8".

        As in needs to be encoded using UTF-8? No, it doesn't indicate a need for the string to be encoded, using UTF-8 or otherwise.

        It is actually accurately named, but refers to how the string is stored internally, not the content of the string.
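A small sketch of that distinction, using only the built-in utf8:: functions (the variable names are illustrative). Neither string below has been decoded from anything, yet the flag differs between them while the content is identical:

```perl
use strict;
use warnings;

my $native   = "caf\xe9";   # four characters, the last one is U+00E9
my $upgraded = $native;

# Switch the copy to Perl's internal UTF-8 storage format.
# The characters it holds do not change.
utf8::upgrade($upgraded);

my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;
my $same_content  = $native eq $upgraded     ? 1 : 0;

print "flag on original: $flag_native\n";
print "flag on upgraded: $flag_upgraded\n";
print "strings compare equal: $same_content\n";
```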

        Is ':encoding(utf8)' the same as ':encoding(UTF-8)'?

        The encoding is called UTF-8, so use "UTF-8" (case doesn't matter). I don't know how :encoding(utf8) is different, but I don't see any reason for figuring it out.
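For readers who are curious, the Encode documentation does treat the two names as different encodings: "utf8" is Perl's own looser variant, while "UTF-8" is the strict, standard one. A minimal sketch that just asks Encode what each name resolves to (variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up the encoding object behind each spelling and ask for its
# canonical name.
my $lax_name    = find_encoding('utf8')->name;
my $strict_name = find_encoding('UTF-8')->name;

print "utf8  resolves to: $lax_name\n";
print "UTF-8 resolves to: $strict_name\n";
```

This is consistent with the perlunifaq advice quoted earlier in the thread to prefer :encoding(UTF-8) for input.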

Re^3: HTML::Entities and Unicode quotes
by Anonymous Monk on Aug 22, 2011 at 06:51 UTC
      Excellent, thanks.
