Re^2: HTML::Entities and Unicode quotes

Thank you for this excellent response. I found it quite illuminating.

Before posting I'd spent about 30 minutes reading perlunifaq and searching here on PerlMonks without things getting much clearer. In fact, some of what I read here was a bit disconcerting; the complaints that Perl no longer 'just worked' seemed apropos.

One source of my original confusion was a file that, when examined with 'od -t x1 foo2', contained \xe2\x80\x9c and \xe2\x80\x9d sequences, yet displayed correctly on Ubuntu with 'cat' in gterm. Since the Unicode table I linked showed that the sequences were valid representations of “ and ”, I wondered why HTML::Entities wasn't handling them correctly, particularly when cat could.

Thanks for pointing out Encode::is_utf8($str), as I'd been wondering if there was something like this.

A couple of things are still puzzling me, though. One is, the \xe2\x80\x9d sequence is in an encoding. What's it called?

The other is that I'd like for Perl to 'just work' to whatever extent possible. Is there something that can be set at the start of a script to have all Perl IO default to ":encoding(UTF-8)"?

Re^3: HTML::Entities and Unicode quotes by ikegami (Patriarch) on Aug 22, 2011 at 06:42 UTC
> Thanks for pointing out Encode::is_utf8($str), as I'd been wondering if there was something like this.

Ack! Please don't use that. It does NOT indicate whether something has been decoded or not. You have been misinformed.

> A couple of things are still puzzling me, though. One is, the \xe2\x80\x9d sequence is in an encoding. What's it called?

It's the UTF-8 encoding of U+201D RIGHT DOUBLE QUOTATION MARK.

> Is there something that can be set at the start of a script to have all Perl IO default to ":encoding(UTF-8)"?

There is the open pragma. It's not perfect, but it'll do a lot. It can handle STDIN, STDOUT and STDERR, and it can set the default for `open`.
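A minimal sketch of what "the UTF-8 encoding of U+201D" means in practice, using only the core Encode module. This is an illustration rather than code from the thread, and the variable names are made up: the three bytes decode to a single character, and encoding that character gives the same three bytes back.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three bytes that od reported in the original post.
my $bytes = "\xe2\x80\x9d";

# Decoding turns the UTF-8 bytes into one character.
my $char = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d, code point: U+%04X\n",
    length($bytes), length($char), ord($char);

# Encoding goes the other way, back to the original bytes.
my $round_trip = encode('UTF-8', $char);
print "round trip matches\n" if $round_trip eq $bytes;
```

This should report 3 bytes, 1 character and code point U+201D. Modules such as HTML::Entities work on characters, so the decode step presumably has to happen before the string is handed over.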
Re^4: HTML::Entities and Unicode quotes by tod222 (Pilgrim) on Aug 23, 2011 at 03:46 UTC
> Ack! Please don't use that. It does NOT indicate whether something has been decoded or not. You have been misinformed.

Yes, I saw that after I posted. Got distracted mid-post and when I got back didn't revisit the node to see the new replies.

> It's the UTF-8 encoding of U+201D RIGHT DOUBLE QUOTATION MARK.

is_utf8 is even misleadingly named. I'd have called it "needs_utf8". But I won't use it.

Regarding `use open ':encoding(utf8)';`: in my earlier traversal of perlunifaq I saw

> Using :utf8 for input can sometimes result in security breaches, so please use :encoding(UTF-8) instead.

in the answer to "What is the difference between :encoding and :utf8?"

Is ':encoding(utf8)' the same as ':encoding(UTF-8)'?
Re^5: HTML::Entities and Unicode quotes by ikegami (Patriarch) on Aug 23, 2011 at 06:10 UTC
> is_utf8 is even misleadingly named. I'd have called it "needs_utf8".

As in needs to be encoded using UTF-8? No, it doesn't indicate a need for the string to be encoded, using UTF-8 or otherwise. It is actually accurately named, but refers to how the string is stored internally, not the content of the string.

> Is ':encoding(utf8)' the same as ':encoding(UTF-8)'?

The encoding is called UTF-8, so use "UTF-8" (case doesn't matter). I don't know how `:encoding(utf8)` is different, but I don't see any reason for figuring it out.
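To illustrate the "stored internally" point, here is a small sketch. It is an editorial illustration, not ikegami's code, and the variable names are made up. It uses the built-in utf8::upgrade and utf8::is_utf8 functions to produce two strings with identical content but different flags.

```perl
use strict;
use warnings;

# "\xe9" is the single character U+00E9; Perl can store it in one byte.
my $plain    = "\xe9";
my $upgraded = $plain;

# Change only the internal storage format, not the content.
utf8::upgrade($upgraded);

print "same content\n" if $plain eq $upgraded;
printf "flag on plain: %d, flag on upgraded: %d\n",
    (utf8::is_utf8($plain)    ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);
```

Both strings compare equal and neither was ever decoded from anything, yet only the second has the flag set. That is why the flag cannot tell you whether a string has been decoded.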
Re^3: HTML::Entities and Unicode quotes by Anonymous Monk on Aug 22, 2011 at 06:51 UTC
See perlrun#-C [number/list] and open

    use open                 # make these handles
        ':std',              # STDIN/STDOUT/STDERR
        'IO',                # and any I open
        ':encoding(UTF-8)';  # use strict UTF-8

And don't use is_utf8 :) See perlunitut: Unicode in Perl#What about the UTF-8 flag?
Re^4: HTML::Entities and Unicode quotes by tod222 (Pilgrim) on Aug 23, 2011 at 03:54 UTC
Excellent, thanks.

