http://www.perlmonks.org?node_id=755156

way has asked for the wisdom of the Perl Monks concerning the following question:

Hi, fellow monks,

I'm trying to understand the essence of encoding, but I couldn't find the right solution. If I write something like:

use Encode qw(is_utf8 encode decode);
binmode STDOUT,':encoding(iso8859-1)';
my $utf8 = "This's a \x{201c}test\x{201d}";
print "Is this utf8: ",is_utf8($utf8) ? "Yes" : "No", "\n";
# Here Perl says "Cannot decode string with wide characters"
print encode("iso-8859-1", decode("utf8",$utf8)), "\n";

Why does Perl say that it's utf8 but can't decode it? Is this the right function, or do I have an error? If \x{201c} is not a utf8 character, how can you tell? What are the steps to prove it? I tried from_to with a similar result, so I guess that I'm doing something wrong.

Thank you in advance

Replies are listed 'Best First'.
Re: Decoding, Encoding string, how to? (internal encoding)
by ikegami (Patriarch) on Apr 03, 2009 at 01:34 UTC

    You're confusing the internal representation (as reported by is_utf8) and the external one.

    +-----------------------------------------------------------------+
    |                                                                 |
    |                          Decoded Text                           |
    |                                                                 |
    |  +--------------------+    downgrade    +--------------------+  |
    |  | Internally encoded | --------------> | Internally encoded |  |
    |  | as UTF-8           |                 | as iso-8859-1      |  |
    |  | (is_utf8 = 1)      | <-------------- | (is_utf8 = 0)      |  |
    |  +--------------------+     upgrade     +--------------------+  |
    |                                                                 |
    +-----------------------------------------------------------------+
                 |                                       ^
                 |                                       |
          encode |                                       | decode
                 |                                       |
                 v                                       |
    +-----------------------------------------------------------------+
    |                                                                 |
    |                            Bytes or                             |
    |                          Encoded Text                           |
    |                                                                 |
    |  +--------------------+    downgrade    +--------------------+  |
    |  | Internally encoded | --------------> | Internally encoded |  |
    |  | as UTF-8           |                 | as iso-8859-1      |  |
    |  | (is_utf8 = 1)      | <-------------- | (is_utf8 = 0)      |  |
    |  +--------------------+     upgrade     +--------------------+  |
    |                                                                 |
    +-----------------------------------------------------------------+

    • upgrade refers to utf8::upgrade or an implicit upgrade.
    • downgrade refers to utf8::downgrade.
    • decode refers to Encode::decode, utf8::decode, :encoding on an input stream, etc.
    • encode refers to Encode::encode, utf8::encode, :encoding on an output stream, etc.
    • is_utf8 refers to Encode::is_utf8 or utf8::is_utf8 (which return the value of the UTF8 flag).

    • utf8::upgrade is safe to call on strings that are already upgraded.
    • utf8::downgrade is safe to call on strings that are already downgraded.
    • It is a bug to encode a string that's already encoded.
    • It is a bug to decode a string that's already decoded.
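
    As an illustration of the first two bullets (a minimal sketch added for clarity, not part of the original reply; the variable names are made up), upgrading and downgrading only change how Perl stores the string, never which characters it contains:

    ```perl
    use strict;
    use warnings;

    my $s = "caf\x{e9}";        # four characters, all below U+0100
    utf8::downgrade($s);        # make sure we start from the one-byte-per-character form

    my $t = $s;
    utf8::upgrade($t);          # same text, now stored internally as UTF-8
    utf8::upgrade($t);          # calling it again on an upgraded string is harmless

    my $flag_s = utf8::is_utf8($s) ? 1 : 0;
    my $flag_t = utf8::is_utf8($t) ? 1 : 0;
    my $same   = $s eq $t ? 1 : 0;

    # The flags differ, yet the strings compare equal and have the same length.
    print "flags: $flag_s vs $flag_t, equal: $same, length: ", length($t), "\n";

    utf8::downgrade($t);        # back to the one-byte form; still the same text
    my $flag_back = utf8::is_utf8($t) ? 1 : 0;
    print "after downgrade: $flag_back\n";
    ```

    The flag reported by is_utf8 differs between $s and $t, but eq and length see the same string, which is why the flag says nothing about whether a string has been decoded.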

    Your code should be

    use Encode qw(is_utf8 encode decode);
    binmode STDOUT,':encoding(iso-8859-1)';
    my $str = "This's a \x{201c}test\x{201d}"; # This is a "decoded" str.
    print "$str\n"; # Encoded by :encoding

    or

    use Encode qw(is_utf8 encode decode);
    my $str = "This's a \x{201c}test\x{201d}"; # This is a "decoded" str.
    print encode('iso-8859-1', "$str\n");

    Why does Perl say that it's utf8 but can't decode it?

    Perl said the internal encoding is UTF-8. You shouldn't have to care what the internal encoding is. (Unfortunately, you still need to know in some circumstances. This isn't one of those.)

    if \x{201c} is not a utf8 character

    There's no such thing as a "utf8 character" or "UTF-8 character" since utf8 and UTF-8 aren't character sets. \x{201c} generates a Unicode character (U+201C, LEFT DOUBLE QUOTATION MARK) which can be encoded using UTF-8.
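
    To make the distinction concrete, here is a small sketch (added for illustration, not from the original reply; variable names are arbitrary) showing that U+201C is a single character which UTF-8 happens to encode as three bytes:

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode decode);

    my $char  = "\x{201c}";                  # one Unicode character (decoded text)
    my $bytes = encode('UTF-8', $char);      # its UTF-8 encoding (bytes)

    # Hex dump of the encoded form.
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;

    print length($char), " character, ", length($bytes), " bytes: $hex\n";
    # prints: 1 character, 3 bytes: E2 80 9C

    # Decoding the bytes gives the original character back.
    my $back = decode('UTF-8', $bytes);
    print $back eq $char ? "round trip ok\n" : "round trip failed\n";
    ```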

      Your graphic is really helpful for understanding how encoding works in Perl.

      You have been really clear. Just one more thing: if I want to print using iso-8859-1, would downgrading work? It changes the internal encoding to iso-8859-1, so when I print the string (in the normal case) I'll have iso-8859-1 text in the output, won't I?

      I checked the examples and ran other tests based on the graphic, such as downgrading, but I couldn't print the original character (U+201C, LEFT DOUBLE QUOTATION MARK). Thinking about it, I see that it can't be represented in the iso-8859-1 charset, but I found different behaviors regarding that:

      1- If I downgrade the string, Perl dies with a message about wide characters. I guess that's important, because otherwise it could cut the internal string without notice. In fact, we can check whether it's downgradable or not using:

      my $str = "This's a \x{201c}test\x{201d}";
      unless (utf8::downgrade($str, 1)) {
          die "Isn't downgradable\n";
      }

      2- Using :encoding on an output stream, I see two warnings in this case saying that Perl can't map the character to iso-8859-1, but in the output the unmapped character appears as a string like \x{201c}.

      3- Using Encode::encode, the unmapped character is printed as a ? (question mark) and no warning is reported.

      Thank you so much, it's a great explanation.

        I see question marks, but I'm not sure if there's a question in there. You seem to have a good grasp of the concept.

        if I want to print using iso-8859-1, would downgrading work?

        You'd get the right result, at the cost of confusing your readers. You'd be saying you're doing one thing (changing the internal format) while actually doing another (changing the encoding of the string).

        I see that it can't be represented in the iso-8859-1 charset

        Correct, iso-8859-1 cannot encode U+201C. cp1252 can. cp1252 is Microsoft's extension of iso-8859-1. It's a commonly used encoding in the Windows world, which is why U+201C is encountered frequently.
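
        A quick way to see the difference between the two encodings (a sketch added for illustration, not from the original reply; it assumes the stock Encode module, which ships both encodings):

        ```perl
        use strict;
        use warnings;
        use Encode qw(encode);

        my $str = "\x{201c}test\x{201d}";

        # cp1252 has slots for the curly quotes: 0x93 and 0x94.
        my $cp1252     = encode('cp1252', $str);
        my $cp1252_hex = join ' ', map { sprintf '%02X', ord } split //, $cp1252;

        # iso-8859-1 has no slot for them; with no third argument, Encode
        # silently substitutes a question mark.
        my $latin1 = encode('iso-8859-1', $str);

        print "cp1252:     $cp1252_hex\n";   # prints: cp1252:     93 74 65 73 74 94
        print "iso-8859-1: $latin1\n";       # prints: iso-8859-1: ?test?
        ```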

        we can check if it's downgradable or not using utf8::downgrade($str, 1)

        Indeed. I have used that very code to make sure a sub was only given bytes before calling a function that expects to only get bytes. At the same time, it makes sure the bytes aren't internally encoded as UTF-8. Most XS functions can't handle that (which is really a bug in the XS function).
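
        A guard of that kind might look like the sketch below (added for illustration; as_bytes is a hypothetical helper name, not the code referred to above):

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: returns a copy of its argument that holds only
        # characters 0..255 and is not stored internally as UTF-8, or dies.
        sub as_bytes {
            my ($buf) = @_;                  # work on a copy of the argument
            utf8::downgrade($buf, 1)         # true second argument: return false instead of dying
                or die "Expected bytes, got a wide character\n";
            return $buf;
        }

        my $upgraded = "caf\x{e9}";
        utf8::upgrade($upgraded);            # internally UTF-8, but still only bytes

        my $bytes = as_bytes($upgraded);     # succeeds, and comes back downgraded
        print utf8::is_utf8($bytes) ? "still upgraded\n" : "downgraded\n";

        my $ok = eval { as_bytes("\x{201c}"); 1 };   # U+201C cannot be a byte
        print $ok ? "accepted\n" : "rejected: $@";
        ```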

        Using Encode::encode, the unmapped character is printed as a ? (question mark) and no warning is reported

        How encode handles errors is configurable using its third parameter.
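
        For example (a sketch added for illustration, not from the original reply), the CHECK constants documented in Encode select between substituting, escaping, and dying. LEAVE_SRC is included with FB_CROAK here because encode may otherwise modify its input when a CHECK value is given:

        ```perl
        use strict;
        use warnings;
        use Encode qw(encode);

        my $str = "\x{201c}test\x{201d}";

        # Default: unmappable characters become a substitution character.
        my $default = encode('iso-8859-1', $str);

        # FB_PERLQQ: unmappable characters become Perl-style \x{...} escapes.
        my $perlqq = encode('iso-8859-1', $str, Encode::FB_PERLQQ());

        # FB_CROAK: die on the first unmappable character.
        my $ok = eval {
            encode('iso-8859-1', $str, Encode::FB_CROAK() | Encode::LEAVE_SRC());
            1;
        };

        print "default: $default\n";
        print "perlqq:  $perlqq\n";
        print "croak:   ", ($ok ? "no error\n" : "died: $@");
        ```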

Re: Decoding, Encoding string, how to?
by Marshall (Canon) on Apr 03, 2009 at 02:03 UTC
    I think that ikegami is on it!
    I am studying these pages. I don't understand it all yet, but the difference between ISO_8859-1 and UTF-8 appears relevant.
    http://en.wikipedia.org/wiki/ISO_8859-1#Codepage_layout
    http://en.wikipedia.org/wiki/UTF-8

    This subject can get complicated.

      Yes, I think so too, thank you.