Re: Decoding, Encoding string, how to? (internal encoding)

by ikegami (Pope)
on Apr 03, 2009 at 01:34 UTC (#755158)


in reply to Decoding, Encoding string, how to?

You're confusing the internal representation (as reported by is_utf8) and the external one.

+-----------------------------------------------------------------+
|                                                                 |
|                          Decoded Text                           |
|                                                                 |
|  +--------------------+    downgrade    +--------------------+  |
|  | Internally encoded | --------------> | Internally encoded |  |
|  | as UTF-8           |                 | as iso-8859-1      |  |
|  | (is_utf8 = 1)      | <-------------- | (is_utf8 = 0)      |  |
|  +--------------------+     upgrade     +--------------------+  |
|                                                                 |
+-----------------------------------------------------------------+
                   |                           ^
                   |                           |
            encode |                           | decode
                   |                           |
                   v                           |
+-----------------------------------------------------------------+
|                                                                 |
|                            Bytes or                             |
|                          Encoded Text                           |
|                                                                 |
|  +--------------------+    downgrade    +--------------------+  |
|  | Internally encoded | --------------> | Internally encoded |  |
|  | as UTF-8           |                 | as iso-8859-1      |  |
|  | (is_utf8 = 1)      | <-------------- | (is_utf8 = 0)      |  |
|  +--------------------+     upgrade     +--------------------+  |
|                                                                 |
+-----------------------------------------------------------------+

  • upgrade refers to utf8::upgrade or an implicit upgrade.
  • downgrade refers to utf8::downgrade.
  • decode refers to Encode::decode, utf8::decode, :encoding on an input stream, etc.
  • encode refers to Encode::encode, utf8::encode, :encoding on an output stream, etc.
  • is_utf8 refers to Encode::is_utf8 or utf8::is_utf8 (which return the value of the UTF8 flag).

  • utf8::upgrade is safe to call on strings that are already upgraded.
  • utf8::downgrade is safe to call on strings that are already downgraded.
  • It is a bug to encode a string that's already encoded.
  • It is a bug to decode a string that's already decoded.
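The rules above can be checked with a small sketch. It assumes a made-up sample string ("caf\x{e9}") and uses only the built-in utf8:: functions and Encode::encode. It shows that upgrading and downgrading leave the text unchanged while only the flag flips, and that encoding twice produces a different (wrong) byte string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A decoded string whose characters all fit in a single byte (U+00E9 is e-acute).
my $text = "caf\x{e9}";
my $copy = $text;

# Upgrading changes only the internal storage, not the characters.
utf8::upgrade($copy);
utf8::upgrade($copy);    # harmless: the string is already upgraded
my $up_flag  = utf8::is_utf8($copy) ? 1 : 0;
my $up_equal = ($copy eq $text) ? 1 : 0;

# Downgrading switches the storage back; the string is still the same text.
utf8::downgrade($copy);
utf8::downgrade($copy);  # harmless: the string is already downgraded
my $down_flag  = utf8::is_utf8($copy) ? 1 : 0;
my $down_equal = ($copy eq $text) ? 1 : 0;

# Encoding twice is a bug: the second call treats each byte as a character.
my $once  = encode('UTF-8', $text);
my $twice = encode('UTF-8', $once);

print "upgraded:   is_utf8=$up_flag same_text=$up_equal\n";
print "downgraded: is_utf8=$down_flag same_text=$down_equal\n";
printf "encoded once: %d bytes, encoded twice: %d bytes\n",
    length($once), length($twice);
```

The string compares equal to the original in both internal formats, which is why the flag is normally none of your concern.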

Your code should be

    use Encode qw(is_utf8 encode decode);
    binmode STDOUT, ':encoding(iso-8859-1)';
    my $str = "This's a \x{201c}test\x{201d}"; # This is a "decoded" str.
    print "$str\n"; # Encoded by :encoding

or

    use Encode qw(is_utf8 encode decode);
    my $str = "This's a \x{201c}test\x{201d}"; # This is a "decoded" str.
    print encode('iso-8859-1', "$str\n"); # Include the LF.

Why, perl say that it's an utf8 but can't decode it?

Perl said the internal encoding is UTF-8. You shouldn't have to care what the internal encoding is. (Unfortunately, you still need to know in some circumstances. This isn't one of those.)

if \x{201c} is not an utf8 character

There's no such thing as a "utf8 character" or "UTF-8 character" since utf8 and UTF-8 aren't character sets. \x{201c} generates a Unicode character (U+201C, LEFT DOUBLE QUOTATION MARK) which can be encoded using UTF-8.
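To make the distinction concrete, here is a minimal sketch (variable names are illustrative only) showing that U+201C is a single character, and that UTF-8 is just one way of turning it into bytes:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char  = "\x{201c}";               # one Unicode character, U+201C
my $bytes = encode('UTF-8', $char);   # that character encoded as UTF-8

# Show each byte of the encoded form in hex.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

# Decoding the bytes gives back the original character.
my $round_trip = decode('UTF-8', $bytes);

printf "%d character -> %d bytes: %s\n", length($char), length($bytes), $hex;
printf "round trip ok: %s\n", ($round_trip eq $char ? 'yes' : 'no');
```

The character count is 1 before encoding and the byte count is 3 after, with the bytes being E2 80 9C.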


Re^2: Decoding, Encoding string, how to?
by way (Sexton) on Apr 03, 2009 at 05:51 UTC

    Your graphic is really helpful for understanding how encoding works in Perl.

    You have been really clear. Just one more thing: if I want to print using iso-8859-1, could I do it by downgrading? Downgrading changes the internal encoding to iso-8859-1, so when I print the string (in the normal case) I'll get iso-8859-1 text in the output, won't I?

    I checked the examples and ran some other tests based on the graphic, such as downgrading, but I couldn't print the original character (U+201C, LEFT DOUBLE QUOTATION MARK). Thinking about it, I see that it can't be represented in the iso-8859-1 charset, but I found different behaviours depending on the approach:

    1- If I downgrade the string, perl dies with a message about wide characters. I guess that's important, because otherwise it could truncate the internal string without notice. In fact, we can check whether it's downgradable or not using:

    my $str = "This's a \x{201c}test\x{201d}";
    unless (utf8::downgrade($str, 1)) {
        die "Isn't downgradable\n";
    }

    2- Using :encoding on an output stream, I get two warnings saying that perl can't map the character to iso-8859-1, and in the output the unmapped character appears as a string like \x{201c}.

    3- Using Encode::encode, the unmapped character is printed as a ? (question mark) and no warning is reported.

    Thank you so much, it's a great explanation.

      I see question marks, but I'm not sure if there's a question in there. You seem to have a good grasp of the concept.

      if a want to print using iso-8859-1 it could be possible downgrading

      You'd get the right result, at the cost of confusing your readers. You'd be saying you're doing one thing (changing the internal format) while actually doing another (changing the encoding of the string).

      I see that it's not representing in the iso-8859-1 charset

      Correct, iso-8859-1 cannot encode U+201C. cp1252 can. cp1252 is Microsoft's extension of iso-8859-1. It's a commonly used encoding in the Windows world, which is why U+201C is encountered frequently.
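A short sketch of the difference between the two encodings, assuming the stock Encode module (which ships with cp1252 support). Only the single character U+201C is encoded here:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $quote = "\x{201c}";   # U+201C, LEFT DOUBLE QUOTATION MARK

# cp1252 has a slot for this character.
my $cp1252 = encode('cp1252', $quote);

# iso-8859-1 does not; with no third argument, encode substitutes "?".
my $latin1 = encode('iso-8859-1', $quote);

printf "cp1252:     1 byte, 0x%02X\n", ord($cp1252);
printf "iso-8859-1: substituted with '%s'\n", $latin1;
```

In cp1252 the character lands on byte 0x93, which iso-8859-1 assigns to a C1 control character instead.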

      we can check if it's downgradable or not using utf8::downgrade($str, 1)

      Indeed. I have used that very code to make sure a sub was only given bytes before calling a function that expects to only get bytes. At the same time, it makes sure the bytes aren't internally encoded as UTF-8. Most XS functions can't handle that (which is really a bug in the XS function).
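As a rough sketch of that guard: require_bytes below is a hypothetical helper name, not something from a module. It works on a copy of its argument, so the caller's string is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a downgraded copy of its argument,
# or dies if the argument contains characters above 0xFF.
sub require_bytes {
    my ($buf) = @_;                      # work on a copy
    utf8::downgrade($buf, 1)             # second arg true: return false instead of dying
        or die "Expected bytes, but found characters above 0xFF\n";
    return $buf;
}

# Bytes that happen to be stored internally as UTF-8 are fine...
my $upgraded = "\xE9\xFF";
utf8::upgrade($upgraded);
my $bytes = require_bytes($upgraded);
my $flag  = utf8::is_utf8($bytes) ? 1 : 0;

# ...but real wide characters are rejected.
my $error = eval { require_bytes("\x{201c}"); 1 } ? '' : $@;

print "returned string has is_utf8=$flag\n";
print "wide character rejected: $error";
```

The returned copy is guaranteed to be stored one byte per character, which is what most XS code silently assumes.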

      using Encode::encode the unmapped character is printed as an ? question symbol and not report any notice

      How encode handles errors is configurable using its third parameter.
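For illustration, a minimal sketch of three of the documented choices for that third parameter, using the string from this thread. LEAVE_SRC is added to the croaking call as a precaution so the source string is not touched.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $str = "This's a \x{201c}test\x{201d}";

# No third argument: unmappable characters become the substitution character.
my $default = encode('iso-8859-1', $str);

# FB_PERLQQ: unmappable characters become \x{HHHH} escapes.
my $perlqq = encode('iso-8859-1', $str, Encode::FB_PERLQQ());

# FB_CROAK: die on the first unmappable character.
my $ok = eval {
    encode('iso-8859-1', $str, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
my $croak_msg = $ok ? '' : $@;

print "default: $default\n";
print "perlqq:  $perlqq\n";
print "croak:   ", ($ok ? "no error\n" : $croak_msg);
```

Using FB_CROAK is often the safer choice in production code, since silent substitution hides data loss.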

        You'd get the right result, at the cost of confusing your readers. You'd be saying you're doing one thing (changing the internal format) while actually doing another (changing the encoding of the string).

        Yes, my example is meant for understanding the basic steps of how perl works internally and what I can get by handling it myself. For the most part it's a theoretical test, because in practice this should be handled with functions like Encode::encode.

        Correct, iso-8859-1 cannot encode U+201C. cp1252 can. cp1252 is Microsoft's extension of iso-8859-1, and it's commonly used encoding in the Windows world. That's why U+201C is encountered frequently.

        You're right. I really didn't know that, but checking cp1252 I can see the character U+201C.

        Here it is: http://en.wikipedia.org/wiki/Windows-1252

        Indeed. I have used that very code to make sure a sub was only given bytes before calling a function that expects to only get bytes. At the same time, it makes sure the bytes aren't internally encoded as UTF-8. Most XS functions can't handle that (which is really a bug in the XS function).

        That's really important. I'll make a mental note of that common error when using XS functions.

        You've been very helpful in understanding this topic. Your explanation is short but clear. Thank you again.
