PerlMonks  

Encode: unable to change encoding of strings

by Hue-Bond (Priest)
on Jul 09, 2006 at 00:08 UTC ( #559963=perlquestion )
Hue-Bond has asked for the wisdom of the Perl Monks concerning the following question:

PerlMonks seems to send pages in ISO-8859-1 encoding, which is giving me some unexpected trouble. To fix it, I've decided to use Encode to translate data to UTF-8 as soon as I download it. However, Perl is laughing at my attempts (the string is "Ámbito" in ISO-8859-1 and UTF-8 encodings):

$ perl -MEncode -e '$_="\xc3\x81mbito"; print $_, encode "iso-8859-1", $_' | xxd
0000000: c381 6d62 6974 6fc3 816d 6269 746f  ..mbito..mbito
$ perl -MEncode -e '$_="\xc3\x81mbito"; print $_, decode "iso-8859-1", $_' | xxd
0000000: c381 6d62 6974 6fc3 816d 6269 746f  ..mbito..mbito
$ perl -MEncode -e '$_="\xc1mbito"; print $_, encode "iso-8859-1", $_' | xxd
0000000: c16d 6269 746f c16d 6269 746f  .mbito.mbito
$ perl -MEncode -e '$_="\xc1mbito"; print $_, decode "iso-8859-1", $_' | xxd
0000000: c16d 6269 746f c16d 6269 746f  .mbito.mbito

Most probably, PEBKAC but I don't see the "P". It isn't late enough (2:00 am) for this to be a silly thing :). I've "resolved" it using from_to but I don't have the feeling that this is the right way.

Update: No echo | perl needed. Changed examples accordingly. Clarified the meaning of the string literal. Updated PEBKAC link to use [jargon://].

--
David Serrano

Re: Encode: unable to change encoding of strings
by ioannis (Priest) on Jul 09, 2006 at 02:20 UTC
    This one-liner should work. But since I sit at a console terminal, I'd rather not change fonts and consolechars just to verify it visually.

    The -COE switch sets STDOUT and STDERR to utf8, and binmode sets STDIN to latin1.

    perl -COE -pe 'BEGIN{binmode STDIN, q(:encoding(latin1))}' < file

      The -COE switch sets STDOUT and STDERR to utf8

      Nice to know :^).

      However, I'm not getting data from a filehandle, but from an XML::Simple object. It was my mistake not to make that clear at the first attempt. I've updated the OP in the hope that it's better stated now.

      --
      David Serrano

Re: Encode: unable to change encoding of strings
by shmem (Canon) on Jul 09, 2006 at 08:06 UTC
    good morning Hue-Bond,

    *hint* from_to does decode and encode in place.

    saludos,
    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      from_to does decode and encode in place

      Yes, that's what the documentation says. I'm using it accordingly, so no surprises here.
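
For reference, a minimal sketch of the from_to approach mentioned above. The variable names are illustrative, and the input is assumed to be the ISO-8859-1 octets of "Ámbito" from the OP:

```perl
use strict;
use warnings;
use Encode qw(from_to);

# "Ámbito" as ISO-8859-1 octets: 0xC1 followed by ASCII "mbito".
my $octets = "\xc1mbito";

# from_to() rewrites the octets in place and returns the length of the
# converted string in octets (undef on failure).
my $len = from_to($octets, "iso-8859-1", "utf-8");
die "conversion failed" unless defined $len;

# Dump the resulting octets as hex so no terminal encoding is involved.
printf "%d octets: %s\n", $len,
    join " ", map { sprintf "%02x", ord } split //, $octets;
```

Note that from_to works on octets only: the result is a plain byte string holding UTF-8 data, not a decoded character string.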

      perl -MEncode -e '$_="\xc1mbito"; print $_; $_ = decode "iso-8859-1", $_; print encode "utf-8", $_'

      So you are using decode to translate the input from ISO-8859-1 to "Perl's internal form", whatever that is, and then encode to print it out in the desired encoding. That's two calls, and I think this "problem" could be solved with just one; after all, it's a simple matter of changing the encoding of a string! The gotcha may be that I'm cheating by assuming that "Perl's internal form" is UTF-8 (I think it is, but I shouldn't be assuming it anyway). So what I was trying to do was decode the ISO-8859-1 input into UTF-8 with a single call to decode and then use it without further modification (this is the third example in the OP; the others just illustrate the issue).

      Your snippet makes sense and agrees with something I read recently: data should be decoded when acquired, used within the program, and finally encoded again when handed back to the outside world.

      --
      David Serrano
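
The "decode when acquired, encode when handed back" pattern described above might look roughly like this. It is only a sketch: the variable names are made up for illustration, and the input is assumed to arrive as ISO-8859-1 octets:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they arrive from the outside world (assumed ISO-8859-1).
my $input = "\xc1mbito";

# 1. Decode once, at the boundary: octets become characters.
my $text = decode("iso-8859-1", $input);

# 2. Work on characters inside the program.
my $upper = uc $text;

# 3. Encode once, on the way out: characters become UTF-8 octets.
my $output = encode("UTF-8", $upper);

printf "%d characters in, %d octets out\n", length($text), length($output);
```

The two calls are what keep the program independent of whatever Perl uses internally: nothing in the middle step needs to know the input or output encoding.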

Re: Encode: unable to change encoding of strings
by graff (Chancellor) on Jul 09, 2006 at 22:43 UTC
    I've decided to use Encode to translate data to UTF-8 as soon as I download it. However, Perl is laughing at my attempts (the string is "Ámbito" in ISO-8859-1 and UTF-8 encodings)

    Single-byte values in the range \x80-\xFF have a somewhat ambiguous, magical status in perl 5.8; they may be either single-byte values or "wide" utf8 characters, depending on how they are used. Consider:

    perl -e 'print "\xc1\n"' | xxd -g1
    0000000: c1 0a                                            ..
    perl -CO -e 'print "\xc1\n"' | xxd -g1
    0000000: c3 81 0a                                         ...

    In the second case, the -CO option on the command line tells perl to apply binmode ":utf8" to STDOUT. Perl 5.8's default behavior for byte values in the range 80-FF is to upgrade these automatically to two-byte utf8 characters when they are written to output through a utf8 PerlIO layer, or when the scalar containing them is explicitly flagged as a utf8 string. Otherwise, they remain single-byte values.
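
The same effect can be observed without xxd by printing to an in-memory filehandle. This is a sketch only; written_bytes is a hypothetical helper, not something from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: print $string through the given PerlIO layer into
# an in-memory file and return the bytes that were written, as hex.
sub written_bytes {
    my ($layer, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "open: $!";
    print {$fh} $string;
    close $fh;
    return join " ", map { sprintf "%02x", ord } split //, $buffer;
}

print written_bytes(":raw",  "\xc1\n"), "\n";   # c1 0a
print written_bytes(":utf8", "\xc1\n"), "\n";   # c3 81 0a
```

The string literal is identical in both calls; only the layer on the output handle differs.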

    While playing with examples, I also came across the following, which might be instructive (if not too confusing):

    $ perl -MEncode -e '$x="\xc1"; $y = decode("iso-8859-1",$x);  # $y has utf8 flag set
      $c = ( $x eq $y ) ? "eq":"ne"; print "$x $c $y\n";' | od -ctxC
    0000000 301       e   q     301  \n
             c1  20  65  71  20  c1  0a
    0000007
    $ perl -MEncode -e '$x="\xc1"; $y = encode("utf8",$x);  # utf8 flag is not set
      $c = ( $x eq $y ) ? "eq":"ne"; print "$x $c $y\n";' | od -ctxC
    0000000 301       n   e     303 201  \n
             c1  20  6e  65  20  c3  81  0a
    0000010

    The first case indicates why characters in the range 80-FF have special status in perl 5.8 (and why it's easy to get confused): they seem to be stored internally as single bytes, even when the scalar containing them is explicitly flagged as a utf8 string; whether they are single-byte or "wide" on output depends on whether you've done "binmode ':utf8'" on the given file handle. I gather this is a kind of "interim solution" intended to make a larger class of common situations "easier" to deal with (even though this default behavior is logically inconsistent with the Unicode Standard).
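
To poke at these two cases from inside a script rather than through od, something like the following can be used. It is a sketch that only reports the flag (via utf8::is_utf8), the length in characters, and the result of eq; the variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $x = "\xc1";                       # one byte, no utf8 flag
my $y = decode("iso-8859-1", $x);     # first case: decoded character string
my $z = encode("utf8", $x);           # second case: UTF-8 octets

# For each result, report the utf8 flag, the length as perl sees it,
# and whether it compares equal to the original one-byte string.
for my $case ([decoded => $y], [encoded => $z]) {
    my ($name, $value) = @$case;
    printf "%s: flag=%d length=%d %s\n",
        $name,
        utf8::is_utf8($value) ? 1 : 0,
        length($value),
        ($x eq $value ? "eq" : "ne");
}
```

The eq comparison works on characters, which is why the decoded string still compares equal to the original byte while the encoded one does not.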

    The second case shows how to assign the actual two-byte utf8 sequence for Á to a scalar, but this makes it "alien" to the perl-5.8 way of doing things. (Adding "-CO" to both cases yields predictable results.)

    Anyway, if your problem is displaying PerlMonks pages or other 8859-1 text as utf8 data (which means converting from single-byte-per-char to variable-width-char), the following will suffice:

    perl -pe 'BEGIN{binmode STDOUT,":utf8"}' < file.iso > file.utf8
    # or, using the more cryptic "-C" option:
    perl -CO -pe '' < file.iso > file.utf8
