
minimalist perl-utf8 question

by didess (Sexton)
on Feb 02, 2013 at 06:14 UTC ( [id://1016647] : perlquestion )

didess has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all !

I thought I'd understood a little of character encoding ... But now I'm completely lost :

Given these 2 minimal "scripts" (you should see 5 e-acutes in the strings):

cat
print 'ééééé ' , length 'ééééé' , "\n";
cat
use utf8; print 'ééééé' , length 'ééééé' , "\n";
When I run the first one, I see 5 splendid e-acutes, but the length is reported as 10

When I run the second one, I see 5 question marks, but the length is 5

All this is done on a MacBook Pro, perl 5.14.2, locale on next lines, in the terminal window. The preferences of the terminal are set for "UTF-8" encoding (that's why cat gives good results)

Next line: a hexadecimal dump of the "print" line. One clearly sees it's UTF-8 encoded (each e-acute is the byte pair c3 a9, which od -x shows as byte-swapped 16-bit words)

od -x
0000000 7270 6e69 2074 c327 c3a9 c3a9 c3a9 c3a9
0000020 20a9 2027 2c09 6c20 6e65 7467 2068 c327
0000040 c3a9 c3a9 c3a9 c3a9 27a9 2c20 2020 5c22
0000060 226e 0a3b
0000064
Any idea or explanation is welcome !!


Re: minimalist perl-utf8 question
by quester (Vicar) on Feb 02, 2013 at 06:36 UTC

    You need to mark STDOUT as being in UTF-8. Either binmode ..., ":encoding(UTF-8)"; or the slightly cryptic perl -C2 option will work:

    $ perl -wE 'use utf8; binmode STDOUT, ":encoding(UTF-8)"; print "ééééé ", length "ééééé", "\n";'
    ééééé 5

    $ perl -C2 -wE 'use utf8; print "ééééé ", length "ééééé", "\n";'
    ééééé 5
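The binmode variant can also be written as a script file rather than a one-liner. This is only a minimal sketch, assuming the file itself is saved as UTF-8 and the terminal expects UTF-8; the variable name `$text` is made up for illustration.

```perl
use strict;
use warnings;
use utf8;    # string literals in this file are decoded as UTF-8 text

# Encode characters as UTF-8 on output, so the terminal receives
# two bytes per e-acute instead of a single Latin-1 byte.
binmode STDOUT, ':encoding(UTF-8)';

my $text = 'ééééé';
print $text, ' ', length($text), "\n";
```

With both `use utf8` and the output layer in place, the string is counted in characters and still displays correctly.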

Re: minimalist perl-utf8 question
by mbethke (Hermit) on Feb 02, 2013 at 06:54 UTC

    One of the nasty interactions you can have with so many components (editor, programming language, terminal) working together to interpret and re-interpret octet sequences :) As quester has already explained how to fix it, this is why:

    In the first program Perl doesn't interpret much about the embedded byte string. It sees it's 10 bytes long and writes the raw bytes to STDOUT, where the terminal picks them up, recognizes they're valid UTF-8 and displays them as such. use utf8, on the other hand, tells Perl to interpret the 10 bytes as UTF-8 characters, so it realizes it's actually a string of only 5 characters. But when printing them without the binmode(), it converts them back to Latin-1, which the terminal, expecting UTF-8, cannot parse correctly.
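The byte/character distinction described above can be reproduced with the core Encode module. This is a rough sketch, not what Perl does internally: the bytes are written as escapes so the result does not depend on how the file is saved, and the variable names are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Five e-acutes as raw UTF-8 bytes (what the first script holds).
my $bytes = "\xc3\xa9" x 5;

# Decoding gives a character string (comparable to what "use utf8"
# does for literals in the source).
my $chars = decode('UTF-8', $bytes);

# Two ways of turning the characters back into bytes for output.
my $latin1 = encode('ISO-8859-1', $chars);   # one byte per e-acute
my $utf8   = encode('UTF-8', $chars);        # two bytes per e-acute

printf "bytes: %d, chars: %d, latin1: %d, utf8: %d\n",
    length($bytes), length($chars), length($latin1), length($utf8);
# prints "bytes: 10, chars: 5, latin1: 5, utf8: 10"
```

The 5-byte Latin-1 form is what a UTF-8 terminal cannot display, which would explain the question marks.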

    Edit: you were unlucky in that é is a valid character in Latin-1 so it can be "downgraded" to 1-byte encoding without complaints. Had you had some character outside that range in your string (\x{123} works fine) you'd have gotten a "Wide character in print" warning.
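To see the difference between the two cases, one can print to a handle with no encoding layer and capture the warnings. This is a small sketch, assuming an in-memory filehandle behaves like a plain STDOUT for this purpose; the `@warnings` array and the handler are illustration only.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

# In-memory filehandle with no encoding layer, standing in for a plain STDOUT.
open my $fh, '>', \my $buffer or die "cannot open in-memory handle: $!";

print {$fh} "\x{e9}";    # e-acute fits in one byte: written silently
print {$fh} "\x{123}";   # code point above 0xFF: triggers the warning
close $fh;

print "warnings seen: ", scalar(@warnings), "\n";
print $warnings[0] if @warnings;
```

Only the second print should produce a warning, because only that character cannot be downgraded to a single byte.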

Re: minimalist perl-utf8 question
by kcott (Archbishop) on Feb 02, 2013 at 07:00 UTC

    G'day, Didier,

    I don't know what you were expecting. I'll assume you expected to see either 5 or 10 output from both scripts although maybe you expected something else - please clarify.

    The utf8 pragma refers to characters in the source code - the documentation is very clear about this. So, when your e-acute characters are part of the source, what you have here is as expected.

    For what it's worth, I'm using a Mac Pro and the same version of Perl as you:

    $ perl -E 'say length q{ééééé}'
    10
    $ perl -E 'use utf8; say length q{ééééé}'
    5

    If the e-acute characters are external to the source code, use utf8; will have no effect:

    $ perl -E 'say length $ARGV[0]' ééééé
    10
    $ perl -E 'use utf8; say length $ARGV[0]' ééééé
    10
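For data that comes from outside the source, such as command-line arguments, the bytes have to be decoded explicitly before length counts characters. This is a minimal sketch, assuming the shell passes arguments as UTF-8; `char_length` is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: count characters in a byte string assumed to be UTF-8.
sub char_length {
    my ($raw) = @_;
    return length(decode('UTF-8', $raw));
}

# Use the first argument if given, otherwise five e-acutes as raw bytes.
my $arg = @ARGV ? $ARGV[0] : "\xc3\xa9" x 5;

print length($arg), " bytes, ", char_length($arg), " characters\n";
```

Run with ééééé as the argument in a UTF-8 terminal, this should report 10 bytes and 5 characters, with or without use utf8.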

    You might also like to take a look at the length function which also has some information regarding this issue.

    -- Ken

      Thanks for explanations.

      I was expecting 5 for the length and 5 e-acutes for the string, whatever the coding of the characters.

Re: minimalist perl-utf8 question (perlunitut)
by Anonymous Monk on Feb 02, 2013 at 07:50 UTC
      I had believed the "use utf8;" pragma took care of everything ;-)

      Now, it seems clear to me

      Many thanks to all of you!