PerlMonks  

encoding question

by 7stud (Deacon)
on May 20, 2010 at 07:00 UTC (#840861=perlquestion)
7stud has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a latin-1 string that I am trying to do some substitutions on, but the substitution is not happening:

use strict;
use warnings;
use 5.010;
use Encode;

my $str = "ThÂt Âpple";    # 'A with circumflex'
say $str;

my $unicode_str = decode('iso-8859-1', $str);
my $pattern = "\x{00C2}";    # Unicode for 'A with circumflex'
$unicode_str =~ s/$pattern//g;

my $latin1_str = encode('iso-8859-1', $unicode_str);
say $latin1_str;

The output from the first say() indicates there is something wrong from the very beginning. Instead of seeing an "A with circumflex", I see an "A with tilde". Then no substitution is performed, and I see the same string that was output the first time. My terminal is set to Latin-1.

If I add a use utf8 statement, then for the first say() I see "A with diaeresis (umlaut)", but then the substitutions are performed, and the second say() outputs "Tht pple". However, the utf8 docs specifically warn against using a use utf8 statement for upper Latin-1 codes, and because I'm not using any UTF-8 characters in my program file, it doesn't make sense to me to include that statement.

How can I get the first say() to output a string where I see "A with circumflex", and how can I get the substitution to work as well?

Thanks

Re: encoding question
by moritz (Cardinal) on May 20, 2010 at 07:13 UTC
    If I add a use utf8 statement, then for the first say() I see "A with diaeresis (umlaut)", but then the substitutions are performed, and the second say() outputs "Tht pple".

    I don't see two say() statements in your program, only one... care to enlighten me?

    use utf8; is used to indicate that the script itself is stored as UTF-8. If that's not the case, don't use it.

    If I store your script as Latin-1 and execute it, the substitution works.

    Perl 6 - links to (nearly) everything that is Perl 6.

      edited my post

      use utf8; is used to indicate that the script itself is stored as UTF-8. If that's not the case, don't use it.

      That was my understanding as well, but I noticed that it allowed the substitution to work.

        So your script is stored in UTF-8.

        Update: And to elaborate, since your script is stored in UTF-8, the string literal is also UTF-8. Decoding a UTF-8 string as Latin-1 is nonsensical. Either keep your script in UTF-8 and use utf8; (preferred), or store your script as Latin-1.
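
        A minimal sketch of the preferred option, assuming the file really is saved as UTF-8 (the name $stripped is illustrative and not from the thread). With use utf8 the literal is already a character string, so the decode() step is dropped and encoding happens only on output:

        ```perl
        use strict;
        use warnings;
        use utf8;                     # assumption: this file is saved as UTF-8
        use Encode qw(encode);

        my $str = "ThÂt Âpple";       # already characters, no decode() needed

        # Remove every U+00C2 ('A with circumflex') from a copy of the string.
        (my $stripped = $str) =~ s/\x{C2}//g;

        # Encode only at the output boundary, here for a Latin-1 terminal.
        print encode('iso-8859-1', $str), "\n";
        print encode('iso-8859-1', $stripped), "\n";
        ```

        If the file is saved as Latin-1 instead, the use utf8 line has to go and the original decode()/encode() pair is the appropriate route.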

Re: encoding question
by Krambambuli (Deacon) on May 20, 2010 at 07:52 UTC
    To make sure things are what they should be and that you aren't fooled by anything in between (locale, terminal encoding/character set, fonts used, ...), inspect both the script _and_ the output with a hex editor/viewer.

    Otherwise, what you see might look good but be bad or vice-versa.
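
    As a rough sketch of that kind of inspection done from Perl itself (hex_bytes is a hypothetical helper, not something from the thread), the following prints the bytes that 'A with circumflex' occupies in each encoding:

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode);

    # Hypothetical helper: render each byte of a byte string as two hex digits.
    sub hex_bytes {
        my ($bytes) = @_;
        return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
    }

    # U+00C2, written as an escape so the result does not depend on
    # how this file itself is saved.
    my $char = "\x{C2}";

    print 'Latin-1: ', hex_bytes(encode('iso-8859-1', $char)), "\n";   # C2
    print 'UTF-8:   ', hex_bytes(encode('UTF-8', $char)), "\n";        # C3 82
    ```

    On a Latin-1 terminal the byte C3 is drawn as "A with tilde", which is consistent with the first say() in the question if the script was saved as UTF-8.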


    Krambambuli
    ---
Re: encoding question
by JavaFan (Canon) on May 20, 2010 at 07:54 UTC
    The substitution should happen regardless of whether the string has the UTF-8 flag set or not.

    However, what is important is how the 'A with circumflex' is encoded in the source code, and what Perl thinks the encoding is.

    To avoid problems, I always try not to have any characters with code points over 127 in my source code, and especially avoid the code points 128-255. If I were to write the line, I would write it as:

    my $str = "Th\x{92}t \x{92}pple";
    That should work regardless of whether perl thinks my source code is in UTF-8 format or not. (It may still get confused if it thinks my source code is written in EBCDIC, but that's no worry for me.)
      This approach gets a bit complicated when one works with texts in languages that use many letters not present in the 32..127 range, though.
      I don't really like that, sorry.

      If someone else has to look at the code (and let's suppose you have more such characters following each other in the source), 'reading' what's supposed to be there will be a chore.

      Instead, I'd emphasize in comments how the source code should be viewed _and_ what exactly should be seen.

      Your solution makes it easier for the machines, but harder for humans.

      I like it much more the other way round.


      Krambambuli
      ---
        Instead, I'd emphasize in comments how the source code should be viewed

        Two forms of such "comments" can actually be read by programs:

        # for perl:
        use utf8;

        # at the end, for my favourite editor:
        # vim: fileencoding=utf-8
        Perl 6 - links to (nearly) everything that is Perl 6.
        Your solution makes it easier for the machines
        But you aren't getting the correct results. And since there's an additional cut-and-paste step involved, it's hard to debug from a website how your source is encoded.

      \x{92} is not very readable. To support that point, I bet no one even noticed you used the wrong number. (It should be C2.) \N provides a more readable mechanism:

      use charnames ':full';
      my $str = "Th\N{LATIN CAPITAL LETTER A WITH CIRCUMFLEX}t \N{LATIN CAPITAL LETTER A WITH CIRCUMFLEX}pple";

      It's pretty long, but shortcuts are provided and you can create your own.
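
      A small sketch of the "create your own" part, using the :alias option of the charnames pragma. The alias name A_CIRC is made up for illustration; any name you choose would do:

      ```perl
      use strict;
      use warnings;
      use charnames ':full', ':alias' => {
          A_CIRC => 'LATIN CAPITAL LETTER A WITH CIRCUMFLEX',   # hypothetical alias
      };

      my $long  = "Th\N{LATIN CAPITAL LETTER A WITH CIRCUMFLEX}t";
      my $short = "Th\N{A_CIRC}t";    # same character via the custom alias
      my $hex   = "Th\x{C2}t";        # same character via its code point

      print "all three spellings agree\n"
          if $long eq $short && $short eq $hex;
      ```

      The alias keeps the source pure ASCII while staying shorter than the full Unicode name.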

        I forgot about \N characters. Thanks.
