PerlMonks  

Windows-1252 characters from \x{0080} thru \x{009f}

by Jim (Curate)
on Apr 18, 2012 at 22:46 UTC (#965817=perlquestion)
Jim has asked for the wisdom of the Perl Monks concerning the following question:

This seems plainly wrong to me:

C:\>chcp
Active code page: 1252

C:\>type match_test_1.pl
#!perl
use strict;
use warnings;

my $pattern = qr/\A\w+\z/;

my @words = qw( Tšekissä Žena Œdipus Rex );

for my $word (@words) {
    my $result = $word =~ $pattern ? "matches" : "doesn't match";
    printf qq/The word "%s" %s the pattern %s\n/, $word, $result, $pattern;
}

C:\>perl match_test_1.pl
The word "Tšekissä" doesn't match the pattern (?^:\A\w+\z)
The word "Žena" doesn't match the pattern (?^:\A\w+\z)
The word "Œdipus" doesn't match the pattern (?^:\A\w+\z)
The word "Rex" matches the pattern (?^:\A\w+\z)

C:\>type match_test_2.pl
#!perl
use strict;
use warnings;
use open qw( :encoding(Windows-1252) :std );

my $pattern = qr/\A\w+\z/;

my @words = qw( Tšekissä Žena Œdipus Rex );

for my $word (@words) {
    my $result = $word =~ $pattern ? "matches" : "doesn't match";
    printf qq/The word "%s" %s the pattern %s\n/, $word, $result, $pattern;
}

C:\>perl match_test_2.pl
"\x{009a}" does not map to cp1252 at match_test_2.pl line 12.
The word "T\x{009a}ekissä" doesn't match the pattern (?^:\A\w+\z)
"\x{008e}" does not map to cp1252 at match_test_2.pl line 12.
The word "\x{008e}ena" doesn't match the pattern (?^:\A\w+\z)
"\x{008c}" does not map to cp1252 at match_test_2.pl line 12.
The word "\x{008c}dipus" doesn't match the pattern (?^:\A\w+\z)
The word "Rex" matches the pattern (?^:\A\w+\z)

C:\>perl -v

This is perl 5, version 14, subversion 2 (v5.14.2) built for MSWin32-x86-multi-thread

Copyright 1987-2011, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

C:\>

What's going on? Why is Perl complaining that, for example, "\x{009a}" does not map to cp1252? It does map to cp1252 (Windows-1252 or ANSI).

Jim

Re: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding)
by tye (Cardinal) on Apr 19, 2012 at 01:44 UTC

    No, "\x{009a}" (a Unicode character) does not map to cp1252.

    You did not tell Perl a specific encoding to use for your source code. So Perl assumed that your source code was encoded in Latin-1. Your examples show that you treated your source code as encoded in Windows-1252. So it isn't particularly surprising that Perl and you disagree about some of the characters in your source code (hard-coded into string literals).

    So, for example, byte \x9a looks like an accented character when interpreted as Windows-1252 (something that this website also does -- check the headers). It looks just like (is the same character as) the Unicode character "\x{0161}" (š).

    But Perl assumes that byte \x9a is in Latin-1 and so treats it the same as the Unicode character "\x{009a}" (a control character, 'single character introducer', that shouldn't be visible if I tried to reproduce it here), which is a character not available in Windows-1252.

    So Perl tells you that it can't convert that character to Windows-1252.
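A minimal sketch of the disagreement described above, assuming the core Encode module is available. It decodes the same byte both ways and then tries to encode each result back to cp1252; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

my $byte = "\x9a";    # the byte in question

# The same byte decoded under the two competing assumptions.
my $as_cp1252 = decode( 'cp1252',     $byte );    # U+0161, the letter š
my $as_latin1 = decode( 'iso-8859-1', $byte );    # U+009A, a control character

printf "as cp1252:  U+%04X\n", ord $as_cp1252;
printf "as Latin-1: U+%04X\n", ord $as_latin1;

# U+0161 encodes back to cp1252 as the original byte.
my $round_trip = encode( 'cp1252', $as_cp1252 );

# U+009A has no cp1252 mapping, so a strict encode dies.
# A copy is passed because encode() may modify its argument when a CHECK value is given.
my $copy  = $as_latin1;
my $ok    = eval { encode( 'cp1252', $copy, Encode::FB_CROAK ); 1 };
my $error = $ok ? '' : $@;

print "encode to cp1252 failed: $error" unless $ok;
```

The failing encode in this sketch is the same situation the :encoding(Windows-1252) output layer ran into in match_test_2.pl.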

    Now, it has become very common for things claiming to be Latin-1 to actually include bytes from Windows-1252 with the desire and expectation to have them interpreted as Windows-1252 not as Latin-1. So common that w3c even decided that web pages claiming to be Latin-1 should actually just be treated like they claimed that they were Windows-1252.

    And it looks like that decision may have confused, for example, http://www.fileformat.info/info/unicode/char/009a/index.htm, which (for me, anyway) shows a nice hatted 's' despite claiming it is an "Other, Control" type of character (compare to http://www.fileformat.info/info/unicode/char/0161/index.htm).

    [ Note that the w3c declaring "treat Latin-1 as Windows-1252" for web pages, does not change the definition of either of those character sets nor have any impact on how Encode converts between them nor on how Perl treats script source code (not downloaded from a web page). ]

    - tye        

      Thank you for your thorough explanation, tye. You answered my question.

      The W3C is doing the right thing. (See 8.2.2.2 Character encodings in the HTML5 working draft specification.) Its willful violation of anachronistic standards for compelling, practical reasons is, IMHO, a practice that is overdue in Perl 5. By now, Perl 5 should also be defaulting to Windows-1252 instead of to ISO 8859-1 (Latin 1). Its failure to do this is one of the little things that make Perl 5 seem old and crufty, especially to Windows programmers. By dogmatically adhering to some misguided commitment to compatibility and portability, Perl 5 violates the principle of least astonishment.

      By the way, I had done something like this…

      C:\>chcp
      Active code page: 1252

      C:\>type match_test_3.pl
      #!perl
      use strict;
      use warnings;
      use open qw( :encoding(Windows-1252) :std );

      my $pattern = qr/\A\w+\z/;

      for my $word (@ARGV) {
          my $result = $word =~ $pattern ? "matches" : "doesn't match";
          printf qq/The word "%s" %s the pattern %s\n/, $word, $result, $pattern;
      }

      C:\>perl match_test_3.pl Tšekissä Žena Œdipus Rex
      "\x{009a}" does not map to cp1252 at match_test_3.pl line 12.
      The word "T\x{009a}ekissä" doesn't match the pattern (?^:\A\w+\z)
      "\x{008e}" does not map to cp1252 at match_test_3.pl line 12.
      The word "\x{008e}ena" doesn't match the pattern (?^:\A\w+\z)
      "\x{008c}" does not map to cp1252 at match_test_3.pl line 12.
      The word "\x{008c}dipus" doesn't match the pattern (?^:\A\w+\z)
      The word "Rex" matches the pattern (?^:\A\w+\z)

      C:\>

      …before I posted my inquiry here to prove to myself that the problem wasn't just with the use within the Perl source file of Windows-1252 characters in the range from 80 thru 9F.

      There's a Feedback button at the bottom of the page http://www.fileformat.info/info/unicode/char/009a/index.htm. ☺

      Thanks again.

      Jim

        By now, Perl 5 should also be defaulting to Windows-1252 instead of to ISO 8859-1 (Latin 1)

        I don't know of a single place where Perl assumes iso-8859-1.

        There are many places where Perl requires strings of Unicode code points. (In the above program, those would be the match operator and the encoder.) Since the strings passed to those were created by assigning each byte to a character, each byte is taken to be a Unicode code point. Not an iso-8859-1 character.

        This makes it *look* like Perl defaults to iso-8859-1, but there is no "default" since there is only ever one thing those functions can accept. Because there is no default, it also means the default cannot be changed, to cp1252 or anything else.
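A small sketch of that point, assuming only core Perl and the Encode module: a string built from a raw byte carries the byte's numeric value as its code point, and neither a change of internal storage nor the encoder treats it as anything else. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $s = "\x9a";    # one character, created from one undecoded byte

my $before = ord $s;    # 0x9A
utf8::upgrade($s);      # changes the internal storage format only
my $after = ord $s;     # still 0x9A: the character is U+009A either way

# The encoder takes the string to be the code point U+009A and emits
# that code point's UTF-8 form, the two bytes C2 9A.
my $utf8_bytes = encode( 'UTF-8', $s );

printf "code point before upgrade: U+%04X\n", $before;
printf "code point after upgrade:  U+%04X\n", $after;
printf "UTF-8 bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;
```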

        Perl 5 should also be defaulting to Windows-1252 instead of to ISO 8859-1 (Latin 1).

        I really hope this will NEVER happen, not even on a Windows platform. cp1252 is only "default" on Windows; it is not the default on any other platform, and changing perl5's default to cp1252 would break every script that assumes the current default (wise or not).

        Most perl scripts are cross-platform portable, at least they can be when the programmer follows the basic porting rules. Most of my scripts and modules are cross platform, and I do test my modules on HP-UX, Linux, AIX and Windows (and sometimes even on OSX when I can access such architecture).

        That said, the default IMHO is likely to change for Windows. If not in Windows 8 (or whatever they will call it) then maybe Windows 9 or 10 will have Unicode as default character set. Problem solved. I already use UTF-8 as default encoding on all my browsers (Opera, Firefox, Konqueror, Opera Mobile) and IRC.

        My advice to you would be to switch to using utf-8 (and declare 'use utf8;' next to 'use strict;' and 'use warnings;' at the head of your scripts when you do).
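A minimal sketch of that advice, assuming the script file itself is saved as UTF-8 and that the terminal displays UTF-8. The output layer and the @matched variable are additions for illustration, not part of the original test scripts.

```perl
#!perl
use strict;
use warnings;
use utf8;                              # string literals below are UTF-8 in the source file

binmode STDOUT, ':encoding(UTF-8)';    # encode output for a UTF-8 terminal (assumed)

my $pattern = qr/\A\w+\z/;
my @words   = qw( Tšekissä Žena Œdipus Rex );

# With "use utf8", the literals are already character strings, so \w
# sees š, Ž and Œ as letters without any explicit decode step.
my @matched = grep { $_ =~ $pattern } @words;

printf "%d of %d words match the pattern %s\n",
    scalar @matched, scalar @words, $pattern;
```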


        Enjoy, Have FUN! H.Merijn
        Perl 5 should also be defaulting to Windows-1252 … little things that make Perl 5 seem old and crufty … Perl 5 violates the principle of least astonishment.

        So, a highly limited Latin only encoding seems modern/uncrufty to you in 2012? There are many encodings and it’s pretty easy with newer perls to use whatever you like or to default to the entirely reasonable utf-8. And not to put too fine a point on it—as the kids used to say and with full knowledge that two of the very best hackers on PM are WinCats—but I’ve never ceased to be astonished that anyone used Windows ever.

Re: Windows-1252 characters from \x{0080} thru \x{009f}
by graff (Chancellor) on Apr 19, 2012 at 02:01 UTC
    tye has covered most of the important stuff. I'll just add that in order for your first code snippet to DWYM, it would have to go something like this (note the addition of "use Encode", setting the io layer on STDOUT, and applying "decode" to the literals being assigned to @words):

    #!perl
    use strict;
    use warnings;
    use Encode;

    binmode STDOUT, ":encoding(cp1252)";

    my $pattern = qr/\A\w+\z/;

    my @words = map { decode( "cp1252", $_ ) } qw( Tšekissä Žena Œdipus Rex );

    for my $word (@words) {
        my $result = $word =~ $pattern ? "matches" : "doesn't match";
        printf qq/The word "%s" %s the pattern %s\n/, $word, $result, $pattern;
    }

    When I run that in a terminal that is using cp1252 (aka "Windows Latin1"), the resulting output is:

    The word "Tšekissä" matches the pattern (?-xism:\A\w+\z)
    The word "Žena" matches the pattern (?-xism:\A\w+\z)
    The word "Œdipus" matches the pattern (?-xism:\A\w+\z)
    The word "Rex" matches the pattern (?-xism:\A\w+\z)
    UPDATE: To clarify, the point here is that when it comes to matching things outside the ASCII range, regex constructs like '\w' will only employ Unicode semantics, not cp1252 or any other semantics, so they need to operate on strings that have their perl-internal-utf8 flag set to true (i.e. have been decoded from "external" forms, whether by reading through the appropriate io layer, or by explicit decoding).
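The other route mentioned in the update, reading through an io layer, might look roughly like this. It is a sketch only: the in-memory "file" of raw cp1252 bytes is a stand-in for a real input file, and the variable names are made up for the example.

```perl
use strict;
use warnings;

# Raw cp1252 bytes standing in for the contents of a file (hypothetical data).
my $raw = "T\x9aekiss\xe4\n\x8eena\n\x8cdipus\nRex\n";

# The :encoding(cp1252) layer decodes the bytes into characters as they are read.
open my $in, '<:encoding(cp1252)', \$raw
    or die "Cannot open in-memory file: $!";
chomp( my @words = <$in> );
close $in;

my $pattern = qr/\A\w+\z/;
my @matched = grep { $_ =~ $pattern } @words;

printf "%d of %d words match\n", scalar @matched, scalar @words;
```

To print the words themselves, STDOUT would still need an output layer such as the binmode call in the script above.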

      Thank you very much, graff. Your reply filled in the all-important How-do-you-do-it? gap.

      Jim
