Re: Pragma to handle unicode characters
by ikegami (Patriarch) on Dec 22, 2008 at 01:32 UTC
|
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";
should be
binmode STDIN, ":encoding(UTF-8)";
binmode STDOUT, ":encoding(UTF-8)";
binmode STDERR, ":encoding(UTF-8)";
and can be replaced with
use open ':std', ':encoding(UTF-8)';
or better yet
use open ':std', ':locale';
:utf8 on input is insecure.
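A quick way to see the difference (a sketch; the exact warning text varies across perl versions): feed a malformed byte to the layer. :utf8 trusts the bytes blindly, while :encoding(UTF-8) validates them and complains.

```shell
# 0xC3 is the start of a two-byte UTF-8 sequence with no continuation byte,
# i.e. malformed input. The :encoding(UTF-8) layer warns about it on read;
# the :utf8 layer would silently accept it as if it were valid.
printf '\xC3\n' | perl -we '
    binmode STDIN, ":encoding(UTF-8)";
    my $line = <STDIN>;   # emits a "does not map to Unicode"-style warning
'
```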
open(IN, "<:utf8", "sample.txt");
should be
open(IN, "<:encoding(UTF-8)", "sample.txt");
and can be replaced with
use open IO => ':encoding(UTF-8)';
or
use open IO => ':locale';
use utf8;
indicates the source file is encoded using UTF-8. Without it, Perl assumes it's encoded using iso-latin-1.
None of the above decodes the content of @ARGV, data read from open(FH, '-') or data read via <>. The last is being worked on.
Re: Pragma to handle unicode characters (roads)
by tye (Sage) on Dec 22, 2008 at 08:42 UTC
|
Unicode was designed during the era where the concept of "byte stream" had become ubiquitous. I see but little evidence in the design of Unicode that its designers had much appreciation for the prior, messier era nor especially for the fact that what they were designing was going to destroy the then-current comfortable "everything is a byte stream" world.
There were clearly plans for the eventual utopia of "all character strings/streams are in Unicode" but if there were plans for the uncomfortable transition period that we are currently moving into the heart of (things will still continue to get worse for a little while longer before they start getting better), I haven't seen much evidence of that.
I'd expect to see leadership on this front from one or more sources of Unix operating systems (Linux, BSD, Sun, etc.). If it is there, I really haven't seen it. I still haven't seen evidence of a plan for this transition in Unix. I see incomplete pieces that try to cover the "before" case (everything is in Latin-1 or whatever) and try to cover the "after" case (everything is UTF-8), but little that deals with the mixed bag one currently usually finds oneself in, such as: I want to move forward with UTF-8 data in many of my files but tool Z can't handle that so I need to use Latin-1 data for Z but I want to keep using filenames and command lines written in my preferred Latin-2 for now.
But dealing with this transition requires defining ways for applications to declare what type of data they are prepared to deal with (covering many different interfaces: command-line arguments, environment variables, file names, text streams, etc.).
A relatively simple approach that I'd expect to see in Unix would be to enable a Unix to be built such that all text is stored in UTF-8 (file names, environment variable names and their values, text in configuration files, etc.). Then be able to declare that "application Z" only understands Latin-1 and so "application Z" gets passed an environment encoded in Latin-1 and has filename accesses translated between Latin-1 and UTF-8 for it, and text streams also get converted for it.
Actually, Win32 is over a decade ahead of Unix on this front. WinNT did all system work in UTF-16 and let each program declare whether it wanted to use single-byte characters or "UNICODE" characters. Programs can even do a little bit of extra work and access both the single-byte-character APIs and the native UTF-16 APIs.
That is why it was relatively easy for me to add Unicode support for file-system operations to Win32API::File (now if only I could finish testing and integrating those changes and get them uploaded to CPAN).
But Perl has followed along with Unix and is still mostly unprepared for the ugly middle ground we often currently find ourselves in. But Perl is also unprepared for the eventual "all characters are UTF-8" utopia. But I think that part of the proper way to prepare for that future utopia is to define much better ways for declaring what encoding should be used on the different interfaces. Perl has finally gotten a good start on that when it comes to streams (if anything, there may be too many choices, but that is a good way to figure out what the best choices should be). And Perl has an acceptable start on dealing with the dual nature for its own character strings.
But Perl has yet to define great ways of reconciling Unicode with filenames, environment variables, command lines, usernames, hostnames, etc. And a very simple "all interfaces want UTF-8" option seems like a wise goal to work toward.
And I completely disagree that it is a good thing to force one to separately declare UTF-8ness on every interface. I think it is good to allow such, especially now, if convenient. But UTF-8 is not some complex structure like PDF, HTML, JSON, etc. UTF-8 is very much like the choice between Latin-1 and Latin-2 (a choice of locale). It would be best if Perl could just notice "Oh, look, I'm finally running in the 'all is UTF-8' utopia" and work accordingly. I doubt anybody will ever be in a situation where "all streams are HTML", Unix usernames are HTML, Perl strings know whether they are HTML or just the backward-compatible "plain text", etc.
Re: Pragma to handle unicode characters
by graff (Chancellor) on Dec 22, 2008 at 04:32 UTC
|
what makes so hard to implement real unicode pragma?
Bear in mind that many people are not ready (or don't need) to pursue the use of unicode; and many of these people are dependent on Perl behaving a particular way with regard to handling i/o that is not strictly limited to ASCII characters.
These people would be severely and negatively surprised if they discovered that by installing the next version of Perl, all of their existing scripts would need to be modified in order to preserve their original behaviors with regard to file i/o.
(That actually happened once, with the introduction of Perl 5.8.0 on RedHat Linux: the particular RedHat release used utf-8 locale settings for the "default" shell environment, and that version of Perl used the locale settings in order to decide what the default i/o layer should be. Mayhem ensued because scripts that had worked previously were suddenly creating garbage. As a result, the Perl 5.8.1 release did not rely directly on locale settings for its default i/o layer selection.)
|
These people would be severely and negatively surprised if they discovered that by installing the next version of Perl, all of their existing scripts would need to be modified ...
The use of a dedicated new pragma (disabled by default) wouldn't bring about such backwards compatibility issues.
It's kind of a pity, however, that the most intuitive name "use encoding ..." is already taken...
|
Fellow monk Almut has already answered most things, to you and to the others, practically the way I think, so thank you, Almut. On one aspect I still want to make a point:
(That actually happened once, with the introduction of Perl 5.8.0 on RedHat Linux: the particular RedHat release used utf-8 locale settings for the "default" shell environment, and that version of Perl used the locale settings in order to decide what the default i/o layer should be.
If those scripts broke after the locale-derived default was rolled out, then those scripts were buggy, right? No one should use the locale pragma if they don't mean it, right?
|
You misunderstood the situation. The victims of the problem never intended to use locale information in any way in their scripts, and the scripts were not written to use locale information. It just suddenly turned out (when the script was run on that particular RedHat release with that particular Perl version) that the use of locale information was imposed on them as "the new default" -- and many of them couldn't figure out why their scripts were suddenly failing until they turned to the community for help.
"Oh, you need to change your shell environment so it doesn't use the new default utf8 locale, and/or you need to change your existing perl scripts..."
As a rule, if you want to build some new functionality into a tool, and this is incompatible in some way with previous functionality that has an established base of users, it's better not to require that those established users change all their code for the sake of the new feature (which they might not have wanted in the first place).
|
Re: Pragma to handle unicode characters
by almut (Canon) on Dec 22, 2008 at 01:15 UTC
|
(...) what makes so hard to implement real unicode pragma?
I don't have an answer to your question, but I'd just like to point out another issue that you haven't even touched on: the handling of file names (such as sämple.txt)...
|
How simple do you want encoding/decoding? Would you like Perl to "automagically" encode/decode JSON? ASN.1? Why, specifically, do you demand it of UTF-8?
The truth is that UTF-8 is a variable-length character encoding method. It's probably a good thing that you have to explicitly decode inputs and encode outputs... it forces you to know what you are doing.
Update: an example: e-mail - yes, you can send e-mails as UTF-8! But were you aware that MIME headers must be in a 7-bit encoding? In this case blindly opening a socket and telling it to encode all UTF-8 output will severely break your application. It is much better to know specifically when and where encoding is appropriate and permissible...
|
Would you like Perl to "automagically" encode/decode JSON? ASN.1? Why, specifically, do you demand it of UTF-8?
It's a matter of convenience, primarily — and in some cases, transparency (such as having a single point of configuration where the encoding can be switched, rather than requiring every piece of code to take care of it on its own).
The comparison to JSON or ASN.1 seems somewhat far-fetched to me. Unicode is envisaged - and I think widely accepted - to eventually become the successor of legacy character encodings such as Latin-1, with their well known limits. And, among the Unicode encodings, UTF-8 would presumably be a good choice to be used as the default (because it was specifically designed with backwards compatibility in mind).
In contrast, JSON / ASN.1 are rather special purpose (and typically not used as character encodings), so I don't currently see any need to have similar built-in support for them in Perl.
The truth is that UTF-8 is a variable-length character encoding method. It's probably a good thing that you have to explicitly decode inputs and encode outputs... it forces you to know what you are doing.
Equally (with a hypothetical pure ASCII mind set in place) you could say: "The truth is that Latin-1 is a (specific) 8-bit character encoding method. It's probably a good thing that you have to explicitly decode inputs and encode outputs... it forces you to know what you are doing." — Still, we do have Latin-1 semantics by default in Perl...
Just because UTF-8 is variable length doesn't mean it wouldn't be a sensible choice in environments that otherwise make use of it, in particular when the programmer explicitly requests that very functionality using a pragma.
(...) MIME headers must be in a 7-bit encoding
The current 8-bit default for IO could cause just as much potential breakage as UTF-8 would in this case. I don't think that particular limits which apply to certain content (or parts thereof) are a good argument against generally providing a way to conveniently say "I want UTF-8 to be used as default for all strings/content" (which is what I think the OP had in mind).
Special cases can be dealt with in the application code. As things are now, UTF-8 (or, more generally, anything non-Latin-1) is still too often the "special case", rather than a (configurable!) global default.
Re: Pragma to handle unicode characters
by borisz (Canon) on Dec 22, 2008 at 00:14 UTC
|
use encoding 'utf8';
does what you want. The pragma also changes the PerlIO layer for STDIN and STDOUT.
|
The encoding pragma has issues, so I avoid it. Use of the utf8 and open pragmas is more suitable.
|
I tried it, but there are other worries: warnings like "Wide character in print at ..." appear, @ARGV is still not treated as UTF-8 characters, and uc() does not recognize its elements as characters.
And perldoc utf8 says
In case you are wondering: yes, "use encoding 'utf8';" works much the same as "use utf8;".
In perl58delta i read also:
New Unicode Semantics (no more use utf8, almost)
Previously in Perl 5.6 to use Unicode one would say "use utf8" and then the operations (like string concatenation) were Unicode-aware in that lexical scope.
So I could not even use "use utf8", theoretically ;)
Re: Pragma to handle unicode characters
by jwkrahn (Abbot) on Dec 22, 2008 at 04:01 UTC
|
I need, that every possible input and output to/from my script will treated as UTF-8.
Use the open pragma:
use open qw/ :std :utf8 /;
|
That still wouldn't handle "every possible input", such as @ARGV, file names, and - while we're at it - environment variables.
Re: Pragma to handle unicode characters
by Anonymous Monk on Dec 22, 2008 at 04:34 UTC
|
Have you tried the -C option or, equivalently, the PERL_UNICODE environment variable?
It seems to help, but I don't have enough experience with unicode strings to really test it properly.
$ perl -CSDAL -e 'print "the utf8 flag is ", utf8::is_utf8(shift) ? "on" : "off", " for command-line arguments\n"' hi...
the utf8 flag is on for command-line arguments
$ perl -e 'print "the utf8 flag is ", utf8::is_utf8(shift) ? "on" : "off", " for command-line arguments\n"' hi...
the utf8 flag is off for command-line arguments
|
Have you tried the -C option or, equivalently, the PERL_UNICODE environment variable?
No, I had not; now I have tried it and it helped me. After setting PERL_UNICODE=39 I still (just) need to set in my script
use utf8;
use open ':std', ':encoding(UTF-8)';
Thank you, it answers my first question!
It still does not affect %ENV variables, as fellow monk Almut pointed out.
|
You might want to try PERL_UNICODE=63. I think that will let you drop the 'use open' line.