Re^5: any use of 'use locale'? (source encoding)

by ikegami (Pope)
on Nov 20, 2009 at 22:50 UTC ( #808542 )

in reply to Re^4: any use of 'use locale'? (source encoding)
in thread any use of 'use locale'?

adds warning "Wide character in print at..."

You can't output characters to STDOUT without instructing Perl how to convert those characters into bytes. I provided the fix for that at the bottom of my previous post.
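The earlier post with that fix is not reproduced in this node. As a minimal sketch of what "instructing Perl" can look like, the snippet below attaches an :encoding(UTF-8) layer to a filehandle. The in-memory handle and the variable names are illustrative assumptions; they stand in for STDOUT so that the resulting bytes can be counted.

```perl
use strict;
use warnings;

# A character string: U+00F5 and U+0161, written as escapes so that the
# encoding of this source file does not matter.
my $text = "\x{F5}\x{161}";

# An in-memory filehandle stands in for STDOUT; the layer tells Perl how
# to turn characters into bytes on output.
open my $out, '>:encoding(UTF-8)', \my $buffer
    or die "cannot open in-memory handle: $!";
print {$out} $text;
close $out;

printf "%d characters became %d bytes\n", length($text), length($buffer);

# For the real STDOUT the equivalent instruction would be:
#     binmode STDOUT, ':encoding(UTF-8)';
```

With the layer in place there is no "Wide character" warning, because Perl no longer has to guess how to serialise the characters.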

So, in the bug report I intentionally left "use utf8" out, because my locale is UTF-8.

Your locale is not used to determine the encoding of the source file. You may see uc("abcõäöüšž") in your editor, but without use utf8 Perl reads the source as bytes, so you told Perl uc("abcÃµÃ¤Ã¶Ã¼Å¡Å¾"). Check the length() of the string for fun...
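The length() check can be sketched as follows. The bytes are written as escapes so that the result does not depend on how this snippet itself is saved; the variable names and the use of Encode::decode to model what use utf8 would do for a literal are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 bytes of "abcõäöüšž": what Perl sees in the literal when the
# source is UTF-8 but "use utf8" is absent.
my $bytes = "abc\xC3\xB5\xC3\xA4\xC3\xB6\xC3\xBC\xC5\xA1\xC5\xBE";

# The same text as characters: what the literal becomes under "use utf8".
my $chars = decode('UTF-8', $bytes);

printf "as bytes:      length %d\n", length $bytes;   # 3 + 6 * 2 = 15
printf "as characters: length %d\n", length $chars;   # 9

# uc() on the character string upper-cases all nine letters.
my $upper = uc $chars;
```

On the byte string, uc() has no way of knowing that \xC3\xB5 is meant to be one letter, which is why the output looked wrong.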

but if UTF-8 defines something as being a digit (\d) or a word character (\w), then it should be like that in Perl too...

There are so many problems with your statement.

  • UTF-8 doesn't define anything of the sort.

  • Unicode does define digit and letter properties, but not what Perl allows in identifiers (\w).

  • Perl can match those two Unicode properties (and the other hundred) through \p.

  • Why should Perl only recognize Unicode? What about locales? You yourself want it to recognize locales. What about POSIX? What about backwards compatibility? What about the 99% of people who use \d and \w to mean /[0-9]/ and /[a-zA-Z0-9_]/? Under Unicode, they would match much more.

So yeah, Perl allows you to match the Unicode properties. It also allows you to do other things. Sorry, but you don't get to break just about every program that uses \d and \w.
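To make "they would match much more" concrete, here is a small sketch comparing \d, the Unicode property \p{Nd}, and an explicit [0-9] class on two characters. The sample characters and variable names are chosen purely for illustration.

```perl
use strict;
use warnings;

my @samples = (
    "7",          # DIGIT SEVEN, plain ASCII
    "\x{0663}",   # ARABIC-INDIC DIGIT THREE
);

my %result;
for my $char (@samples) {
    $result{$char} = {
        d     => ($char =~ /^\d$/      ? 'yes' : 'no'),
        nd    => ($char =~ /^\p{Nd}$/  ? 'yes' : 'no'),
        ascii => ($char =~ /^[0-9]$/   ? 'yes' : 'no'),
    };
    printf "U+%04X  \\d:%-3s  \\p{Nd}:%-3s  [0-9]:%-3s\n",
        ord($char), @{ $result{$char} }{qw(d nd ascii)};
}
```

On a character string, \d accepts the Arabic-Indic digit while [0-9] does not, so code that really means ASCII digits is safer spelling out the class. Perl 5.14 and later also offer the /a modifier to restrict \d and \w to ASCII.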

Replies are listed 'Best First'.
Re^6: any use of 'use locale'? (source encoding)
by wanradt (Scribe) on Nov 21, 2009 at 04:28 UTC
    You can't output characters to STDOUT without instructing Perl how to convert those characters into bytes. I provided the fix for that at the bottom of my previous post.

    I don't consider myself a master of this topic, and here in the Monastery I have seen how deeply you know the Unicode area. So forgive me some inaccuracies, like mixing up Unicode and UTF-8, and maybe others. I will try to give a picture that has grown over the years. Nowadays, most of my scripts contain at least this block:

        use utf8;
        use open ':std' => ':encoding(UTF-8)';
        use locale;

    While trying to cover every possible hole, I ended up with something like this:

        use utf8;
        use CGI qw( -utf8 );
        use open IO => ':utf8';
        use open ':std' => ':encoding(UTF-8)';
        binmode STDIN, ":utf8";
        binmode STDOUT, ":utf8";
        my $dbh = DBI->connect(
            "dbi:mysql:dbname=$db;host=localhost", $user, $pwd,
            { mysql_enable_utf8 => 1 },
        );

    Truth is, I am not sure which of the lines above are duplicates ;) and I think there are still areas they do not cover. Are %ENV, @ARGV and file names still not handled?

    What am I looking for? Instead of reading tons of documentation to put together the whole (ugly) picture above, I'd like perluniintro to say something like this:

    If you have a properly set up UTF-8 system, you can just say

        use utf8_everywhere; # because "use utf8" means something narrower

    and relax. If your situation needs something more complex, continue reading...

    Or other way, which would be logical to me:

    use locale;

    I understand there are pieces I don't see. But how could "use locale" break code which does not use it? Or, if code does use it, what do its authors mean by "use locale" if they don't really want to use the locale?

    You say:

    Your locale is not used to determine the encoding of the source file.

    I answer: how sad! Why do I then define the locale in my system and ask Perl to use it?

    You can't output characters to STDOUT without instructing Perl how to convert those characters into bytes.

    I answer: but I did! If I have a properly defined system locale and I ask Perl to use it, then Perl should know how to convert characters. Or what am I missing here?

    I'd like to be able to easily define a scope where everything is treated as UTF-8. If I say "use locale", I mean: apply my locale to my code, whatever that locale is. So any data in this scope is treated as the locale requires. People who need \w == /[a-zA-Z0-9_]/ don't have to use the locale pragma, or a locale which defines it otherwise. But IMHO, where a different approach is needed, it would be easy to adapt.

    I understand it is a wider problem. But for now there is nothing for developers to rely on. For example, for CGI I have to say explicitly that I need UTF-8. The same for DBI. And so on. Why? Because there is no standard place where they could look it up automagically, AFAIU. If there were one big "use utf8_everywhere" which hoisted a big flag, every module author could rely on it. Or not?

    Such is the naive picture I have. I'd like to see the weak places in it. I hope I have answered most of the questions raised, but to be clear:

    Why should Perl only recognize Unicode?

    Not only. But if I ask it to use Unicode, it should. Simply and everywhere.

    What about locales? You yourself want it to recognize locales.

    Yes, I do. And I don't see a contradiction. Whatever encoding the locale uses, "use locale" should use it within its scope too.

    What about POSIX?

    Sorry, it is over my head.

    What about backwards compatibility?

    This is my weak point: I don't see how it could break something. I just don't see it, though I understand that it may. That is why I "made" the new pragma "utf8_everywhere".

    What about the 99% of people who use \d and \w to mean /[0-9]/ and /[a-zA-Z0-9_]/?

    This seems simple to me: they either a) don't "use locale" or b) "use locale" with a proper system locale. Other uses seem buggy to me.

    Nnda, WK

      Yes, Unicode support is not as good as it could be. We are in a transition phase from ASCII, the various ISO-8859-n encodings, and several multibyte encodings to Unicode. ASCII is about 35 years older than Unicode, the ISOs are still about 10 years older. The biggest problem of Unicode is that a char is no longer the same as a byte, which breaks at least 35 years of code. (At my current job, nobody knows Unicode. They still talk about ASCII, and will continue to do so for at least the next decade. So, introducing Unicode breaks 40 to 50 years of code.) And to make things worse, all Unicode encodings except for UTF-8 typically contain lots of NUL bytes, breaking even more code that expects NUL bytes only at the end of strings.
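      The remark about NUL bytes can be checked with the core Encode module. This is only a sketch: the sample string "PATH" and the choice of the three encodings are assumptions made for illustration.

      ```perl
      use strict;
      use warnings;
      use Encode qw(encode);

      my $text = "PATH";   # four ASCII characters
      my (%byte_count, %nul_count);

      for my $encoding (qw(UTF-8 UTF-16LE UTF-32LE)) {
          my $bytes = encode($encoding, $text);

          # Count the NUL bytes in the encoded form.
          my $nuls = () = $bytes =~ /\0/g;

          $byte_count{$encoding} = length $bytes;
          $nul_count{$encoding}  = $nuls;
          printf "%-8s %2d bytes, %2d NUL\n", $encoding, length($bytes), $nuls;
      }
      ```

      Any C-style API that stops at the first NUL byte would see only "P" in the UTF-16LE and UTF-32LE forms, while the UTF-8 form passes through unchanged.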

      A CPU (and all of the other hardware) has no problems with Unicode. It's not a hardware problem at all. So, the problem must start at the operating system:

      Nearly all of our current and legacy file systems assume that a char and a byte are the same, and often they also assume that a NUL byte marks the end of a filename. So, we need to change the filesystems. Very often, UTF-8 can be used instead of ASCII, leaving only some problems of byte lengths vs. character lengths and of all those old byte-based characters above 0x7F. In fact, we need to know what encoding is used for each filename, or at least for each instance of each filesystem. The operating system needs to take care of the different encodings, and offer a Unicode-based API for the filesystems. Windows has ASCII and Wide APIs for this purpose, but as far as I understand, Wide means UCS-2, which is only a subset of UTF-16 and does not cover the entire Unicode set. ASCII has no support for Unicode. I'm not quite sure whether Linux has an 8-bit-transparent API that is able to pass UTF-8 or has a real UTF-8-based API.

      So, now that we can have filenames and especially directory names in Unicode, $ENV{'PATH'} must be able to contain Unicode characters, and some other environment variables, too. So, we need a Unicode environment, preferably with support for Unicode keys. As far as I understand, Windows offers a UCS-2 environment to "Unicode" programs and an ASCII environment for non-Unicode programs. Linux provides an 8-bit-clean environment and lets each program decide about the encoding of the environment.

      As for the environment, the command line arguments must be able to contain Unicode. The same game here, Linux passes a NUL-terminated array of bytes and lets the program decide about the encoding, Windows offers two APIs depending on how the program was compiled and linked.
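      Because the program decides about the encoding on such systems, a Perl script has to decode these byte strings itself. The following is a minimal sketch that assumes the environment and the command line are UTF-8; the helper name decode_os_string is hypothetical.

      ```perl
      use strict;
      use warnings;
      use Encode qw(decode);

      # Hypothetical helper: turn a byte string received from the operating
      # system into a character string, ASSUMING the bytes are UTF-8.
      sub decode_os_string {
          my ($bytes) = @_;
          return decode('UTF-8', $bytes);
      }

      # Decoded copies of the command line arguments and environment values.
      my @args = map { decode_os_string($_) } @ARGV;

      my %env;
      $env{$_} = decode_os_string($ENV{$_}) for keys %ENV;

      printf "%d arguments and %d environment variables decoded\n",
          scalar(@args), scalar(keys %env);
      ```

      For @ARGV alone, the -CA command line switch asks perl to do the same decoding at startup. The assumption of UTF-8 is exactly the weak point: nothing in the data says which encoding was used.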

      All of those really basic things about running a program are not yet complete. I simply do not know any operating system that treats each and every string passed to its APIs as Unicode.

      A completely different problem is text files of all kinds, starting with what we call "plain text", scripts, source code, logs and so on. For each text file we read or write, we need to know its encoding. Current operating systems cannot give us the slightest hint about the encoding. HTML and XML have a default encoding and may contain hints about a different encoding. So, I/O in text mode is a huge and unsolved problem.
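      Why the encoding has to be known can be shown with a single byte. In this sketch an in-memory "file" containing the byte 0xA4 is read twice, once as ISO-8859-1 and once as ISO-8859-15; the sample data and variable names are assumptions made for illustration.

      ```perl
      use strict;
      use warnings;

      # The "file": one byte 0xA4 followed by a newline.
      my $file_bytes = "\xA4\n";
      my %code_point;

      for my $encoding ('ISO-8859-1', 'ISO-8859-15') {
          # Read the same bytes through a different decoding layer each time.
          open my $in, "<:encoding($encoding)", \$file_bytes
              or die "cannot open in-memory file: $!";
          my $line = <$in>;
          close $in;
          chomp $line;

          $code_point{$encoding} = ord $line;
          printf "%-12s U+%04X\n", $encoding, ord $line;
      }
      ```

      The same byte is the currency sign (U+00A4) in one encoding and the euro sign (U+20AC) in the other, and nothing in the file itself says which one was meant.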

      Networking: IP, TCP and UDP are all about stuffing bytes into tubes and collecting those that fall out of other tubes. ;-) No problem so far. The problems arise at higher levels, where the protocols start working with text strings. Think about the unfortunate punycode used in DNS. Think about e-mail accounts. E-mail and HTTP have at least a Charset header, solving the problem of the content. But headers are still ASCII. E-mail addresses are passed in the header. Think about FTP. I don't know how FTP would or should handle Unicode filenames.

      If we could throw away all old and existing systems and simply start a new set of operating systems, file systems and network protocols, everything would be easy and simple: Store a charset (and a content-type) with each and every file, and use some Unicode encoding instead of ASCII.

      Some newer languages took advantage of not having legacy sources. Perl is older than Unicode, and has a big legacy of old code that has to be supported. Perl 5 is about as old as Unicode, but Unicode was simply not relevant when Perl 5 was released.

      Sure, it would have been nice to have Perl 5.000 with full Unicode support, but what operating system would have been able to run it?

      What operating system can currently provide perl with a complete Unicode environment (%ENV, @ARGV, STDIN, STDOUT, STDERR, open, opendir, mkdir, rmdir, unlink, ...)?

      All Unicode problems are still transition problems. Your hypothetical "everything-is-Unicode" flag could be implemented some day, when all Perl module authors (or at least those of the major modules) have changed their code to fully support Unicode, and when Perl can use a Unicode API on all major operating systems.

      Look at DBI and the various DBDs. The first DBI version having a little bit of Unicode support is 1.38 dated 2003-Aug-21. DBD::Oracle got some Unicode support in 1.13 dated 2003-Mar-14, but to get real Unicode support, you needed at least Oracle 9, released in 2001. DBD::Pg got Unicode support with version 1.22, dated 2003-Mar-26. DBD::ODBC had no Unicode support at all until I started messing with its code and the Windows API and published a patch 2006-Mar-03. After some discussions on dbi-users, Martin J. Evans cleaned up after me and released DBD::ODBC 1.14 dated 2007-Jul-17 with minimal Unicode support. DBD::mysql got the first parts of its Unicode support in 3.0004_1, dated 2006-May-17.

      And now, file APIs. Perl on Windows uses the ASCII APIs for file I/O, probably because using the Unicode APIs would break lots of code, especially when it comes to command line arguments and the environment. And perhaps because until recently, Perl supported Windows 9x, which lacks several parts of the Unicode APIs. On other systems, there aren't even APIs where programs can talk in Unicode with the operating system.

      So, what can be done?

      • Try to find Unicode APIs. ODBC was easy, because it already existed and was (kind of) documented.
      • Try to get Unicode APIs implemented. Again, ODBC was easy because it was already done. Operating systems will be hard, because you need to change everything: kernel APIs, process structures, file systems, shells, standard utilities.
      • Provide patches and tests to get the Unicode APIs implemented.
      • Provide patches and tests to get Perl and Module code ported to the Unicode APIs.
      • If you can not do that, make people talk about the problems. Try to get them in a room and let them find a solution.
      • Or find a sponsor that pays someone to solve a problem. I wrote my Unicode patch during my work hours, simply because my work project needed it. After a short discussion with my boss ("We took so much from the community, now let's pay back a little by publishing that patch - it does not harm anybody and does not expose any of our secrets"), I got the permission to publish it.

      We won't be able to make a big jump forward, flip a switch and have all Unicode problems solved. But we can make small steps. Every journey begins with a single step.

      Expect a few more years until Unicode has truly become universal, and a few more years for all code writers to keep up. I think that the major problems at the O/S and network level need to be solved first, before we can change Perl. Windows could be a good test environment, because it already has Unicode APIs.

      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        Thank you, Alexander! It was eye-opening, but it did not answer my most important questions. You answered as if I wanted to force Unicode on everyone and everywhere. That's not my goal.

        I'd like to have a sandbox in Perl where Unicode is treated naturally.

        Let me try to make it clearer. I am not that familiar with Perl's history, but let me assume that at some stage there was no strict pragma. OK? Then someone thought it might be a good idea and found ways to implement it. Did that break any earlier code? I don't think so. But it made the strict pragma widely available.

        That is what I am talking about now. As far as I can see, module authors have no way to tell whether the module's caller uses utf8 or not. Am I correct? And would it break any earlier code if they had such a possibility? That would be a single step, IMHO :)

        What operating system can currently provide perl with a complete Unicode environment (%ENV, @ARGV, STDIN, STDOUT, STDERR, open, opendir, mkdir, rmdir, unlink, ...)?

        I have not investigated deeply how Unicode-proof Linux is for now, but at the system level I haven't had any complaints for years (Debian and Kubuntu). If you could give me some hints on how to determine Unicode use, I'd like to test it.

        Nnda, WK
      You say:
      Your locale is not used to determine the encoding of the source file.
      I answer: how sad! Why do I then define the locale in my system and ask Perl to use it?

      Why do you think the locale should be used to determine the source file encoding? What if you created a script using UTF-8 and I'm running it in a Latin-1 locale, or UTF-16?

        Then you override by using :encoding. That's already how it works with use open.
