Re^3: any use of 'use locale'? (source encoding)

by ikegami (Pope)
on Nov 20, 2009 at 17:48 UTC (#808488)


in reply to Re^2: any use of 'use locale'?
in thread any use of 'use locale'?

    #!/usr/bin/perl
    use locale;
    print uc("abcõäöüšž"), "\n";

Wait a sec, š and ž aren't found in iso-8859-1. The file you describe can't possibly exist. If the file is encoded using an encoding other than iso-8859-1, you need to tell Perl. "use utf8;" tells Perl that the source is encoded using UTF-8.

That applies to the OP of this thread too, although it won't change the outcome.

Useful commands:

    # Source is UTF-8
    use utf8;

    # Appropriate de/encode data going through STDIN/OUT/ERR.
    use open ':std', ':locale';
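For readers wondering what ':locale' has to work with, here is a minimal sketch that asks the current locale for its character encoding. It assumes a POSIX-like system where the core module I18N::Langinfo is available; the name printed depends entirely on your environment, and this only illustrates the kind of information a locale carries, not how the open pragma is implemented internally.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Adopt the character-type locale from the environment (LANG, LC_ALL, ...).
setlocale(LC_CTYPE, '');

# Ask the locale which character encoding it uses.
my $codeset = langinfo(CODESET);
print "Locale character encoding: $codeset\n";
```

On a system configured with a UTF-8 locale this would typically report UTF-8. Note that it describes I/O conventions only; it says nothing about how the source file itself was saved.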


Re^4: any use of 'use locale'? (source encoding)
by wanradt (Scribe) on Nov 20, 2009 at 22:18 UTC

    I am aware of "use utf8", both in the bug report and in this case here. Because "use utf8" makes more noise (it adds the warning "Wide character in print at...") and does not help, I left it out. So in the bug report I intentionally left "use utf8" out, because my locale is UTF-8. I consider it a real bug if I have a UTF-8 locale and say "use locale", yet Perl does not follow this instruction in every possible way.

    Btw, I find that "use utf8" is a waste of a good name if it only means "something in this code is in UTF-8". "use utf8" should say: within this pragma, anything and everything you can imagine is UTF-8. Or "use locale" should spread that message, if the coder wants to depend on the locale.

    Handling all kinds of UTF-8 through a lot of different things (open, use, locale, -C, binmode, special keys in regexes) always makes me feel sick. Really. I am sorry, but through the 12 years I have used Perl, I have been waiting for the Unicode things to settle down and become as simple as this: "use utf8" and everything works. Or similar. I still see hacks but no systematic solution.

    Yesterday, before posting, I searched recent nodes on this topic and found this node: 801876. I just hope that I really misunderstood the point, but if UTF-8 defines something as a digit (\d) or word character (\w), then it should be like that in Perl too...

    Nnda, WK

      adds warning "Wide character in print at..."

      You can't output characters to STDOUT without instructing Perl how to convert those characters into bytes. I provided the fix for that at the bottom of my previous post.
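To make the characters-to-bytes step concrete, here is a minimal sketch. It writes to an in-memory file instead of STDOUT (an assumption made only so the resulting bytes can be inspected), and it names UTF-8 explicitly rather than taking the encoding from the locale.

```perl
use strict;
use warnings;

# Two characters, "šž", written as escapes so this file stays pure ASCII.
my $text = "\x{161}\x{17E}";

# The :encoding(UTF-8) layer tells Perl how to turn characters into bytes.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$out} $text;
close($out) or die "close failed: $!";

printf "%d characters became %d bytes\n", length($text), length($buffer);
```

The same layer can be applied to the standard handles with the open pragma or with binmode, which is what the fix quoted in this thread does.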

      So, in bug report i intentionally let "use utf8" out, cause my locale is UTF8.

      Your locale is not used to determine the encoding of the source file. You may see uc("abcõäöüšž") in your editor, but without "use utf8;" Perl reads each byte of the file as one character, so you told Perl uc("abcÃµÃ¤Ã¶Ã¼Å¡Å¾"). Check the length() of the string for fun...
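The length() check suggested above can be sketched as follows. To keep the example independent of how this file is saved, the string is built from escapes, and Encode is used to produce the byte string that Perl would see in a UTF-8 source file without "use utf8;". That substitution is an assumption of this sketch, not something from the original bug report.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The nine characters "abcõäöüšž", written with escapes.
my $chars = "abc\x{F5}\x{E4}\x{F6}\x{FC}\x{161}\x{17E}";

# The same text as raw UTF-8 bytes: what an undeclared UTF-8 source contains.
my $bytes = encode('UTF-8', $chars);

print length($chars), "\n";   # 9 characters
print length($bytes), "\n";   # 15 bytes

# uc() on the character string uppercases all nine characters;
# uc() on the byte string only touches the ASCII letters "abc".
my $upper_chars = uc($chars);
my $upper_bytes = uc($bytes);
```

The byte string is longer because each of the six accented letters takes two bytes in UTF-8, and uc() has no way to know that those byte pairs were meant to be single letters.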

      but if UTF-8 defines something being digit (\d) or word character (\w), then should it be like that in perl too...

      There are so many problems with your statement.

      • UTF-8 doesn't define anything of the sort.

      • Unicode does define digit and letter properties, but not what Perl allows in identifiers (\w).

      • Perl can match those two Unicode properties (and the other hundred) through \p.

      • Why should Perl only recognize Unicode? What about locales? You yourself want it to recognize locales. What about POSIX? What about backwards compatibility? What about the 99% of people who use \d and \w to mean /[0-9]/ and /[a-zA-Z0-9_]/? Under Unicode, they would match much more.

      So yeah, Perl allows you to match the Unicode properties. It also allows you to do other things. Sorry, but you don't get to break just about every program that uses \d and \w.
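As an illustration of matching Unicode properties through \p, here is a small sketch. The three sample characters are chosen for this example only; \p{Nd} (decimal digit) and \p{L} (letter) are standard Unicode properties, and the explicit class [0-9] is shown for contrast.

```perl
use strict;
use warnings;

my $ascii_digit  = "7";
my $arabic_digit = "\x{0663}";   # ARABIC-INDIC DIGIT THREE
my $letter       = "\x{017E}";   # LATIN SMALL LETTER Z WITH CARON

for my $char ($ascii_digit, $arabic_digit, $letter) {
    printf "U+%04X  [0-9]: %-3s  \\p{Nd}: %-3s  \\p{L}: %-3s\n",
        ord($char),
        ($char =~ /\A[0-9]\z/  ? "yes" : "no"),
        ($char =~ /\A\p{Nd}\z/ ? "yes" : "no"),
        ($char =~ /\A\p{L}\z/  ? "yes" : "no");
}
```

Writing the property or the explicit class states exactly which meaning is wanted, whatever rules \d and \w happen to follow for a given string.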

        You can't output characters to STDOUT without instructing Perl how to convert those characters into bytes. I provided the fix for that at the bottom of my previous post.

        I don't consider myself a master of this topic, and here in the monastery I have seen how deeply you handle the Unicode area. So forgive me some inaccuracies, like mixing up Unicode and UTF-8, and maybe others. I am trying to give a picture that has grown over the years. Nowadays, most of my scripts have at least this block:

            use utf8;
            use open ':std' => ':encoding(UTF-8)';
            use locale;

        While testing how to cover every possible hole, I ended up with something like this:

            use utf8;
            use CGI qw( -utf8 );
            use open IO => ':utf8';
            use open ':std' => ':encoding(UTF-8)';
            binmode STDIN, ":utf8";
            binmode STDOUT, ":utf8";
            my $dbh = DBI->connect(
                "dbi:mysql:dbname=$db;host=localhost",
                $user, $pwd,
                { mysql_enable_utf8 => 1 },
            );

        Truth is, I am not sure which of the lines above are redundant ;) and there are still areas not covered, I think. ENV, ARGV and file names are still not safe?
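On the ARGV and ENV question, one possible approach is to decode those values by hand. This is only a sketch: decode_utf8_list is a hypothetical helper named for this example, and it assumes the incoming bytes really are UTF-8, which depends on the terminal and environment.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn a list of UTF-8 byte strings (such as @ARGV
# or values from %ENV) into character strings.
sub decode_utf8_list {
    return map { decode('UTF-8', $_) } @_;
}

# Stand-in for @ARGV: the bytes a UTF-8 terminal would send for "šž".
my @raw_args = ("\xC5\xA1\xC5\xBE");
my ($arg)    = decode_utf8_list(@raw_args);

printf "%d bytes in, %d characters out\n", length($raw_args[0]), length($arg);
```

In a real script the call would be something like my @args = decode_utf8_list(@ARGV); the -C switch mentioned earlier in the thread is another way to have Perl treat @ARGV as UTF-8.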

        What am I looking for? Instead of reading tons of documentation to put together the whole (ugly) picture above, I'd like perluniintro to say something like this:

        If you have a properly set up UTF-8 system, you can just say

            use utf8_everywhere; # because "use utf8" means something narrower

        and relax. If your situation needs something more complex, continue reading...

        Or the other way, which would be logical to me:

        use locale;

        I understand there are pieces I don't see. But how could "use locale" break code that does not use it? Or if code uses it, what does it mean by "use locale" if it doesn't really want to use the locale?

        You say:

        Your locale is not used to determine the encoding of the source file.

        I answer: how sad! Why then do I define the locale in my system and ask Perl to use it?

        You can't output characters to STDOUT without instructing Perl how to convert those characters into bytes.

        I answer: but I did! If I have a properly defined system locale and I ask Perl to use it, then Perl should know how to convert characters. Or what am I missing here?

        I'd like to be able to easily define a scope where everything is treated as UTF-8. If I say "use locale", then I mean: apply my locale to my code, whatever this locale is. So any data in this scope is treated as the locale requires. People who need \w == /[a-zA-Z0-9_]/ don't have to use the locale pragma, or can use a locale that does not define it otherwise. But IMHO, where a different approach is needed, it would be easy to adapt.

        I understand it is a wider problem. But for now there is nothing for developers to rely on. For example, for CGI I have to say explicitly that I need UTF-8. For DBI, the same. And so on. Why? Because there is no standard place they could look for it automagically, AFAIU. If there were one big "use utf8_everywhere" that hoists a big flag, every module author could rely on it. Or?

        Such is the naive picture I have. I'd like to see the weak places in it. I hope I answered most of the questions raised, but to be clear:

        Why should Perl only recognize Unicode.

        Not only. But if I ask it to use Unicode, it should. Simply and everywhere.

        What about locales? You yourself want it to recognize locales.

        Yes, I do. And I don't see a contradiction. Whatever encoding the locale uses, "use locale" should also use it within its scope.

        What about POSIX?

        Sorry, that is over my head.

        What about backwards compatibility?

        This is my weak point: I don't see how it could break something. I just don't see it, though I understand it may. That is why I "made" the new pragma "utf8_everywhere".

        What about 99% of the people who use \d and \w to mean /[0-9]/ and /[a-zA-Z0-9_]/?

        This seems simple to me: they a) don't "use locale" or b) "use locale" with a proper system locale. Other uses seem buggy to me.

        Nnda, WK
