Pragma to handle unicode characters
by wanradt (Scribe) on Dec 21, 2008 at 23:57 UTC
wanradt has asked for the
wisdom of the Perl Monks concerning the following question:
Fellow devoted, on my path I have two related questions that come up from time to time, and I have not found clear answers to them.
The first is a simple, practical one. I need every possible input to and output from my script to be treated as UTF-8. So I made a test script which (through wild and hard ways) almost satisfies this criterion. Still, I can't get command line arguments handled properly; I still had to use decode on @ARGV. So, the question: how should I get @ARGV treated properly, and is there a simpler way to handle input/output than I did in the script below?
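For reference, a minimal sketch of the kind of setup this first question is about, using only core modules. The helper name `decode_args` is hypothetical, and the sketch assumes that the terminal and the command line really deliver UTF-8 bytes; it is an illustration of the problem, not a claim about the one right way.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # the source file itself is UTF-8
use open qw(:std :encoding(UTF-8));     # STDIN/STDOUT/STDERR plus default layers for open()
use Encode qw(decode);

# Hypothetical helper: turn raw command-line bytes into character strings.
# Assumes the arguments arrive as UTF-8 encoded bytes.
sub decode_args {
    my @raw = @_;
    return map { decode('UTF-8', $_) } @raw;
}

# @ARGV is not touched by the open pragma, so it is decoded by hand here.
my @args = decode_args(@ARGV);

# length() now counts characters, not bytes.
printf "%s has %d character(s)\n", $_, length $_ for @args;
```

Perl's `-C` command line switch (for example `perl -CSDA script.pl`) and the `PERL_UNICODE` environment variable, both documented in perlrun, can cover the standard handles, the default layers and @ARGV without the manual decode, provided the script is always launched that way.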
And the second one, assuming that my script is based on a correct understanding of the status quo in Perl: why is UTF-8 string handling so painful in Perl?
I will try to explain how I see things.
In Perl we have good things: pragmas. So when I want to tell my script to make everything behave as is usual for my location, I just say "use locale;". If my system locale is set up properly, it should carry over to my program too. In reality I can't see any such thing. In this example script it makes no difference whether I use locale or not. Did I mention that I have it set? With POSIX setlocale I checked that perl sees my locale (et_EE.UTF-8), but it seems to have no influence on the input/output chain or on character handling. I hoped that maybe there was a bug in our system locale, but nothing changed when I tried different locales with UTF-8 support. So I found that I can't rely on "use locale", and that is sad.
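The check described above can be sketched roughly like this. The sub name `locale_info` is hypothetical, and the values it returns depend entirely on the environment the script runs in. It reports the locale perl sees, the codeset that locale names, and the PerlIO layers on STDOUT while "use locale" is in effect.

```perl
use strict;
use warnings;
use locale;
use POSIX qw(setlocale LC_ALL LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: collect what perl knows about the current locale.
sub locale_info {
    setlocale(LC_ALL, '');                    # adopt the locale from the environment
    my $ctype   = setlocale(LC_CTYPE);        # query only, e.g. et_EE.UTF-8
    my $codeset = langinfo(CODESET());        # e.g. UTF-8
    my @layers  = PerlIO::get_layers(*STDOUT);
    return (
        ctype   => defined $ctype ? $ctype : '(unknown)',
        codeset => $codeset,
        layers  => join(' ', @layers),
    );
}

my %info = locale_info();
print "LC_CTYPE : $info{ctype}\n";
print "codeset  : $info{codeset}\n";
# "use locale" does not add an encoding layer, so usually no utf8 layer shows up here.
print "STDOUT   : $info{layers}\n";
```

The open pragma also accepts a `:locale` layer (`use open qw(:std :locale);`), which takes the encoding for I/O from the locale; that is a separate mechanism from "use locale".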
Then (and this was back in the last century) I found another pragma: utf8. It was a good day. But not for long, because it did not do what I hoped. The pod says:
The "use utf8" pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope
So basically this does not change much, and is good for beginners like me, since I am not forced to separate program logic and content strings. It does not have the power to handle IO. So this pragma did not help me either.
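A small sketch of what that pod sentence means in practice, assuming this file is saved as UTF-8. The sub names are hypothetical. The same literal is parsed as one character inside the scope of "use utf8" and as two bytes outside it, and nothing here changes how any filehandle reads or writes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helpers; each one parses the same literal under a different pragma state.
sub length_with_utf8 {
    use utf8;                 # parser decodes the literal: one character
    return length "ä";
}

sub length_without_utf8 {
    no utf8;                  # parser keeps the raw bytes: two of them in UTF-8
    return length "ä";
}

sub encoded_length {
    use utf8;
    return length encode('UTF-8', "ä");   # back to bytes for output: two again
}

print "with use utf8    : ", length_with_utf8(), "\n";
print "without use utf8 : ", length_without_utf8(), "\n";
print "encoded to UTF-8 : ", encoded_length(), "\n";
```

Only numbers are printed, so the example does not depend on any output layer being set.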
On the way to getting things to work with UTF-8 I learned some tricks and hacks, but I don't see a systematic solution. I'd like to see Some Pragma that just makes every string in its lexical scope behave as Unicode and makes all IO Unicode-proof as well. As a concept it seems so easy to me :) In the manuals I read something like "if the parser sees a wide character, the UTF-8 flag is turned on". Why? What harm could it do if the user declares a scope to be fully Unicode and every piece is treated as Unicode? No fears, no doubts, no need to check strings against some tests. It seems so simple to me that doubts arise, and I must admit: I am almost surely missing some piece of the big picture.
So, after using Super Search here too and after reading some pods, I'd like to ask: what makes it so hard to implement a real Unicode pragma?