PerlMonks  


Fellow devotees, on my path I have two related questions that come up from time to time, and I have not found clear answers to them.

The first is a simple, practical one. I need every possible input to and output from my script to be treated as UTF-8. So I made a test script which (by wild and hard ways) almost satisfies this criterion. Still, I can't get command-line arguments handled properly; I still have to use decode on @ARGV. So, the question: how should I get @ARGV treated properly, and is there a simpler way to handle input/output than what I did in the script below?

    #!/usr/bin/perl
    use strict;
    use warnings;
    use locale;
    use utf8;
    use Encode;

    binmode STDIN,  ":utf8";
    binmode STDOUT, ":utf8";

    # test non-default output too
    open(OUT, ">:utf8", "sample_out.txt") or die "Cannot open sample_out.txt: $!";

    print "Command line argument: \n";
    my $t = $ARGV[0];
    &output_string($t, decode("utf-8", $t));

    print "Enter some umlaut, please: ";
    $t = <STDIN>;    # õäöüšž is good input to test
    chomp($t);
    &output_string($t);

    print "Variable from source code: \n";
    $t = "õäöüšž";
    &output_string($t);

    print "String from file: \n";
    open(IN, "<:utf8", "sample.txt") or die "Cannot open sample.txt: $!";
    $t = <IN>;
    close(IN);
    chomp($t);
    &output_string($t);

    close(OUT);

    # Print the string, its uppercase form and an optional decoded variant
    # to STDOUT (with the length) and to the OUT file.
    sub output_string {
        my ($str)  = shift;
        my ($dstr) = shift || '';
        my ($ustr) = uc($str);
        print length($str), " $str $ustr $dstr\n\n";
        print OUT "$str $ustr $dstr\n";
    }

    __END__
    sample.txt contains the same string: õäöüšž
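For the first question, here is a minimal sketch of one common approach, assuming the shell passes arguments as UTF-8 bytes. The open pragma sets the layers for the standard handles and for later open() calls, and @ARGV is decoded once at startup. The helper name decode_args is hypothetical, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));    # STDIN/STDOUT/STDERR and default layers for open()
use Encode qw(decode);

# Hypothetical helper: turn UTF-8 encoded byte strings into character strings.
sub decode_args {
    return map { decode('UTF-8', $_) } @_;
}

# Assumption: the arguments arrive as UTF-8 bytes from the shell.
my @args = decode_args(@ARGV);

for my $arg (@args) {
    # length() and uc() now work on characters, not bytes.
    print length($arg), " $arg ", uc($arg), "\n";
}
```

Another route that needs no code changes is Perl's -C switch (for example `perl -CSDA script.pl`) or the PERL_UNICODE environment variable, where the A flag covers @ARGV.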

And the second one, assuming that my script is based on a correct understanding of the status quo in Perl: why is UTF-8 string handling so painful in Perl?

I will try to explain how I see things.

In Perl we have a good thing: pragmas. So when I want to tell my script to make everything behave as is customary in my location, I just say "use locale;". If my system locale is set up properly, it should carry over to my program too. In reality I see no such thing. In the example script above, it makes no difference whether I use locale or not. Did I mention that I have it set? With POSIX setlocale I checked that Perl sees my locale (et_EE.UTF-8), but it seems to have no influence on the input/output chain or on character handling. I hoped that maybe we had a bug in our system locale, but nothing changed when I used other locales with UTF-8 support. So I found that I can't rely on "use locale", and that is sad.
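As far as the documentation describes it, "use locale" governs things like collation, case mapping and number formatting, and does not add encoding layers to filehandles. The following is a rough sketch, under that assumption, of how a script could look up the locale's character set itself and apply it to the standard handles. It relies on the core modules POSIX, I18N::Langinfo and Encode; the fallback behaviour when the codeset is unknown is a choice made for this example.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(find_encoding);

# Adopt the locale from the environment (LANG, LC_ALL, ...).
my $ctype   = setlocale(LC_CTYPE, '');
my $codeset = langinfo(CODESET);    # e.g. "UTF-8" under et_EE.UTF-8

print "LC_CTYPE: ", ($ctype // 'not set'), "\n";
print "codeset:  $codeset\n";

# Apply the locale's encoding to the standard handles,
# but only if Encode recognises the codeset name.
if (my $enc = find_encoding($codeset)) {
    my $layer = ':encoding(' . $enc->name . ')';
    binmode $_, $layer for \*STDIN, \*STDOUT, \*STDERR;
}
```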

Then (and this was back in the last century) I found another pragma: utf8. It was a good day. But not for long, because it did not do what I had hoped. The pod says:

The "use utf8" pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope
So basically this does not change much; it is good for beginners like me, since I am not forced to separate program logic and content strings. It does not have the power to handle I/O. So this pragma did not help me either.
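A small illustration of that scope, assuming the source file itself is saved as UTF-8: the same literal is read as raw bytes without the pragma and as characters with it. Nothing about I/O changes, which is why the strings are not printed here.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8.
my $as_bytes = do { no utf8;  "õäöüšž" };    # parser sees the raw bytes
my $as_chars = do { use utf8; "õäöüšž" };    # parser decodes the literal

printf "without utf8: length %d\n", length $as_bytes;
printf "with utf8:    length %d\n", length $as_chars;
```

With a UTF-8 source file the first length counts bytes and the second counts characters, so the first number is the larger one.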

On the way to getting things to work with UTF-8 I learned some tricks and hacks, but I don't see a systematic solution. I would like Some Pragma that simply makes every string in its lexical scope Unicode and makes all I/O Unicode-proof as well. As a concept it seems so easy to me :) In the manuals I read something like "if the parser sees a wide character, the UTF-8 flag is turned on". Why? What harm could it do if the user declares a scope to be fully Unicode and every piece is treated as Unicode? No fears, no doubts, no need to check strings against some tests. It seems so simple to me that doubts arise, and I must admit: I am almost certainly missing some piece of the big picture.
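A short sketch of what that flag does and does not tell you, using only escapes so that the result does not depend on the source encoding. The point it tries to show, offered as an illustration rather than a full answer, is that the flag describes internal storage: two strings with the same characters compare equal whether or not it is set, while an undecoded byte string looks like any other string until something decodes it.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $latin = "\x{F5}";                   # õ as a character below 256
my $wide  = "\x{161}";                  # š, a code point above 255
my $bytes = "\xC3\xB5";                 # UTF-8 encoding of õ, not yet decoded
my $text  = decode('UTF-8', $bytes);    # the same õ after decoding

for my $item ( [ latin => $latin ], [ wide => $wide ],
               [ bytes => $bytes ], [ text => $text ] ) {
    my ($name, $str) = @$item;
    printf "%-6s length=%d utf8_flag=%s\n",
        $name, length($str), (utf8::is_utf8($str) ? 'on' : 'off');
}

# Same character, different internal representation, still equal.
print "latin eq text: ", ($latin eq $text ? "yes" : "no"), "\n";
```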

So, after using Super Search here and reading some pods, I would like to ask: what makes it so hard to implement a real Unicode pragma?

Nõnda, WK

In reply to Pragma to handle unicode characters by wanradt
