
Fellow devoted, on my path I have two related questions that come up from time to time, and I have not found clear answers to them.

The first is a simple, practical one. I need every possible input to and output from my script to be treated as UTF-8. So I made a test script which (by wild and hard ways) almost satisfies this criterion. I still can't get command-line arguments handled properly: I still have to use decode on @ARGV. So, the question: how should I get @ARGV treated properly, and is there a simpler way to handle input/output than what I did in the script below?

#!/usr/bin/perl

use strict;
use warnings;
use locale;
use utf8;
use Encode;

binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";

# test non-default output too
open(OUT, ">:utf8", "sample_out.txt") or die "Cannot open sample_out.txt: $!";

print "Command line argument: \n";
my $t = $ARGV[0];
&output_string($t, decode("utf-8", $t));

print "Enter some umlaut, please: ";
$t = <STDIN>; # õäöüšž is good input to test
chomp($t);
&output_string($t);

print "Variable from source code: \n";
$t = "õäöüšž";
&output_string($t);

print "String from file: \n";
open(IN, "<:utf8", "sample.txt") or die "Cannot open sample.txt: $!";
$t = <IN>;
close(IN);
chomp($t);
&output_string($t);

close(OUT);


sub output_string {
    my ($str) = shift;
    my ($dstr) = shift || '';
    my ($ustr) = uc($str);
    print length($str), " $str $ustr $dstr\n\n";
    print OUT "$str $ustr $dstr\n";
}


__END__

sample.txt contains the same string:
õäöüšž

And the second one, assuming that my script reflects a correct understanding of the status quo in Perl: why is UTF-8 string handling so painful in Perl?

I'll try to explain how I see things.

In Perl we have good things: pragmas. So when I want to tell my script to make everything behave as is customary for my location, I just say "use locale;". If my system locale is set up properly, it should carry over to my program too. In reality I can't see any such thing. In the example script above it makes no difference whether I use locale or not. Did I mention that I have it set? With POSIX setlocale I checked that Perl sees my locale (et_EE.UTF-8), but it seems to have no influence on the input/output chain or on character handling. I hoped that maybe we had a bug in our system locale, but nothing changed when I tried different locales with UTF-8 support. So I found that I can't rely on "use locale", and that is sad.
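To show the kind of check I mean, here is a minimal sketch, assuming the POSIX module is available; the %seen hash is just a name I made up for illustration. It only queries the locale and does not change it.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE LC_COLLATE);

# Calling setlocale with only a category queries the current value
# without changing it.
my %seen = (
    LC_CTYPE   => setlocale(LC_CTYPE),
    LC_COLLATE => setlocale(LC_COLLATE),
);

# Print one line per category, e.g. the locale name the C library reports.
for my $category (sort keys %seen) {
    printf "%-10s %s\n", $category, $seen{$category} // '(not available)';
}
```

What gets printed depends entirely on the environment the script runs in, so I give no expected output here.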

Then (and this was back in the last century) I found another pragma: utf8. It was a good day. But not for long, because it did not do what I hoped. The pod says:

The "use utf8" pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope

So basically this does not change much. It is good for beginners like me, since I am not forced to separate program logic from content strings. But it has no power over IO, so this pragma did not help me either.
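A small sketch of what I understand the pragma to cover, assuming the source file itself is saved as UTF-8; the variable names are only for illustration. The literal becomes a character string, but the bytes for output still have to be produced separately, here with Encode.

```perl
use strict;
use warnings;
use utf8;                    # string literals below are decoded from UTF-8
use Encode qw(encode);

my $chars = "õäöü";                      # four characters because of "use utf8"
my $bytes = encode("UTF-8", $chars);     # the octets a file handle would need

# Only numbers are printed, so no output layer is required for this check.
printf "characters: %d, bytes: %d\n", length($chars), length($bytes);
```

On my reading this prints "characters: 4, bytes: 8", since each of these letters takes two bytes in UTF-8.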

On the way to getting things to work with UTF-8 I learned some tricks or hacks, but I don't see a systematic solution. I'd like Some Pragma that simply makes every string in its lexical scope Unicode and makes all the IO Unicode-proof as well. As a concept it seems so easy to me :) In the manuals I read something like "if the parser sees a wide character, the UTF-8 flag is turned on". Why? What harm could it do if the user declares a scope to be fully Unicode and every piece is treated as Unicode? No fears, no doubts, no need to check strings against some tests. It seems so simple to me that doubts arise, and I must admit: I am almost surely missing some piece of the big picture.
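The flag behaviour I mean can be observed with a sketch like this one. It uses utf8::is_utf8, which is callable without loading the utf8 pragma; the two test strings are my own examples.

```perl
use strict;
use warnings;

my $narrow = "abc";             # every code point is below 256
my $wide   = "abc\x{263A}";     # contains a code point above 255

# Report length and internal flag state for each string.
for my $pair ([narrow => $narrow], [wide => $wide]) {
    my ($name, $string) = @$pair;
    printf "%-6s length=%d flag=%s\n",
        $name, length($string), utf8::is_utf8($string) ? "on" : "off";
}
```

As far as I can tell, the first string reports the flag as off and the second as on, which is exactly the per-string difference I would like a pragma to hide from me.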

So, after using Super Search here and after reading some pods, I'd like to ask: what makes it so hard to implement a real Unicode pragma?

Nõnda, WK

In reply to Pragma to handle unicode characters by wanradt
