Re: Regexp and Linux (is it utf issue?)

by aitap (Deacon)
on Jun 28, 2013 at 06:21 UTC

in reply to Regexp and Linux (is it utf issue?)

When you read text files, you should decode them. This is easy with PerlIO layers, the Encode module and the three-argument form of open (replace "whatever" with the file's actual encoding, for example UTF-8):

    use Encode;
    open my $fh, "<:encoding(whatever)", $filename or die $!;

This way, Perl decodes everything automatically, and you only have to work with characters, not bytes.
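
To make the difference between characters and bytes concrete, here is a minimal sketch. It assumes the file is UTF-8 (the original post leaves the encoding as "whatever"), and the temporary file and its contents are made up purely for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Illustrative sample file: the UTF-8 bytes for "naïve" (6 bytes, 5 characters).
my ($tmp, $filename) = tempfile(UNLINK => 1);
binmode $tmp;                          # write raw bytes, no layer
print {$tmp} "na\xC3\xAFve";
close $tmp or die "Cannot close $filename: $!";

# Read through a decoding layer: the string holds characters.
open my $fh, '<:encoding(UTF-8)', $filename
    or die "Cannot open $filename: $!";
my $chars = <$fh>;
close $fh;

# Read the same file as raw bytes for comparison.
open my $raw, '<:raw', $filename
    or die "Cannot open $filename: $!";
my $bytes = <$raw>;
close $raw;

printf "decoded: %d characters, raw: %d bytes\n",
    length($chars), length($bytes);

# "." matches one character, so it only spans the "ï" in the decoded string.
my $decoded_matches = $chars =~ /^na.ve$/ ? 1 : 0;
my $raw_matches     = $bytes =~ /^na.ve$/ ? 1 : 0;
print "regex on decoded: $decoded_matches, on raw: $raw_matches\n";
```

In the raw string the "ï" is two separate bytes, so a pattern written in terms of characters no longer lines up with the data. Mismatches of this kind are one reason a regexp can behave differently from one system to another.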

When you write text to files, printing characters above code point 255 to a filehandle that has no encoding layer produces the famous warning "Wide character in print" (or whichever function did the writing). You need to encode them using the same technique: open my $write, ">:encoding(whatever)", $filename or die $!;. For output you can also use the :utf8 layer, because characters are already stored internally as valid UTF-8.
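
The writing side can be sketched the same way. UTF-8 is again an assumed encoding, and the file name and sample text are illustrative only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $filename) = tempfile(UNLINK => 1);
close $tmp or die "Cannot close $filename: $!";

# 8 characters; U+263A needs three bytes in UTF-8.
my $text = "smile: \x{263A}";

# The encoding layer turns characters into bytes on the way out,
# so no "Wide character" warning is issued.
open my $write, '>:encoding(UTF-8)', $filename
    or die "Cannot open $filename: $!";
print {$write} $text;
close $write or die "Cannot close $filename: $!";

# Inspect what actually landed on disk.
open my $raw, '<:raw', $filename
    or die "Cannot open $filename: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

printf "%d characters written as %d bytes\n",
    length($text), length($bytes);
```

The 8 characters end up as 10 bytes on disk, the last three being the UTF-8 encoding of U+263A.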

Do not use the :utf8 layer to decode text: it simply sets the "character" flag on strings read from the filehandle without checking that the bytes are valid UTF-8, which is generally unsafe (see "UTF8 related proof of concept exploit released at T-DOSE"). Use :encoding(UTF-8) for input instead, which validates what it reads.
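
If the bytes are already in memory rather than coming from a filehandle, the same kind of strict checking is available through Encode::decode with a CHECK argument. The helper name decode_strict below is hypothetical, and this is only a sketch of one way to reject malformed input, not something taken from the original post.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded string, or undef when the
# input is not valid UTF-8. FB_CROAK makes decode() die on bad input;
# LEAVE_SRC keeps decode() from modifying the caller's buffer.
sub decode_strict {
    my ($bytes) = @_;
    my $chars = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $chars;
}

my $good = decode_strict("caf\xC3\xA9");   # valid UTF-8 for "café"
my $bad  = decode_strict("caf\xFF");       # 0xFF never occurs in UTF-8

print "valid input:   ", (defined $good ? "decoded" : "rejected"), "\n";
print "invalid input: ", (defined $bad  ? "decoded" : "rejected"), "\n";
```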

