
Regexp and Linux (is it utf issue?)

by gwene (Initiate)
on Jun 28, 2013 at 06:02 UTC (#1041149)
gwene has asked for the wisdom of the Perl Monks concerning the following question:

I recently switched from using ActiveState Perl on a Windows box to a Linux machine. At first my Perl scripts (I'm a heavy user of regexps) seemed to work fine. But as I started processing more text files ("Writer" HTML files from LibreOffice) and adding more regexp code to my script, I started noticing weird goings-on.

I get bizarre characters, like a strange †instead of a dash. So I thought it might have to do with encoding, but I wasn't too worried if it was just an occasional weird character popping up. But then my regexp code itself seemed to go haywire -- matching things it shouldn't.

I am very good at the regexp language (the mainstay of my toolset :) but I am so confused about encoding. I've tried different modifiers at the end of my regexp statements, like /u or /a or /d. And I've tried
use utf8; use Encode;
And for my filehandles:
binmode(FILE, ":utf8");

Can you please help? I believe my ignorance when it comes to encoding is getting in the way. I simply want my regexps to work the way they used to on a Windows box. If it means I can only use ASCII, that's fine with me. I just need to know how :)


Re: Regexp and Linux (is it utf issue?)
by aitap (Deacon) on Jun 28, 2013 at 06:21 UTC

    When you read text files, you should decode them. This is easy using PerlIO layers, the Encode module and the three-argument form of open:

    use Encode;
    open my $fh, "<:encoding(whatever)", $filename or die $!;
    This way, Perl decodes everything automatically, and you only have to work with characters, not bytes.
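A minimal sketch of the difference this makes, assuming the input really is UTF-8 (the sample text, the temp file and the variable names are made up for illustration). It reads the same file once as raw bytes and once through a decoding layer, then shows what happens if the bytes are decoded with the wrong encoding, which is one plausible source of the "â€" sequence described in the question:

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Hypothetical sample file: the UTF-8 bytes for "a", EN DASH (U+2013), "b".
my ($tmp, $filename) = tempfile(UNLINK => 1);
binmode $tmp;                              # write raw bytes, no layer
print {$tmp} "a\xE2\x80\x93b";
close $tmp or die "close: $!";

# Read as bytes: the dash arrives as three separate bytes.
open my $raw, '<:raw', $filename or die "open: $!";
my $bytes = do { local $/; <$raw> };
close $raw or die "close: $!";

# Read through a decoding layer: the dash is a single character.
open my $fh, '<:encoding(UTF-8)', $filename or die "open: $!";
my $chars = do { local $/; <$fh> };
close $fh or die "close: $!";

# Decode the same bytes with the wrong encoding (cp1252 here):
# each byte of the dash becomes its own unrelated character.
my $wrong = decode('cp1252', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
printf "decoded as cp1252: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $wrong;
print "regexp matches a single dash character\n"
    if $chars =~ /^a\x{2013}b$/;
```

With the decoded string, a pattern such as /\x{2013}/ or a character class sees one dash; with the raw bytes it would have to match three bytes instead, which is the kind of thing that makes regexps appear to misbehave.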

    When you write text to a filehandle that has no encoding layer, any character above 0xFF triggers the famous warning: "Wide character in (sub name)...". You need to encode on output using the same technique: open my $write, ">:encoding(whatever)", $filename or die $!;. For output you can also use the :utf8 layer, because such strings are already stored internally as valid UTF-8.
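The output side can be sketched the same way. This assumes UTF-8 is the encoding you want on disk; the sample string and file are hypothetical. The characters are encoded by the layer on the way out, and reading the file back raw shows the resulting bytes:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical output file for the demonstration.
my ($tmp, $filename) = tempfile(UNLINK => 1);
close $tmp or die "close: $!";

# A character string containing EN DASH (U+2013), which is above 0xFF.
my $text = "dash \x{2013} here";

# The layer encodes characters to UTF-8 bytes as they are written,
# so no "Wide character" warning is raised.
open my $out, '>:encoding(UTF-8)', $filename or die "open: $!";
print {$out} $text;
close $out or die "close: $!";

# Read the file back as raw bytes to see what actually landed on disk.
open my $raw, '<:raw', $filename or die "open: $!";
my $bytes = do { local $/; <$raw> };
close $raw or die "close: $!";

printf "%d characters became %d bytes\n", length($text), length($bytes);
```

The one dash character takes three bytes in UTF-8, so the byte count is two higher than the character count.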

    Do not use the :utf8 I/O layer to decode text, though: it simply sets the "character" flag on the strings read from filehandles without any checks, and this is generally unsafe. See "UTF8 related proof of concept exploit released at T-DOSE". Use :encoding(UTF-8) for input instead, which validates what it reads.
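As a rough illustration of what "with checks" means, here is a sketch that uses Encode's decode with the FB_CROAK check so malformed input is rejected instead of being passed along. The helper name strict_decode and the sample byte strings are made up for this example:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded string, or undef if the
# bytes are not well-formed UTF-8. LEAVE_SRC keeps decode() from
# modifying the caller's byte string.
sub strict_decode {
    my ($bytes) = @_;
    my $chars = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $chars;
}

my $good = "dash \xE2\x80\x93 here";   # well-formed UTF-8
my $bad  = "abc\xFFdef";               # 0xFF never occurs in UTF-8

for my $sample ($good, $bad) {
    my $decoded = strict_decode($sample);
    printf "%d bytes: %s\n", length($sample),
        defined $decoded ? "decoded to " . length($decoded) . " characters"
                         : "rejected as malformed";
}
```

Whether you want to die, skip the record, or substitute a replacement character on bad input is a design choice; the point is only that the decision is made explicitly rather than left to unchecked data.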
