
Regexp and Linux (is it utf issue?)

by gwene (Initiate)
on Jun 28, 2013 at 06:02 UTC ( #1041149=perlquestion )
gwene has asked for the wisdom of the Perl Monks concerning the following question:

I recently switched from using ActiveState Perl on a Windows box to a Linux machine. At first my Perl scripts (I'm a heavy user of regexp) seemed to work fine. But as I started processing more text files ("writer" HTML files from LibreOffice) and started adding more regexp code to my script, I started noticing weird goings-on.

I get bizarre characters, like a strange character instead of a dash. So I thought that it may have to do with encoding, but wasn't too worried if it's just an occasional weird character popping up. But then my regexp code itself seemed to go haywire -- matching things it shouldn't.

I am very good at regexp language (the mainstay of my toolset :) but I am so confused about encoding. I've tried different modifiers at the end of my regexp statements, like /u, /a, or /d. And I've tried:
use utf8; use Encode;
And for my filehandles:
binmode(FILE, ":utf8");

Can you please help? I believe my ignorance when it comes to encoding is getting in the way. I simply want my regexp to work the way it used to on a Windows box. If it means I can only use ASCII, that's fine with me. I just need to know how :)


Replies are listed 'Best First'.
Re: Regexp and Linux (is it utf issue?)
by aitap (Deacon) on Jun 28, 2013 at 06:21 UTC

    When you read text files, you should decode them. This is easy using PerlIO layers, the Encode module, and the three-argument form of open:

    use Encode;
    open my $fh, "<:encoding(whatever)", $filename or die $!;
    This way, Perl decodes everything automatically, and you only have to work with characters, not bytes.
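As a minimal sketch of the difference between bytes and characters, the script below writes a small sample file and reads it back twice, once raw and once decoded. It assumes the file is UTF-8 (substitute the real encoding for "whatever" above); the sample text, the temp file, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data: the UTF-8 bytes for "cafe" with an accented e,
# then an EN DASH (U+2013), then "bar".
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
binmode $tmp_fh;                                  # write raw bytes, no layer
print {$tmp_fh} "caf\xC3\xA9 \xE2\x80\x93 bar\n";
close $tmp_fh or die $!;

# Read the file as raw bytes: every byte counts as one "character".
open my $raw, "<:raw", $filename or die $!;
my $bytes = <$raw>;
close $raw;

# Read the same file through a decoding layer: multi-byte sequences
# become single characters.
open my $fh, "<:encoding(UTF-8)", $filename or die $!;
my $chars = <$fh>;
close $fh;

chomp($bytes, $chars);
my $byte_len   = length $bytes;    # counts bytes
my $char_len   = length $chars;    # counts characters
my $dash_found = $chars =~ /\x{2013}/ ? 1 : 0;   # the dash is one character

print "bytes: $byte_len, characters: $char_len, dash matched: $dash_found\n";
```

On the raw string the dash is three separate bytes, so a regexp written in terms of characters (a dash, `\w`, `.`) sees something different from what is on screen. On the decoded string the same regexp sees one dash.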

    When you write text to files, printing characters above 0xFF to a handle that has no encoding layer produces the famous warning: "Wide character in (sub name)...". You need to encode them using the same technique: open my $write, ">:encoding(whatever)", $filename or die $!; You can use the :utf8 layer to encode characters on output because they are internally stored as valid UTF-8.
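A rough sketch of both cases, assuming UTF-8 as the output encoding: the first print goes to a handle with no layer and triggers the warning, the second goes through an encoding layer and stays quiet. The temp files and the `$SIG{__WARN__}` capture are only there to make the behaviour visible in a self-contained script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $text = "dash: \x{2013}\n";    # contains one character above 0xFF

# Case 1: no encoding layer. Perl warns "Wide character in print".
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my ($plain, $plain_name) = tempfile(UNLINK => 1);
    print {$plain} $text;
    close $plain or die $!;
}

# Case 2: an explicit encoding layer. Characters are encoded on the way out.
my ($tmp, $filename) = tempfile(UNLINK => 1);
close $tmp or die $!;
my @quiet;
open my $write, ">:encoding(UTF-8)", $filename or die $!;
{
    local $SIG{__WARN__} = sub { push @quiet, @_ };
    print {$write} $text;
}
close $write or die $!;

# Look at what actually landed on disk.
open my $raw, "<:raw", $filename or die $!;
my $bytes = do { local $/; <$raw> };
close $raw;

print "warnings without layer: ", scalar(@warnings), "\n";
print "warnings with layer:    ", scalar(@quiet), "\n";
```

The file written through the layer contains the three UTF-8 bytes for the dash, which is what a later `<:encoding(UTF-8)` read expects.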

    Do not use the :utf8 I/O layer to decode text: it simply sets the "character" flag on the strings read from filehandles without any checks, and this is generally unsafe (see "UTF8 related proof of concept exploit released at T-DOSE"). Use :encoding(UTF-8) for input instead, since it checks what it reads.
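To illustrate what "checks" means here, the sketch below validates byte strings explicitly with Encode's `decode` and `FB_CROAK` rather than through a file layer; this is a stand-in chosen to keep the example deterministic, not what the layers do internally. `strict_decode` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded string, or undef if the
# bytes are not well-formed UTF-8.
sub strict_decode {
    my ($bytes) = @_;    # work on a copy; decode() with a CHECK value may modify its argument
    my $chars = eval { decode("UTF-8", $bytes, Encode::FB_CROAK) };
    return $chars;
}

# Well-formed UTF-8: "cafe" with an accented e, stored as two bytes.
my $good = strict_decode("caf\xC3\xA9 au lait");

# Latin-1 bytes mislabelled as UTF-8: a lone 0xE9 is not a valid sequence.
my $bad = strict_decode("caf\xE9 au lait");

print defined $good ? "good input decoded\n" : "good input rejected\n";
print defined $bad  ? "bad input decoded\n"  : "bad input rejected\n";
```

A checked decode like this refuses the mislabelled input instead of handing the regexp engine a string whose internal representation is broken.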

Node Type: perlquestion [id://1041149]
Approved by davido