PerlMonks  

Re^2: Modern best practices for multilingual regexp alphabetical character matching?

by mea (Initiate)
on Jan 26, 2009 at 00:22 UTC (#738846)


in reply to Re: Modern best practices for multilingual regexp alphabetical character matching?
in thread Modern best practices for multilingual regexp alphabetical character matching?

Dear Monks,

Sorry to introduce myself by hijacking an old thread, but I have some related questions. I am a complete beginner, and this topic confuses me the most. I didn't notice the problem until I started using the automatic match variables ($` $& $') and capturing parentheses. The output encoding, which had been fine until then, broke. Following your advice, and with some trial and error, I found that adding:

while (<>) { $_ = Encode::decode_utf8( $_ ); binmode STDOUT, ":utf8";
to the input loop corrects the encoding. It is strange that without these lines everything "looks" fine unless I use parentheses or the automatic match variables. Is the encoding wrong all along and somehow gets corrected on output? Or is it correct, and the automatic variables and parentheses break it? Given that I work only with UTF-8 files, should I make a habit of adding these lines every time I read input?

Best regards,

Martin


Re^3: Modern best practices for multilingual regexp alphabetical character matching?
by moritz (Cardinal) on Jan 26, 2009 at 07:27 UTC
    Everything "looks" fine until you try to extract substrings in some way. That's because, without decoding your data on input, the strings are handled as sequences of bytes, so a single non-ASCII character may translate to two bytes.

    Now if you extract some part of a string that you didn't decode first, you can accidentally rip apart these two bytes, leaving behind encoding garbage, which is usually not a good idea.
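To make the byte-versus-character point concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration, and the UTF-8 bytes are written as escapes so the snippet does not depend on how the source file is saved. It captures the last "character" of the same data twice, once undecoded and once decoded.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: the word "cafe" with an accented e (U+00E9),
# given as raw UTF-8 bytes. The accented letter occupies two bytes: C3 A9.
my $bytes = "caf\xC3\xA9";

# Undecoded: Perl sees 5 bytes, so a capture of "one character" grabs
# only the second byte of the accented letter.
my $byte_len = length $bytes;
my ($last_byte) = $bytes =~ /(.)\z/;

# Decoded: Perl sees 4 characters, and the same capture returns the
# whole letter.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;
my ($last_char) = $chars =~ /(.)\z/;

printf "bytes: %d, characters: %d\n", $byte_len, $char_len;
printf "undecoded capture: %vX\n", $last_byte;                    # lone continuation byte
printf "decoded capture, re-encoded: %vX\n", encode('UTF-8', $last_char);
```

The undecoded capture is a lone continuation byte, which is not valid UTF-8 on its own. That is the "encoding garbage" that shows up once captures or match variables are printed.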

    So I recommend properly decoding UTF-8 (and other character encodings) on input, and encoding the strings on output. Also add use utf8; if your source code contains non-ASCII string constants.
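A minimal sketch of that decode-on-input, encode-on-output pattern follows. It makes a few assumptions: it uses the :encoding(UTF-8) I/O layer rather than the Encode::decode_utf8 call and :utf8 layer from the question, it reads from an in-memory string standing in for a real file, and all variable names are hypothetical. The file itself must be saved as UTF-8 for the use utf8; literal to work.

```perl
use strict;
use warnings;
use utf8;    # non-ASCII string literals in this file are UTF-8

binmode STDOUT, ':encoding(UTF-8)';    # encode whatever is printed to STDOUT

# Hypothetical input: raw UTF-8 bytes for "naïve café\n", standing in for a file.
my $raw_input = "na\xC3\xAFve caf\xC3\xA9\n";

# Decode on input: the layer turns bytes into characters as lines are read.
open my $in, '<:encoding(UTF-8)', \$raw_input
    or die "cannot open input: $!";

# Encode on output: the layer turns characters back into bytes when writing.
my $raw_output = '';
open my $out, '>:encoding(UTF-8)', \$raw_output
    or die "cannot open output: $!";

my @words;
while (my $line = <$in>) {
    # On decoded text, \w matches accented letters and each capture
    # holds whole characters.
    while ($line =~ /(\w+)/g) {
        push @words, $1;
        print {$out} "[$1]\n";
    }
}
close $in;
close $out;

# Thanks to "use utf8", this literal is a 4-character string and compares
# equal to the decoded input word.
my $target = 'café';
my $hits   = grep { $_ eq $target } @words;
print "found $hits match(es) for $target\n";
```

For a real script, the same layers can be given to open for files, or set with binmode on STDIN and STDOUT, so the body of the program only ever deals with decoded character strings.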

      Thanks for the answer.

      So basically I am safe as long as I use these two lines, plus "use utf8;" at the top of my script, every time. This has really been the most confusing thing so far. The otherwise excellent "Learning Perl" doesn't mention these problems at all, and some of its examples don't work correctly with UTF-8 characters. That is fine for English-speaking beginners, but people working with other languages have to deal with this issue right from the start. It would have saved me a lot of time if it had simply said "for non-English languages or UTF-8, add this to your script". Well, at least now I know and can go back to learning the "proper" stuff... Thanks again,

      Best Regards,

      Martin
