PerlMonks  

Re^3: Modern best practices for multilingual regexp alphabetical character matching?

by moritz (Cardinal)
on Jan 26, 2009 at 07:27 UTC (#738869)


in reply to Re^2: Modern best practices for multilingual regexp alphabetical character matching?
in thread Modern best practices for multilingual regexp alphabetical character matching?

Everything "looks" fine until you try to extract substrings in some way. That's because, without decoding your data on input, the strings are handled as sequences of bytes, so a single non-ASCII character (an accented letter, for example) translates to two bytes.

Now if you extract some part of a string that you didn't decode first, you can accidentally rip these two bytes apart, leaving behind encoding garbage - usually not a good idea.
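A minimal sketch of that failure mode. The sample word and the variable names are assumptions chosen for illustration; only the core Encode module is used. It compares substr on the raw bytes with substr on the decoded string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "\xC3\xA4" is the UTF-8 byte sequence for U+00E4 (a with diaeresis).
# The sample word is made up for this illustration.
my $bytes = "B\xC3\xA4r";              # undecoded input: 4 bytes
my $chars = decode('UTF-8', $bytes);   # decoded: 3 characters

my $byte_len = length $bytes;
my $char_len = length $chars;

# "The first two characters" of the undecoded string cuts the two-byte
# sequence in half and leaves a lone 0xC3 byte behind.
my $broken = substr $bytes, 0, 2;
# The same call on the decoded string keeps the character whole.
my $intact = substr $chars, 0, 2;

# Small helper (hypothetical name) to show code points as hex.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, shift }

printf "bytes: %d, characters: %d\n", $byte_len, $char_len;
printf "undecoded substr: %s\n", hex_dump($broken);
printf "decoded substr:   %s\n", hex_dump($intact);
```

The undecoded substr ends in a dangling lead byte, which is exactly the kind of garbage that shows up later as a broken character on output.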

So I recommend properly decoding UTF-8 (and other character encodings) on input, and encoding the strings again on output. Also add "use utf8;" if you have non-ASCII string constants in your source code.
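Roughly, the decode / work / encode cycle could look like the sketch below. The input bytes stand in for data read from a file or socket, and all variable names are assumptions. It uses Encode directly; I/O layers on your filehandles would do the same job. The file itself has to be saved as UTF-8 for "use utf8;" to be correct.

```perl
use strict;
use warnings;
use utf8;                        # string constants in this file are UTF-8
use Encode qw(decode encode);

# Stand-in for bytes that arrived from outside (hypothetical input).
my $raw_input = "na\xC3\xAFve caf\xC3\xA9";

# 1. Decode on input: bytes become characters.
my $text = decode('UTF-8', $raw_input);

# 2. Work on characters: \p{L} matches letters of any script.
my @words = $text =~ /(\p{L}+)/g;

# A non-ASCII string constant compares correctly thanks to "use utf8;".
my $found = grep { $_ eq 'café' } @words;

# 3. Encode on output: characters become bytes again.
my $raw_output = encode('UTF-8', join('|', map { ucfirst($_) } @words));
print $raw_output, "\n";
```

Without step 1 the regex would see the accented letters as two separate bytes each, and the matches would split words in the middle.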


Re^4: Modern best practices for multilingual regexp alphabetical character matching?
by mea (Initiate) on Jan 26, 2009 at 09:59 UTC

    Thanks for the answer.

    So basically I am safe as long as I use these two lines and put "use utf8;" at the top of my script every time. This has really been the most confusing thing so far. The otherwise excellent "Learning Perl" doesn't mention these problems at all, and some of its examples don't work correctly with UTF-8 characters. That is fine for English-speaking beginners, but people working with other languages have to deal with this issue right from the start. It could have saved me a lot of time if it had simply said "for non-English languages or UTF-8, add this to your script". Well, at least now I know and can go back to learning the "proper" stuff... Thanks again,

    Best Regards,

    Martin
