Re: utf8, locale and regexp

by ruoso (Curate)
on Apr 11, 2007 at 10:35 UTC ( #609331=note )

in reply to utf8, locale and regexp

I usually like to say: the correct way of handling encodings in Perl is not to care about them. If you're caring too much, you're doing it the wrong way...

The only two things you need to do to work properly with whatever encoding in Perl are:

  • Tell Perl the encoding of your Inputs and of your Outputs.
  • Tell Perl the encoding of your source file.
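A minimal sketch of those two points, assuming UTF-8 everywhere. The in-memory file and the variable names are only for illustration; a real program would put the same layers on its actual files, sockets or STDIN/STDOUT.

```perl
use strict;
use warnings;
use utf8;   # the source file itself is UTF-8, so the literal below is characters

my $text = "café\n";

# Output: declare the encoding on the handle and let Perl encode on write.
# An in-memory file (\$bytes) stands in for a real file here.
my $bytes = '';
open(my $out, '>:encoding(UTF-8)', \$bytes) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

# Input: declare the encoding again and Perl decodes on read.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open for read: $!";
my $round_trip = <$in>;
close($in);

# The character count and the byte count differ because of the accented letter.
printf "characters: %d, encoded bytes: %d\n", length($text), length($bytes);
```

Once the layers are declared, the rest of the program only ever sees characters, which is the "not caring" part.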

Matching accented characters in regexps has nothing to do with encoding at all, only with locale. So if your locale is set correctly, the match will work in whatever encoding.

This way, the code you sent would look like the following (I included some extra CGI code to illustrate your case):

    use strict;
    use warnings;
    use CGI;

    # this tells my source file is UTF-8
    use utf8;

    # the latin accented characters are valid
    # for this locale, for instance.
    BEGIN { $ENV{LC_CTYPE} = 'pt_BR' }

    # tell Perl I want it to consider that
    use locale;

    # The good thing about CGI is that it already
    # honor the input encoding, so you don't need
    # to care.
    my $cgi = CGI->new();

    my $string = q( aA );

    # this match works because of the use locale,
    # not because of encodings...
    $string =~ s//b/g;

    # now two important things:
    # the first is to tell Perl that your STDOUT
    # is utf8 (this may not be the default depending
    # on the operating system, the environment and a
    # lot of other stuff). So it's better to do it
    # explicitly.
    binmode STDOUT, ':utf8';

    # The second is to properly say that to the browser
    # (this is actually HTTP specific, not exactly Perl
    # related, but, as you said you're working with CGI
    # I decided to cite here).
    print $cgi->header(-type => 'text/plain', -charset => 'utf-8');

    # then the string will be printed correctly
    print $string;

Hope this helps...

Update: I missed "-type => " in the first version...

Re^2: utf8, locale and regexp
by almut (Canon) on Apr 11, 2007 at 17:39 UTC
    The correct way of handling encodings in Perl is not to care about them. If you're caring too much, you're doing it the wrong way...

    I wish I could agree with this statement... but I'm afraid I can't.

    During the last few months at work, I've been involved in a number of Perl projects in Japanese and Chinese environments, where correct handling of encodings is of paramount importance (in particular on Windows, with its unholy mixture of encodings, like UCS-2, UTF-8 and various legacy codepages.) During that time, I've run into several encoding issues, where you just have to "care too much" (to use your words), or else things simply won't work.

    For one, Perl doesn't (yet) provide any convenient abstraction layer for handling file names (as opposed to file contents), which means you have to take care of everything yourself manually (by writing wrapper functions, using Encode::(en|de)code explicitly, etc.). In case you're interested in the details, look here for the kind of things I have in mind.
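To give an idea of the kind of wrapper meant here, a rough sketch follows. It assumes the file system stores names as UTF-8 bytes (typical on Linux, not on Windows), and `open_file`, `list_dir` and `$FS_ENCODING` are hypothetical names made up for this example, not part of any module.

```perl
use strict;
use warnings;
use utf8;                        # so the literal file name below is characters
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# ASSUMPTION: file names on this system are UTF-8 encoded bytes.
my $FS_ENCODING = 'UTF-8';

# Hypothetical wrapper: takes a character string, hands bytes to open().
sub open_file {
    my ($mode, $name) = @_;
    open(my $fh, $mode, encode($FS_ENCODING, $name))
        or die "cannot open file: $!";
    return $fh;
}

# Hypothetical wrapper: readdir() returns bytes, so decode every entry.
sub list_dir {
    my ($dir) = @_;
    opendir(my $dh, encode($FS_ENCODING, $dir))
        or die "cannot open directory: $!";
    my @names = map  { decode($FS_ENCODING, $_) }
                grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);
    return @names;
}

# Usage: create a file with a non-ASCII name, then list it back as characters.
my $dir = decode($FS_ENCODING, tempdir(CLEANUP => 1));
my $fh  = open_file('>', "$dir/日本語.txt");
close($fh) or die "close: $!";
my @found = list_dir($dir);
```

Every built-in that touches a name (open, opendir, unlink, rename, stat, ...) needs the same treatment, which is exactly the manual work being complained about.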

    This isn't the only problem, though. There are a few "borderline" bugs, like the one I posted recently in the hope of getting some feedback on whether other people would also consider it a bug. (Didn't work out, btw. Not a single reply, which makes me conclude that, with respect to unicode issues, there's not exactly an overwhelming amount of interest in the Perl community. Kind of a pity, but such is life.) Anyway, what I mean to say is that having to figure out that you need to specify :raw:encoding(ucs-2le):crlf:utf8 to read/write ordinary UCS-2 files (as frequently encountered on Windows platforms) is just a bit "having to care too much" for my taste... Not to forget the bug revealed in this thread, and other oddities related to subtle differences between use utf8 and use encoding 'utf8', for example.

    Of course, whether something is a bug is always kind of subjective, as it largely depends on your expectations of how things should work, but I think we're not doing ourselves a favor by pretending that everything encoding-related in Perl works without hassles...
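For illustration, a small sketch of that layer stack in use. The temporary file and the sample text are assumptions for the example; only the layer string comes from the post above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# :raw        start from a clean byte stream
# :encoding   translate between UCS-2LE bytes and characters
# :crlf       translate "\r\n" <-> "\n" at the character level
# :utf8       mark the top layer as carrying characters
my $layers = ':raw:encoding(ucs-2le):crlf:utf8';

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close($tmp_fh);

my $text = "caf\x{e9}\nsecond line\n";

# Write: "\n" becomes CR LF first, then everything is encoded as UCS-2LE.
open(my $out, ">$layers", $path) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

# Look at what actually landed in the file, byte for byte.
open(my $raw, '<:raw', $path) or die "open raw: $!";
my $bytes = do { local $/; <$raw> };
close($raw);

# Read back through the same stack: decode first, then CR LF becomes "\n".
open(my $in, "<$layers", $path) or die "open for read: $!";
my $decoded = do { local $/; <$in> };
close($in);

print "round trip ok\n" if $decoded eq $text;
```

The order matters: with :crlf below :encoding, the CR LF translation would be applied to the two-byte units rather than to characters.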

    Sorry for the rant, and don't get me wrong. I'm a big fan of Perl, and I would surely advocate Perl wherever appropriate. However, in one of the projects mentioned above, I've had a rather hard time convincing my clients to stick with Perl, and not switch to some other language altogether. This involved investing quite a few unpaid hours on my side (spent on debugging and working around various peculiarities) to keep the price competitive.

    Hope you can forgive the somewhat emotional tone of this post. In any case it's not meant to attack you personally, ruoso. Just needed to vent a little... and I'm feeling better now :)

      Perl's unicode support is far from complete - especially when you consider outside-the-base-distro modules that everyone relies on. I've been running into bugs in DBD::mysql myself. I even supplied a couple of patches. Right now I would say that perl's unicode is better than most programming languages, if you only look at the base language.

      I believe perl's internal distinction between the 8-bit (latin-1) encoding and the internal, multibyte (utf8) representation is the right choice for a language that has to keep strings == bytearray backward compatibility. It also keeps C <-> perl translation relatively straightforward.

      Also, I must say I've not run into any unicode bugs in perl in the 7 months since I started working on a fairly large multi-language system.
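A quick sketch of that distinction, using only the built-in utf8:: functions; the sample string and variable names are arbitrary.

```perl
use strict;
use warnings;

my $latin1 = "caf\xe9";      # stored internally as 4 bytes, one per character
my $wide   = $latin1;
utf8::upgrade($wide);        # same 4 characters, now stored internally as UTF-8

# The internal flag differs between the two copies...
printf "utf8 flag: %d vs %d\n",
    (utf8::is_utf8($latin1) ? 1 : 0), (utf8::is_utf8($wide) ? 1 : 0);

# ...and so does the internal byte count...
my $internal_bytes = do { use bytes; length($wide) };
printf "internal bytes of upgraded copy: %d\n", $internal_bytes;

# ...but at the language level they are the same string.
printf "length: %d vs %d\n", length($latin1), length($wide);
printf "equal: %s\n", ($latin1 eq $wide ? 'yes' : 'no');
```

Code that treats strings as byte arrays keeps working on the first form, and perl switches to the second only when a string needs characters above 255 (or is upgraded explicitly, as here).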

      But like I said, it's not quite like that when you consider modules. Most modules on CPAN aren't under the kind of scrutiny that the base perl distro is under. Right now I'm examining a DBD::mysql bug that seems to not affect the system I'm working on, but I can't figure out why it doesn't. <--- that means; no one's going to pay me for fixing it, probably. :-)

      Actually, your post was very informative. Thank you. And yes, I do believe the points you made are about bugs (especially the crlf issue).

      As to filenames, I think this is a wishlist bug for File::Spec. As far as I understand, File::Spec should also be able to deal with the encoding used by the operating system (or at least be able to receive the information about which encoding to use).

