Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Converting everything (MySql, perl, CGI, website) to UTF-8

by jfrm (Monk)
on Mar 16, 2018 at 08:01 UTC ( #1211014=perlmeditation: print w/replies, xml ) Need Help??

In order to deal with Japanese orders, I recently had to convert my whole system to UTF-8. A day or 2's job I thought. 2.5 weeks later, I'm finally there. There is a lot of stuff on Perlmonks and the internet in general about this but it is hard to understand and even harder to implement. Most of the advice I read was along the lines of RTFM or did not give the whole story. It's pretty clear this is a common problem, too. I wanted to give something back to the community as perlmonks has helped me a lot, so I thought I would share some insights that I hope will be practical and useful.

There is a lot out there telling you to used decode/encode and giving lectures on internal representation of UTF8 in Perl and wotnot. In the end I've only had to use decode in one place where data is coming in from elsewhere. If you get all the other stuff right, I believe you shouldn't need any or many instances of decode/encode.

Our system involves a local website using MySQL, a live website, static webpages, generated webpages, various text files and CGI website forms. All of this needs work to make it work. Here are the things that I needed to do:

Checklist of changes to make

* Firstly, every script file is converted to UTF-8 format. Easy.

* Every script to have this at the top: use utf8; This tells perl that the script itself is in UTF format. So a in the script will be interpreted as a UTF-8 . It's no good just putting this in the calling script as it only seems to extend for the scope of the script underneath; not any other scripts that are imported with require...

* Ideally each database table must be turned to UTF-8 format. This turns out to be difficult and time-consuming because any tables with foreign keys won't convert unless you first delete the foreign keys. For those that won't easily convert, you can convert only the fields that might hold UTF-8 encoded characters to UTF-8 format. Also BLOB fields are a problem unless the whole table is UTF-8. I had to convert problem BLOB fields to TEXT fields and then convert them to UTF-8 format (a 2 step process, doing both in 1 step fails).

* Rose::DB (or whatever database method you are using) needs to be told that incoming data from the Database is in UTF-8. For Rose:DB, add this to the connector in DB.pm and then regenerate connect_options => {mysql_enable_utf8 => 1}

* binmode(STDOUT, ":utf8"); # Put this at the top of a script - tells it to output UTF to stdout. Not sure if this is just needed only once in the opening script or in any requires, too?

* Webpages must have this in the head section: <meta http-equiv="content-type" content="text/html; charset=UTF-8">

* use CGI qw(-utf8); to treat incoming CGI parameters as UTF-8. Getting this working was subtle - test carefully.

* When outputting a CGI webpage, the first thing to do is to output the http header and this needs to be told about UTF8 too: Personally I found that print header(-type=>'text/html', -cookie=>'', -charset=>'utf-8'); gave problems with cookies so ended up outputting it direct: print "Content-type: text/html; charset=utf-8\n$cookie\n\n";

* use open ':encoding(utf8)'; # tells it to deal with all files in a UTF8 way. In fact, I was more careful with this and did not use it in general. Instead, I have specifically opened each file that needed it with open($fh, '<:encoding(UTF-8)', $filename);. Because some files that I have to deal with have not been given to me in UTF-8 format. Careful - this can fail if the $filename variable is not also in UTF8!

Identifying Errors

In doing this, you will make mistakes and see weird characters appearing in unexpected places. I developed my own personal understanding of how to deal with them. These are my own notes for practical situations so please bear with me, if the explanations are not exactly correct - it was about fixing stuff not being a perl rocket scientist.

  • You see displayed as '£'
    • If sign is coming from dbase and is stored correctly in dbase and webpage is correctly displaying UTF-8 characters from elsewhere (e.g. write japanese text into the perl script and print it), then the UTF-8 is not being retrieved from the database as UTF-8 (presumably being assumed to be Latin1).
    • The is within a UTF-8 encoded PERL script but use utf8; is not set at the top of the script.
    • The is displayed correctly in a form initially but when the form is saved/updated, the then displays as '£'. Use the -utf8 CGI pragma to treat incoming parameters as UTF-8: use CGI ('-utf8');
  • is displayed on a webpage as �
    • This happens when the http header Content Type is not UTF8 and the meta tag is similarly <meta http-equiv="Content-Type" content="text/html" />
  • or other characters are being displayed as a diamond with ? inside it
    • StackOverflow:...usually the sign of an invalid (non-UTF-8) character showing up in an output (like a page) that has been declared to be UTF-8. Can be fixed by putting the following at the top of script: binmode(STDOUT, ":utf8");
  • Error message: Wide character in print
    • Means a print statement (to STDOUT or a file) that is outputting Latin1 includes a UTF-8 character... To fix, add '>:encoding (UTF-8) to the open statement or #binmode(STDOUT, ":utf8");

Replies are listed 'Best First'.
Re: Converting everything (MySql, perl, CGI, website) to UTF-8
by choroba (Bishop) on Mar 16, 2018 at 08:33 UTC
  • You can use <li> to make bullets, they look much nicer than asterisks (see this post for an example).

  • use UTF;

    You probably mean

    use utf8;

  • Note that in MySQL, you can configure the encoding used for:
    • storing the data
    • sending the data
    • getting the data
    plus various defaults in case any of the specific ones wasn't specified.

  • Note that :utf8 and :encoding(UTF-8) layers aren't identical. The latter is more strict and does more checks.

  • this can fail if the $filename variable is not also in UTF8

    The encoding of the contents is in no way related to the encoding of the filename. Moreover, at least in a web app, the filename on your system should be controlled by you, not a user input, so you should always know what encoding the filename uses.

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Thanks for correction and formatting advice; I wrote the post in a hurry but have tidied it a bit today.

      This is a list of problems that arose in the real world that, regrettably, I find myself inhabiting; Information comes into my programs from all manner of different places and in some, I can assure you that the open command crashed until I used I first converted the filename viz: my $utf8filename = encode('utf8', $payfile);. Perhaps there was a better way, but sometimes you just need to kill the obscure error expediently.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://1211014]
Approved by choroba
Front-paged by Discipulus
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (5)
As of 2018-07-17 04:26 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    It has been suggested to rename Perl 6 in order to boost its marketing potential. Which name would you prefer?















    Results (354 votes). Check out past polls.

    Notices?