Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

In order to deal with Japanese orders, I recently had to convert my whole system to UTF-8. A day or 2's job I thought. 2.5 weeks later, I'm finally there. There is a lot of stuff on Perlmonks and the internet in general about this but it is hard to understand and even harder to implement. Most of the advice I read was along the lines of RTFM or did not give the whole story. It's pretty clear this is a common problem, too. I wanted to give something back to the community as perlmonks has helped me a lot, so I thought I would share some insights that I hope will be practical and useful.

There is a lot out there telling you to used decode/encode and giving lectures on internal representation of UTF8 in Perl and wotnot. In the end I've only had to use decode in one place where data is coming in from elsewhere. If you get all the other stuff right, I believe you shouldn't need any or many instances of decode/encode.

Our system involves a local website using MySQL, a live website, static webpages, generated webpages, various text files and CGI website forms. All of this needs work to make it work. Here are the things that I needed to do:

Checklist of changes to make

* Firstly, every script file is converted to UTF-8 format. Easy.

* Every script to have this at the top: use utf8; This tells perl that the script itself is in UTF format. So a £ in the script will be interpreted as a UTF-8 £. It's no good just putting this in the calling script as it only seems to extend for the scope of the script underneath; not any other scripts that are imported with require...

* Ideally each database table must be turned to UTF-8 format. This turns out to be difficult and time-consuming because any tables with foreign keys won't convert unless you first delete the foreign keys. For those that won't easily convert, you can convert only the fields that might hold UTF-8 encoded characters to UTF-8 format. Also BLOB fields are a problem unless the whole table is UTF-8. I had to convert problem BLOB fields to TEXT fields and then convert them to UTF-8 format (a 2 step process, doing both in 1 step fails).

* Rose::DB (or whatever database method you are using) needs to be told that incoming data from the Database is in UTF-8. For Rose:DB, add this to the connector in DB.pm and then regenerate connect_options => {mysql_enable_utf8 => 1}

* binmode(STDOUT, ":utf8"); # Put this at the top of a script - tells it to output UTF to stdout. Not sure if this is just needed only once in the opening script or in any requires, too?

* Webpages must have this in the head section: <meta http-equiv="content-type" content="text/html; charset=UTF-8">

* use CGI qw(-utf8); to treat incoming CGI parameters as UTF-8. Getting this working was subtle - test carefully.

* When outputting a CGI webpage, the first thing to do is to output the http header and this needs to be told about UTF8 too: Personally I found that print header(-type=>'text/html', -cookie=>'', -charset=>'utf-8'); gave problems with cookies so ended up outputting it direct: print "Content-type: text/html; charset=utf-8\n$cookie\n\n";

* use open ':encoding(utf8)'; # tells it to deal with all files in a UTF8 way. In fact, I was more careful with this and did not use it in general. Instead, I have specifically opened each file that needed it with open($fh, '<:encoding(UTF-8)', $filename);. Because some files that I have to deal with have not been given to me in UTF-8 format. Careful - this can fail if the $filename variable is not also in UTF8!

Identifying Errors

In doing this, you will make mistakes and see weird characters appearing in unexpected places. I developed my own personal understanding of how to deal with them. These are my own notes for practical situations so please bear with me, if the explanations are not exactly correct - it was about fixing stuff not being a perl rocket scientist.

  • You see £ displayed as '£'
    • If £ sign is coming from dbase and is stored correctly in dbase and webpage is correctly displaying UTF-8 characters from elsewhere (e.g. write japanese text into the perl script and print it), then the UTF-8 is not being retrieved from the database as UTF-8 (presumably being assumed to be Latin1).
    • The £ is within a UTF-8 encoded PERL script but use utf8; is not set at the top of the script.
    • The £ is displayed correctly in a form initially but when the form is saved/updated, the £ then displays as '£'. Use the -utf8 CGI pragma to treat incoming parameters as UTF-8: use CGI ('-utf8');
  • £ is displayed on a webpage as �
    • This happens when the http header Content Type is not UTF8 and the meta tag is similarly <meta http-equiv="Content-Type" content="text/html" />
  • £ or other characters are being displayed as a diamond with ? inside it
    • StackOverflow:...usually the sign of an invalid (non-UTF-8) character showing up in an output (like a page) that has been declared to be UTF-8. Can be fixed by putting the following at the top of script: binmode(STDOUT, ":utf8");
  • Error message: Wide character in print
    • Means a print statement (to STDOUT or a file) that is outputting Latin1 includes a UTF-8 character... To fix, add '>:encoding (UTF-8) to the open statement or #binmode(STDOUT, ":utf8");

In reply to Converting everything (MySql, perl, CGI, website) to UTF-8 by jfrm

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (6)
As of 2024-03-28 09:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found