Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

In order to deal with Japanese orders, I recently had to convert my whole system to UTF-8. A day or 2's job I thought. 2.5 weeks later, I'm finally there. There is a lot of stuff on Perlmonks and the internet in general about this but it is hard to understand and even harder to implement. Most of the advice I read was along the lines of RTFM or did not give the whole story. It's pretty clear this is a common problem, too. I wanted to give something back to the community as perlmonks has helped me a lot, so I thought I would share some insights that I hope will be practical and useful.

There is a lot out there telling you to used decode/encode and giving lectures on internal representation of UTF8 in Perl and wotnot. In the end I've only had to use decode in one place where data is coming in from elsewhere. If you get all the other stuff right, I believe you shouldn't need any or many instances of decode/encode.

Our system involves a local website using MySQL, a live website, static webpages, generated webpages, various text files and CGI website forms. All of this needs work to make it work. Here are the things that I needed to do:

Checklist of changes to make

* Firstly, every script file is converted to UTF-8 format. Easy.

* Every script to have this at the top: use utf8; This tells perl that the script itself is in UTF format. So a in the script will be interpreted as a UTF-8 . It's no good just putting this in the calling script as it only seems to extend for the scope of the script underneath; not any other scripts that are imported with require...

* Ideally each database table must be turned to UTF-8 format. This turns out to be difficult and time-consuming because any tables with foreign keys won't convert unless you first delete the foreign keys. For those that won't easily convert, you can convert only the fields that might hold UTF-8 encoded characters to UTF-8 format. Also BLOB fields are a problem unless the whole table is UTF-8. I had to convert problem BLOB fields to TEXT fields and then convert them to UTF-8 format (a 2 step process, doing both in 1 step fails).

* Rose::DB (or whatever database method you are using) needs to be told that incoming data from the Database is in UTF-8. For Rose:DB, add this to the connector in DB.pm and then regenerate connect_options => {mysql_enable_utf8 => 1}

* binmode(STDOUT, ":utf8"); # Put this at the top of a script - tells it to output UTF to stdout. Not sure if this is just needed only once in the opening script or in any requires, too?

* Webpages must have this in the head section: <meta http-equiv="content-type" content="text/html; charset=UTF-8">

* use CGI qw(-utf8); to treat incoming CGI parameters as UTF-8. Getting this working was subtle - test carefully.

* When outputting a CGI webpage, the first thing to do is to output the http header and this needs to be told about UTF8 too: Personally I found that print header(-type=>'text/html', -cookie=>'', -charset=>'utf-8'); gave problems with cookies so ended up outputting it direct: print "Content-type: text/html; charset=utf-8\n$cookie\n\n";

* use open ':encoding(utf8)'; # tells it to deal with all files in a UTF8 way. In fact, I was more careful with this and did not use it in general. Instead, I have specifically opened each file that needed it with open($fh, '<:encoding(UTF-8)', $filename);. Because some files that I have to deal with have not been given to me in UTF-8 format. Careful - this can fail if the $filename variable is not also in UTF8!

Identifying Errors

In doing this, you will make mistakes and see weird characters appearing in unexpected places. I developed my own personal understanding of how to deal with them. These are my own notes for practical situations so please bear with me, if the explanations are not exactly correct - it was about fixing stuff not being a perl rocket scientist.

  • You see displayed as '£'
    • If sign is coming from dbase and is stored correctly in dbase and webpage is correctly displaying UTF-8 characters from elsewhere (e.g. write japanese text into the perl script and print it), then the UTF-8 is not being retrieved from the database as UTF-8 (presumably being assumed to be Latin1).
    • The is within a UTF-8 encoded PERL script but use utf8; is not set at the top of the script.
    • The is displayed correctly in a form initially but when the form is saved/updated, the then displays as '£'. Use the -utf8 CGI pragma to treat incoming parameters as UTF-8: use CGI ('-utf8');
  • is displayed on a webpage as �
    • This happens when the http header Content Type is not UTF8 and the meta tag is similarly <meta http-equiv="Content-Type" content="text/html" />
  • or other characters are being displayed as a diamond with ? inside it
    • StackOverflow:...usually the sign of an invalid (non-UTF-8) character showing up in an output (like a page) that has been declared to be UTF-8. Can be fixed by putting the following at the top of script: binmode(STDOUT, ":utf8");
  • Error message: Wide character in print
    • Means a print statement (to STDOUT or a file) that is outputting Latin1 includes a UTF-8 character... To fix, add '>:encoding (UTF-8) to the open statement or #binmode(STDOUT, ":utf8");

In reply to Converting everything (MySql, perl, CGI, website) to UTF-8 by jfrm

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others chanting in the Monastery: (3)
    As of 2021-01-24 15:03 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?
      Notices?