  • In ASCII, a character always maps to a single byte.

  • In UTF-8, a character may map to multiple bytes.

  • To know whether to treat the data it receives from an external source (which could be text, or binary data such as an image) as a string of bytes or as a UTF-8 string, Perl uses the internal UTF8 flag.

  • Nothing external to Perl (e.g. the console or the database) knows about this flag, so we need to transform all input and output data into a form that each program understands, which we do using Encode.

  • To convert an input string of bytes that represents a UTF-8 string into Perl's internal string format, we DECODE the bytes from UTF-8, using Encode::decode() or Encode::decode_utf8().

  • To convert a Perl string into a string of bytes representing a UTF-8 string for other programs to understand, we ENCODE the string to UTF-8, using Encode::encode() or Encode::encode_utf8().
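
The points above can be sketched in a few lines. This is only an illustration: the sample word and the variable names are made up for the example, and it uses Encode::decode() and Encode::encode() with an explicit 'UTF-8' argument.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Bytes as they might arrive from outside Perl: "caf" followed by the
# two-byte UTF-8 sequence for e-acute (U+00E9).
my $bytes = "caf\xC3\xA9";

# DECODE: UTF-8 bytes -> Perl character string.
my $chars = decode('UTF-8', $bytes);

# ENCODE: Perl character string -> UTF-8 bytes for other programs.
my $round_trip = encode('UTF-8', $chars);

printf "length as bytes:      %d\n", length $bytes;   # 5
printf "length as characters: %d\n", length $chars;   # 4
printf "UTF8 flag on bytes:   %s\n", utf8::is_utf8($bytes) ? 'on' : 'off';
printf "UTF8 flag on chars:   %s\n", utf8::is_utf8($chars) ? 'on' : 'off';
printf "round trip intact:    %s\n", $round_trip eq $bytes ? 'yes' : 'no';
```

The same data has a different length before and after decoding, which is usually the quickest way to check which form a string is in.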

An example round trip

The steps below demonstrate how to accept UTF-8 strings from outside, store them in MySQL, retrieve them from the database, and output them again.

               Manual, or with               Automatic with
               PerlIO layers                 DBD::mysql > v4
               ---------------               ----------------
    ---------                     ---------                     ----------
     UTF-8    ---decode_utf8()->>  Perl     ---encode_utf8()->>  UTF-8
     Console  <<-encode_utf8()---  strings  <<-decode_utf8()---  MySQL DB
    ---------                     ---------                     ----------

  1. Create a table in MySQL which uses the UTF-8 character set:

    This step ensures that all UTF-8 aware programs that interact with this database know to treat the stored data as UTF-8.

    CREATE TABLE test_db.test ( string VARCHAR(50) ) CHARACTER SET utf8;
  2. Get a UTF-8 string:

    This step accepts a string of bytes representing a UTF-8 string, and converts it into Perl's internal string format.

    • From a UTF-8 console:
      use Encode qw( decode_utf8 );
      my $string      = <>;
      my $utf8_string = decode_utf8($string);
    • or, from an ISO-8859-1 console:
      use Encode qw( decode );
      my $iso_8859_string = <>;
      my $utf8_string     = decode('ISO-8859-1', $iso_8859_string);
    • or from within a Perl script:
      use utf8; # Tells Perl that the script itself is written in UTF-8
      my $utf8_string = "UTF-8 string with special chars: ";
  3. Open a UTF-8 enabled database connection:

    This step connects to the database, and tells DBD::mysql to auto-convert to/from UTF-8.

    IMPORTANT: This requires a version of DBD::mysql greater than version 4

    use DBI();
    my $dbh = DBI->connect(
        'dbi:mysql:test_db', $username, $password,
        { mysql_enable_utf8 => 1 },
    );
  4. Write to and read from the DB:

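    # do() takes ($statement, \%attr, @bind_values), so pass undef for the attributes
    $dbh->do('INSERT INTO test_db.test VALUES(?)', undef, $utf8_string);
    # fetch the first column of the first row in one call
    my ($new_string) = $dbh->selectrow_array('SELECT string FROM test_db.test LIMIT 1');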
  5. Display the retrieved string:

    The output data needs to be converted from Perl's internal format into a string of bytes that the console will understand.

    • on a UTF-8 console:
      use Encode qw( encode_utf8 );
      print encode_utf8($new_string);

      # OR add an auto-encoding layer to the output handle
      binmode(STDOUT, ':utf8');
      print $new_string;
    • or, on an ISO-8859-1 console:
      use Encode qw( encode );
      print encode('ISO-8859-1', $new_string);
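
The diagram mentions PerlIO layers as the alternative to calling decode_utf8() and encode_utf8() by hand. A minimal sketch of that approach follows. It assumes in-memory filehandles as stand-ins for the console so that it runs without a terminal or a database; the sample text and variable names are illustrative only.

```perl
use strict;
use warnings;

# Stand-in for console input: "na", the two-byte UTF-8 sequence for
# i-diaeresis (U+00EF), then "ve" and a newline.
my $input_bytes = "na\xC3\xAFve\n";

# Reading through an :encoding(UTF-8) layer decodes automatically.
open my $in, '<:encoding(UTF-8)', \$input_bytes or die "open in: $!";
my $line = <$in>;
close $in;
chomp $line;

# Writing through the same kind of layer encodes automatically.
my $output_bytes = '';
open my $out, '>:encoding(UTF-8)', \$output_bytes or die "open out: $!";
print {$out} $line;
close $out;

printf "characters read: %d\n", length $line;          # 5
printf "bytes written:   %d\n", length $output_bytes;  # 6
```

With real handles the same effect comes from binmode(STDIN, ':encoding(UTF-8)') and binmode(STDOUT, ':encoding(UTF-8)'), after which no explicit Encode calls are needed at the console boundary.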
For more info, see perlunitut: Unicode in Perl, perluniintro, perlunicode, perlrun, binmode, open and PerlIO.

UPDATE - Added readmore tags. Added diagram illustrating round trip

UPDATE - Corrected a typo: TO utf8, not from utf8


In reply to A UTF8 round trip with MySQL by clinton
