http://www.perlmonks.org?node_id=266177

bjelli has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow monks

I'm trying to use Perl (5.8) + DBI (1.37) + DBD::mysql (2.1026) + MySQL (4.1.0-alpha) with Unicode.

As far as I can tell, I can write a UTF-8 string into the database and get back the same sequence of bits, only now it's a 'classical' Perl string, not flagged as UTF-8.

The string I write into the db is 6 characters long: "ABc\N{greek:alpha}\x{00df}\N{cyrillic:e}"

    character           unicode utf8
                        hex     binary

    A                   0041    01000001
    B                   0042    01000010
    c                   0063    01100011
    greek alpha         03B1    11001110 10110001
    german scharfes s   00DF    11000011 10011111
    cyrillic e          044D    11010001 10001101

What I get back from the db is:

    A                           01000001
    B                           01000010
    c                           01100011
    ?                           11001110
    ?                           10110001
    ?                           11000011
    ?                           00111111
    ?                           11010001
    ?                           00111111

I have tried to convert this using $new = decode_utf8( $fromdb ); but all I get is an empty string. Is there some way to find out why this won't decode?

Or is my debugging code, which shows me the bits in the string, just wrong:

    sub showbits {
        my ($template, $utf, $result, $i);
        $utf = is_utf8 $_[0];
        $template = $utf ? "U*" : "C*";
        foreach ( unpack($template, $_[0] ) ) {
            $result .= "\n" ;
            $result .= substr( $_[0], $i, 1 ) . "=";
            $result .= sprintf ("%04X", $_) . "=";
            if ( $utf and $_ > 127) {
                $b = unpack("B*", substr( $_[0], $i, 1 ));
            } else {
                $b = unpack("B*", pack("N", $_ ));
            }
            $b =~ s/^0{32}//;    # leading zeros
            $b =~ s/^0{16}//;
            $b =~ s/^0{8}//;
            $result .= $b;
            $i++;
        }
        return $result;
    }

--
Brigitte    'I never met a chocolate I didnt like'    Jellinek
http://www.horus.com/~bjelli/         http://perlwelt.horus.at

Replies are listed 'Best First'.
Re: unicode (and mysql)
by zby (Vicar) on Jun 16, 2003 at 15:05 UTC
    The last byte "00111111" should be "10001101" (the last 8 bits of the Cyrillic e). Is that a copying mistake, or is it at the heart of the problem?
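
One way to check is to encode the original six characters again and compare the result byte by byte with the nine bytes listed in the question. The sketch below does that with the core Encode module. differing_bytes is a made-up helper name, and the returned bytes are typed in from the question's second table rather than fetched from a database. On that data two bytes differ, not one: the second byte of the scharfes s (offset 6) as well as the second byte of the Cyrillic e (offset 8) came back as 00111111, which is an ASCII question mark.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Expected: the six characters from the question, encoded as UTF-8 bytes.
my $expected = encode('UTF-8', "ABc\x{03B1}\x{00DF}\x{044D}");

# Returned: the nine bytes from the question's second table (typed in by hand).
my $returned = pack('C*', 0x41, 0x42, 0x63, 0xCE, 0xB1, 0xC3, 0x3F, 0xD1, 0x3F);

# Hypothetical helper: list every offset where the two byte strings differ.
sub differing_bytes {
    my ($want, $got) = @_;
    my @want = unpack('C*', $want);
    my @got  = unpack('C*', $got);
    my @diff;
    for my $i (0 .. $#want) {
        next if defined $got[$i] && $want[$i] == $got[$i];
        my $seen = defined $got[$i] ? sprintf('%08b', $got[$i]) : 'nothing';
        push @diff, sprintf('offset %d: expected %08b, got %s', $i, $want[$i], $seen);
    }
    return @diff;
}

print "$_\n" for differing_bytes($expected, $returned);
```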
Re: unicode (and mysql)
by PodMaster (Abbot) on Jun 16, 2003 at 14:12 UTC
    update: You may wish to read "Handling Malformed Data" in Encode.
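
As a rough illustration of what that section of the Encode documentation describes: passing Encode::FB_CROAK as the CHECK argument makes decode() die on malformed input, so the error can be caught and printed instead of disappearing. utf8_decode_error is a hypothetical helper name, and the test bytes are the nine bytes from the question, typed in by hand. The exact wording of the message depends on the Encode version, so none is shown here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns undef if $bytes is well-formed UTF-8,
# otherwise the error message that Encode died with.
sub utf8_decode_error {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when CHECK is set
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? undef : $@;
}

# The nine bytes that came back from the database in the question.
my $from_db = pack('C*', 0x41, 0x42, 0x63, 0xCE, 0xB1, 0xC3, 0x3F, 0xD1, 0x3F);

my $error = utf8_decode_error($from_db);
print defined $error ? "malformed: $error" : "well-formed UTF-8\n";
```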

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: unicode (and mysql)
by yosefm (Friar) on Jun 16, 2003 at 18:48 UTC
    As far as I know from browsing the site of MySQL AB (the company that makes MySQL), MySQL still does not support Unicode.

    So you have two options:

    • You can try to write UTF-8 and repair it however necessary when it comes back.
    • Use Text::Iconv to send your string to the DB in a supported encoding (usually ISO-8859-x works; this includes Latin-1) and translate it back to UTF-8 when you get it from the DB.
    Anyway, I'd like to use this post to add that I hate l10n. That'll be all.
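
The second option can be sketched with the core Encode module standing in for Text::Iconv (a substitution made here only because Encode ships with Perl 5.8; the variable names are illustrative). The sketch also shows the limit of that route: Latin-1 has no Greek or Cyrillic letters, so of the three non-ASCII characters in the original question only the scharfes s survives the round trip. With Encode's default settings the other two are replaced by a question mark.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The six characters from the original question.
my $text = "ABc\x{03B1}\x{00DF}\x{044D}";

# Convert to Latin-1 bytes before storing, and back to characters after fetching.
# No database is involved here; $for_db stands in for the stored value.
my $for_db  = encode('iso-8859-1', $text);
my $from_db = decode('iso-8859-1', $for_db);

printf "%d characters sent, %d bytes stored, %d characters back\n",
    length($text), length($for_db), length($from_db);

# Show the code point of each character that came back.
printf "U+%04X\n", ord($_) for split //, $from_db;
```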
Re: unicode (and mysql)
by bjelli (Pilgrim) on Jun 17, 2003 at 09:16 UTC
      By the way, I have worked with Postgres, Perl and Unicode. It mostly works, but collation is implemented for only a few scripts (at least it was last autumn).
      Of course, I use production versions only. Have fun with Unicode...
Switching strings' UTF-8 bits under older Perls
by andrewc (Acolyte) on Jun 17, 2003 at 12:00 UTC

    There are monks who have to coax UTF-8 out of older mysql servers under older versions of Perl in their daily labours. The builtins pack() and unpack() can be used in such cases, for versions of Perl >= 5.6.

    I offer the following humble script, in the hope that parts of it might aid others in their assigned tasks.

        #!/usr/bin/perl
        # Demonstrate some real-world encoding fixes.
        # This has been tested on Perls 5.6.1 and 5.8.0.

        BEGIN { require v5.6.0; }
        use utf8;
        no bytes;

        binmode STDOUT, ':utf8' if $] >= 5.008;  # suppresses a warning

        # Correctly encoded data (string-is-unicode bit set)
        print "\n\$good:\n";
        $good = chr(0x03B1) . chr(0x00DF) . chr(0x044D);
        print $good, "\n", length($good), "\n";  # three characters

        # Make a copy without the string-is-unicode bit set on it
        # This is the kind of thing DBD::mysql returns if you put something like $good
        # into the database originally.
        binmode STDOUT, ':bytes' if $] >= 5.008;
        print "\n\$bad:\n";
        $bad = pack("C0C*", unpack("C0C*", $good));
        print $bad, "\n", length($bad), "\n";  # six bytes
        print "\n(\$bad eq \$good): " . (($bad eq $good) ? "yes" : "no") . "\n";
        # At Perl 5.6.1, this says "yes".
        # At Perl 5.8.0, this says "no".

        # Repack the bad string into another correctly-tagged string
        binmode STDOUT, ':utf8' if $] >= 5.008;
        print "\n\$also_good:\n";
        $also_good = pack("U0U*", unpack("U0U*", $bad));
        print $also_good, "\n", length($also_good), "\n";
        print "\n(\$bad eq \$also_good): "
            . (($bad eq $also_good) ? "yes" : "no") . "\n";
        print "(\$good eq \$also_good): "
            . (($good eq $also_good) ? "yes" : "no") . "\n\n";

    There's a meditation in here on "U0U*" vs. "U*", I think.

    Note: I use a UTF-8-capable terminal, hence the fiddling with binmode. That's another can of worms though.
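
For Perl 5.8 and later, the same two conversions can also be written with the core Encode module instead of pack templates. This is a sketch to sit alongside the script, not a replacement for it on 5.6, where Encode is not bundled. The variable names mirror the script, and nothing is printed to the terminal except lengths and yes/no answers, so no binmode is needed.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Correctly tagged character string: three characters.
my $good = chr(0x03B1) . chr(0x00DF) . chr(0x044D);

# Byte copy without the UTF-8 flag: six bytes, as a driver might return them.
my $bad = encode_utf8($good);

# Decode the bytes back into a character string.
my $also_good = decode_utf8($bad);

printf "good: %d, bad: %d, also_good: %d\n",
    length($good), length($bad), length($also_good);
print "bad eq good: ",       ($bad  eq $good      ? "yes" : "no"), "\n";
print "good eq also_good: ", ($good eq $also_good ? "yes" : "no"), "\n";
```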