unicode (and mysql)

bjelli has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow monks

i'm trying to use perl(5.8) + dbi(1.37) + dbd::mysql(2.1026) + mysql(4.1.0-alpha) with unicode.

as far as i can tell i can write a utf8 string into the database, and get back the same sequence of bits, only now it's a 'classical' perl-string, not flagged as utf-8.

the string i write into the db is 6 characters long: "ABc\N{greek:alpha}\x{00df}\N{cyrillic:e}"

    character           unicode     utf8
                        (hex)       (binary)

    A                   0041        01000001
    B                   0042        01000010
    c                   0063        01100011
    greek alpha         03B1        1100111010110001
    german scharfes s   00DF        1100001110011111
    cyrillic e          044D        1101000110001101

what i get back from the db is

    A                           01000001
    B                           01000010
    c                           01100011
    ?                           11001110
    ?                           10110001
    ?                           11000011
    ?                           00111111
    ?                           11010001
    ?                           00111111

I have tried to convert this using $new = decode_utf8( $fromdb ); but all i get is an empty string. is there some way to find out why this won't decode?
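One way to see why a decode fails, sketched with the core Encode module: ask for a strict decode with FB_CROAK, so that malformed input raises an error instead of being handled silently. The helper name try_decode_utf8 is made up for this illustration, and the byte string is typed in from the dump above rather than fetched from MySQL.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: strict decode that reports the error instead of dying.
# Returns (decoded string, undef) on success or (undef, error message) on failure.
sub try_decode_utf8 {
    my ($bytes) = @_;
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $chars ? ($chars, undef) : (undef, $@);
}

# The six characters as they were sent, encoded to UTF-8 bytes
my $sent = encode('UTF-8', "ABc\x{03B1}\x{00DF}\x{044D}");

# The nine bytes that came back, copied from the dump above
my $received = "ABc\xCE\xB1\xC3\x3F\xD1\x3F";

my ($ok_chars,  $ok_err)  = try_decode_utf8($sent);
my ($bad_chars, $bad_err) = try_decode_utf8($received);

print "sent:     ",
    (defined $ok_chars ? length($ok_chars) . " characters" : "error: $ok_err"), "\n";
print "received: ",
    (defined $bad_chars ? length($bad_chars) . " characters" : "error: $bad_err"), "\n";
```

In the dump, the bytes following 11000011 and 11010001 are 00111111 (an ASCII "?"). A UTF-8 continuation byte has to look like 10xxxxxx, so a strict decoder rejects the returned string.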

or is my debugging stuff that shows me the bits in the string just wrong:


use Encode qw(is_utf8 encode_utf8);

# Return one line per character of the argument:
# the character itself, its code in hex, and its bits.
sub showbits
{
    my ($string) = @_;
    my $utf      = is_utf8($string);
    my $template = $utf ? "U*" : "C*";
    my $result   = "";
    my $i        = 0;
    foreach my $code ( unpack($template, $string) )
    {
        my $bits;
        $result .= "\n";
        $result .= substr( $string, $i, 1 ) . "=";
        $result .= sprintf("%04X", $code) . "=";
        if ( $utf and $code > 127 ) {
            # bits of the UTF-8 encoding of this character
            $bits = unpack("B*", encode_utf8( chr($code) ));
        }
        else {
            # a single byte: eight bits, leading zeros kept
            $bits = unpack("B*", pack("C", $code));
        }
        $result .= $bits;
        $i++;
    }
    return $result;
}

--
Brigitte    'I never met a chocolate I didnt like'    Jellinek
http://www.horus.com/~bjelli/         http://perlwelt.horus.at

Replies are listed 'Best First'.

Re: unicode (and mysql)
by zby (Vicar) on Jun 16, 2003 at 15:05 UTC

The last byte "00111111" should be "10001101" (the last 8 bits of the Cyrillic e). Is that a copying mistake, or is it at the heart of the problem?

Re: unicode (and mysql)
by PodMaster (Abbot) on Jun 16, 2003 at 14:12 UTC

A simple search for mysql unicode yields:

unicode file in mysql

update: Encode

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.

Re: unicode (and mysql)
by yosefm (Friar) on Jun 16, 2003 at 18:48 UTC

so, you have two options:

Write UTF-8 anyway and repair the string when it comes back.
Use Text::Iconv to send your string to the DB in a supported encoding (usually iso-8859-x works, and this includes Latin1) and translate it back to UTF-8 when you get it from the DB.
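Both options can be sketched without a database. This is only an illustration: it uses the core Encode module in place of Text::Iconv, and the helpers to_db, from_db and fits_latin1 are hypothetical names that stand in for whatever happens around the real INSERT and SELECT.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK LEAVE_SRC);

# Option 1: hand the database plain UTF-8 bytes and decode them on the way back.
sub to_db   { my ($chars) = @_; return encode('UTF-8', $chars); }
sub from_db { my ($bytes) = @_; return decode('UTF-8', $bytes); }

# Option 2: a single-byte encoding only works if every character fits into it.
sub fits_latin1 {
    my ($chars) = @_;
    return defined eval { encode('iso-8859-1', $chars, FB_CROAK | LEAVE_SRC) };
}

my $original = "ABc\x{03B1}\x{00DF}\x{044D}";
my $stored   = to_db($original);     # byte string, as the DB would see it
my $restored = from_db($stored);     # character string again

printf "stored:   %d bytes\n",      length $stored;
printf "restored: %d characters\n", length $restored;
print  "round trip ", ($restored eq $original ? "ok" : "FAILED"), "\n";

print "sharp s fits Latin1:     ", (fits_latin1("\x{00DF}") ? "yes" : "no"), "\n";
print "greek alpha fits Latin1: ", (fits_latin1("\x{03B1}") ? "yes" : "no"), "\n";
```

The second option cannot carry the whole test string from the question, since the Greek and Cyrillic letters have no place in Latin1.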

I hate l10n

Re: unicode (and mysql)
by bjelli (Pilgrim) on Jun 17, 2003 at 09:16 UTC

zby

yosefm said that mysql still doesn't support unicode - well, the 4.1.0-alpha boasts 'extensive unicode support'. I have managed to store utf-8 data, and retrieve it, and search for unicode characters via the command line client.

if anyone's interested in this: I keep a journal on my adventures with mysql + perl + unicode on the web

Re: Re: unicode (and mysql)

by zby (Vicar) on Jun 17, 2003 at 09:33 UTC

By the way, I have worked with Postgres, perl and unicode. It mostly works, but collation is implemented for only a few scripts (at least that was the case last autumn).

Re: Re: unicode (and mysql)

by yosefm (Friar) on Jun 17, 2003 at 10:01 UTC

Of course, I use production versions only. Have fun with unicode...

Switching strings' UTF-8 bits under older Perls
by andrewc (Acolyte) on Jun 17, 2003 at 12:00 UTC

There are monks who have to coax UTF-8 out of older mysql servers under older versions of Perl in their daily labours. The builtins pack() and unpack() can be used in such cases, for versions of Perl >= 5.6.

I offer the following humble script, in the hope that parts of it might aid others in their assigned tasks.

#!/usr/bin/perl
# Demonstrate some real-world encoding fixes.
# This has been tested on Perls 5.6.1 and 5.8.0.

BEGIN {
        require v5.6.0;
}
use utf8;
no bytes;
binmode STDOUT, ':utf8' if $] >= 5.008;   # suppresses a warning

# Correctly encoded data (string-is-unicode bit set)

print "\n\$good:\n";
$good = chr(0x03B1) . chr(0x00DF) . chr(0x044D);
print $good, "\n",
      length($good), "\n";     # three characters


# Make a copy without the string-is-unicode bit set on it
# This is the kind of thing DBD::mysql returns if you put something like $good
# into the database originally.

binmode STDOUT, ':bytes' if $] >= 5.008;
print "\n\$bad:\n";
$bad = pack("C0C*", unpack("C0C*", $good));
print $bad, "\n",
      length($bad), "\n";      # six bytes

print "\n(\$bad eq \$good): " . (($bad eq $good) ? "yes" : "no") . "\n";
# At Perl 5.6.1, this says "yes".
# At Perl 5.8.0, this says "no".

# Repack the bad string into another correctly-tagged string

binmode STDOUT, ':utf8' if $] >= 5.008;
print "\n\$also_good:\n";
$also_good = pack("U0U*", unpack("U0U*", $bad));
print $also_good, "\n",
      length($also_good), "\n";


print "\n(\$bad eq \$also_good): "
        . (($bad eq $also_good) ? "yes" : "no") . "\n";
print "(\$good eq \$also_good): "
        . (($good eq $also_good) ? "yes" : "no") . "\n\n";

There's a meditation in here on "U0U*" vs. "U*", I think.

Note: I use a UTF-8-capable terminal, hence the fiddling with binmode. That's another can of worms though.
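For monks who only need Perl 5.8 or later, the same switching can be sketched with the built-in utf8::encode and utf8::decode functions instead of the pack()/unpack() templates. This is an alternative to the script above, not a replacement for it on 5.6, and it reuses the variable names $good, $bad and $also_good only for comparison.

```perl
use strict;
use warnings;

# Correctly flagged character string: three characters
my $good = chr(0x03B1) . chr(0x00DF) . chr(0x044D);

# Turn a copy into its UTF-8 bytes, in place: six bytes, flag off.
# This is the shape of data described above as coming back from DBD::mysql.
my $bad = $good;
utf8::encode($bad);

# Turn a copy of the bytes back into characters, in place.
# utf8::decode returns false if the bytes are not well-formed UTF-8.
my $also_good = $bad;
my $decoded_ok = utf8::decode($also_good);

print "length(\$good):      ", length($good), "\n";
print "length(\$bad):       ", length($bad), "\n";
print "length(\$also_good): ", length($also_good), "\n";
print "decode succeeded:   ", ($decoded_ok ? "yes" : "no"), "\n";
print "(\$good eq \$also_good): ", ($good eq $also_good ? "yes" : "no"), "\n";
```

Only lengths and comparisons are printed here, so the sketch does not depend on the terminal or on binmode.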
