http://www.perlmonks.org?node_id=942765

punkish has asked for the wisdom of the Perl Monks concerning the following question:

I am at my wits' end moving data from MySQL to Postgres (long story).

I inherit the MySQL data dump. I am pretty certain the encoding of the original db is all messed up; it is supposed to be utf8, is actually set as latin1, and has data in it that would be appropriate for utf8. Its first few lines are like so:

    /*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
    /*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
    /*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
    /*!40101 SET NAMES utf8 */;
    CREATE DATABASE /*!32312 IF NOT EXISTS*/ `dbname` /*!40100 DEFAULT CHARACTER SET latin1 */;
    USE `dbname`;
    DROP TABLE IF EXISTS `foo`;
    /*!40101 SET @saved_cs_client = @@character_set_client */;
    /*!40101 SET character_set_client = utf8 */;
    CREATE TABLE `foo` (
    ..

Besides the above, I know nothing more about the encoding. The new Pg database is utf8, and is in my control. Ideally, I would like to set up my routine so the Pg database is all correctly encoded in utf8. In other words, whatever the MySQL encoding, I want it to end up in Pg as utf8.

There are several complicated data types in the MySQL db, so I am unable to simply dump from MySQL and restore in Pg. I am now trying to simply query one db and insert the values into the other db using Perl DBI. I keep getting errors of the type 'invalid byte sequence for encoding "UTF8": 0x93 at...'. Here are my steps:

  1. dump data from MySQL (actually, I inherit the dump, so I have no control over how the dump is actually created)
  2. $ iconv -f latin1 -t UTF-8 in.sql > in_utf8.sql
  3. $ mysql < in_utf8.sql
  4. $ perl sync.pl

bam! error. The `sync.pl` script is pretty simple:

    query table in mysql
    while (my ($col1, $col2... $coln) = $sth->fetchrow_array) {
        ## I believe this is where I should be converting the $col1..n
        ## values to utf8, but don't know how.
        insert into Pg
    }
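Fleshed out, the script is essentially the following (the table name, column list, and connection details here are just placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # placeholder DSNs and credentials
    my $mysql = DBI->connect('dbi:mysql:database=dbname', 'user', 'pass',
                             { RaiseError => 1 });
    my $pg    = DBI->connect('dbi:Pg:dbname=dbname', 'user', 'pass',
                             { RaiseError => 1 });

    my $sth = $mysql->prepare('SELECT col1, col2 FROM foo');
    my $ins = $pg->prepare('INSERT INTO foo (col1, col2) VALUES (?, ?)');

    $sth->execute;
    while (my @row = $sth->fetchrow_array) {
        ## presumably this is where the utf8 conversion belongs
        $ins->execute(@row);
    }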

I have also tried the following variation: changed step #2 above to `sed 's/latin1/utf8/g' in.sql > in_utf8.sql` and then loaded that into mysql in step #3 above before trying to insert it into Pg. Still no joy; similar utf8 encoding error.

Suggestions?



when small people start casting long shadows, it is time to go to bed

Replies are listed 'Best First'.
Re: dealing with encoding while converting data from MySQL to Postgres
by moritz (Cardinal) on Dec 10, 2011 at 05:54 UTC
    Besides the above, I know nothing more about the encoding.

    Then you need to learn more about the encoding. I've seen enough crazy things (like putting UTF-8 into latin-1 tables) to advise you strongly not to blindly trust a DB, but look at the data that comes out of it, and find out which encoding(s) it uses. Yes, that takes time and effort, but if you don't invest it, you'll have ten times the effort fixing it later on.

    # $ iconv -f latin1 -t UTF-8 in.sql > in_utf8.sql
    # $ mysql < in_utf8.sql

    That's a pretty bad idea. As you've even shown us, the mysql dump contains meta information about the character encoding, which you don't change. Assuming that the meta information was accurate before your change, it is now certainly wrong. So, don't do that. Rather, use the original database (or recreate it from the dump), connect to that, and do all the encoding conversion in Perl.

    bam! error. The `sync.pl` script is pretty simple:

    while (my ($col1, $col2... $coln) = $sth->fetchrow_array) {
        ## I believe this is where I should be converting the $col1..n
        ## values to utf8, but don't know how.
        insert into Pg
    }

    You haven't shown us the interesting part where you set up DBD::mysql and DBD::Pg to deal with UTF-8. If you haven't done so, please read the documentation of these two modules regarding the handling of encodings in general, and UTF-8 in particular.

    As for converting the encoding, Encode can do all that is left to do after you set up the two DBD modules correctly (which could be nothing at all).
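    For example, if your inspection shows that the bytes coming out of MySQL are really latin-1 (or cp1252), the leftover conversion could look something like this (a sketch only; $ins is assumed to be your prepared Pg insert handle, and the source encoding is whatever the data turns out to be):

        use Encode qw(decode);

        while (my @columns = $sth->fetchrow_array) {
            # decode the raw bytes from MySQL into Perl character strings;
            # a correctly configured DBD::Pg then writes them out as UTF-8
            my @chars = map { defined $_ ? decode('cp1252', $_) : undef } @columns;
            $ins->execute(@chars);
        }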

    while (my ($col1, $col2... $coln) =

    I'd simply use while (my @columns = ... here instead, less typing.

    Encoding problems should be treated like any other problem in programming. You have to look carefully at the input data, decide what the result should be, and trace the data as it is processed in your program to the point where it deviates from your expectation. Once you've found that point, fix it and continue.

    See also: Character Encodings in Perl.

      Thanks moritz. Adding {mysql_enable_utf8 => 1} and {pg_enable_utf8 => 1} to the db handles for the two databases seems to have solved the problem.
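      For the record, the relevant connect calls now look roughly like this (DSNs and credentials are placeholders):

          my $mysql = DBI->connect('dbi:mysql:database=dbname', 'user', 'pass',
                                   { RaiseError => 1, mysql_enable_utf8 => 1 });
          my $pg    = DBI->connect('dbi:Pg:dbname=dbname', 'user', 'pass',
                                   { RaiseError => 1, pg_enable_utf8 => 1 });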

      There were other issues of incompatibility between the two databases that I had to solve, but now I can move data from one to the other.



      when small people start casting long shadows, it is time to go to bed
Re: dealing with encoding while converting data from MySQL to Postgres
by grantm (Parson) on Dec 12, 2011 at 01:14 UTC
    'invalid byte sequence for encoding "UTF8": 0x93 at...'

    That message suggests that (at least this row of) your data is not encoded in UTF-8 or Latin-1 but actually CP1252 - Microsoft's embraced and extended Latin-1 which replaces some of the control characters with smart quotes etc.

    If your data has a mixture of UTF-8, ASCII, Latin-1 and CP1252, then that is exactly the problem Encoding::FixLatin was designed to fix.

    The Encoding::FixLatin distribution includes a command-line tool called fix_latin which you can pipe your dump file through.

    In cases such as yours, where you're not able to restore a dump, you can use the Encoding::FixLatin module in your script that does the inserts and call its fix_latin() function to convert each column value to UTF-8.

    Note: when you connect to the Postgres database you'll want to set the pg_enable_utf8 flag in the connection attributes so that you get UTF8 characters back from a select.
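    Putting that together, roughly (an untested sketch; $sth and $ins stand in for your existing select and insert statement handles):

        use Encoding::FixLatin qw(fix_latin);

        while (my @row = $sth->fetchrow_array) {
            # fix_latin() passes valid UTF-8 through untouched and converts
            # any Latin-1/CP1252 byte sequences it finds to UTF-8
            my @fixed = map { defined $_ ? fix_latin($_) : undef } @row;
            $ins->execute(@fixed);
        }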

Re: dealing with encoding while converting data from MySQL to Postgres
by Anonymous Monk on Dec 10, 2011 at 11:52 UTC

    This is a tough situation. Personally, I would go through every text column of every row in the original database, inspect it (with e.g. Encode::is_utf8()), and if it happens not to be valid utf8, try to decode it with Encode::decode("cp1252", $str) ("cp1252" seems to be a good bet for many western languages), and only then insert it into the postgres database. This will probably leave you with some corrupted entries, which you can later figure out how to detect and fix.
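    Something like this, per column (a sketch; it tests validity by attempting a strict UTF-8 decode, since Encode::is_utf8() only reports Perl's internal flag, not whether raw bytes are well-formed UTF-8):

        use Encode qw(decode FB_CROAK LEAVE_SRC);

        sub to_utf8_chars {
            my ($bytes) = @_;
            return undef unless defined $bytes;
            # try a strict UTF-8 decode first (LEAVE_SRC keeps $bytes intact);
            # fall back to cp1252 if the bytes are not valid UTF-8
            my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
            return defined $chars ? $chars : decode('cp1252', $bytes);
        }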

      ...and even then you may have to watch out for doubly-encoded UTF-8 (maybe the database driver does that, maybe it was inserted into the database that way, maybe ...). You should also look into Text::Iconv for your conversion, as it works with raw bytes and does not mind perl's "utf8 bit".
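      Basic Text::Iconv usage is along these lines (conversion names depend on your system's iconv; $octets is whatever raw column value you pull from DBI):

          use Text::Iconv;

          my $converter = Text::Iconv->new('cp1252', 'utf-8');
          # convert() operates on raw bytes and returns undef on failure
          # unless Text::Iconv->raise_error(1) has been set
          my $utf8 = $converter->convert($octets);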

Re: dealing with encoding while converting data from MySQL to Postgres
by Anonymous Monk on Dec 10, 2011 at 11:31 UTC
    That is a very tough data-format to be having to accept as input. Is there absolutely no way to get XML from that MySQL database? Do you have any say at all about what it is you "inherit?"