http://www.perlmonks.org?node_id=607738

xaprb has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I've read enough about encodings and collations and such to make my brain squishy. However, I still can't figure out how to solve my problem.

The problem is I need to compare character data in a way consistent with MySQL. I am comparing data in a table across two servers. Assume the tables have identical structures in every way, but may contain different data. I connect to each server and select col1 from each table. col1 is a varchar column with charset latin1 and collation latin1_swedish_ci (these are defaults). Now I loop through the rows returned from each table, comparing the data to determine which rows are extra or missing between the two tables.

MySQL sorts case-insensitive (hence latin1_swedish_ci) of course. I can duplicate this without problem: I just use "lc $a cmp lc $b".

This works okay until I come to a word like éclair. MySQL knows that this should be sorted just before "e". Perl thinks é is greater than "z".

I can go the other way, and make MySQL sort things the same way as Perl with "ORDER BY BINARY col1", but this defeats indexes and makes MySQL sort inefficiently, and that is highly undesirable for my purposes.

I feel sure there's a way to do this, but danged if I can figure out how. I have 8 Firefox tabs open with everything from perluniintro to Encode::Byte, and my brain is full :-) Anyone? Anyone?

Thank you!

UPDATE The idea to reach back to MySQL and ask it to compare the strings is how I'm going to try to do it. Thank you all very much for the great ideas!

  • Comment on locales, encodings, collations, charsets... how can I match a given MySQL collation?

Replies are listed 'Best First'.
Re: locales, encodings, collations, charsets... how can I match a given MySQL collation?
by rhesa (Vicar) on Apr 02, 2007 at 03:13 UTC
    Have a look at Unicode::Collate.
    #!/usr/bin/perl use Unicode::Collate; my $coll = Unicode::Collate->new; my $e = "\xe9clair"; my @cmp = qw/ aardvark each eden zarkon /; for my $d ( @cmp ) { printf "%10s ccmp %5s = %d\n", $d, $e, $coll->cmp( $d, $e ); } __END__ aardvark ccmp éclair = -1 each ccmp éclair = -1 eden ccmp éclair = 1 zarkon ccmp éclair = 1
    Update:: Another approach would be to use Text::Unidecode, and strip off any accents before doing the comparison.

      I'm still nervous about this approach matching exactly MySQL. I need to match it exactly, even if MySQL's sorting is wrong, or the algorithm will run off the rails. I think varian's idea will probably work well. Thank you for the suggestions though!

Re: locales, encodings, collations, charsets... how can I match a given MySQL collation?
by GrandFather (Saint) on Apr 02, 2007 at 02:11 UTC

    Sounds a bit XYish. Something like List::Compare may be more appropriate if what you are doing is finding various intersections and disjunctions between sets of things.


    DWIM is Perl's answer to Gödel

      Thanks for the link. I am sure I can use that module in a script I wrote at work, but not here. Here, I may be comparing two large tables and finding one or two rows that differ. I can't put it all in memory, especially not in seen-hashes. I need to be able to suck rows one at a time from MySQL, in sorted order, and have Perl know when a row is missing from one table because one row is "greater than" or "less than" the other.

      I've seen the term XY but not seen a definition. What does it mean?

        Generally an XY question is one that pertains to solving a detail problem without providing the larger context. Very often in such a case the larger context suggests a better solution than solving the detail problem.

        I don't see a better solution straight off for your problem here however, although the larger context that you've provided may help those more database savvy than I to find a good solution for you.


        DWIM is Perl's answer to Gödel
        xaprb:

        Perhaps the right approach would be to use a mergesort-like method. Basically, execute a select on each table, specifying the order so MySQL will create the proper order for you. Then you can process the tables in parallel. To see an example, I previously described a similar thing in Re: How to deal with Huge data.

        ...roboticus

Re: locales, encodings, collations, charsets... how can I match a given MySQL collation?
by varian (Chaplain) on Apr 02, 2007 at 08:14 UTC
    Yes, sorting can be tricky between two different environments. Note that this is not just between Perl and MySQL, because the way a database gets sorted can be subject to locale on the database server side (so 2 databases can be sorted differently) and it can get even worse.

    So you need a solution that does all sorting in one environment or does not rely on sorting at all. Here are some options:

    1) if the tables are relatively small so that they could fit into memory consider to have Perl do the matching for you. Load the row data of table A into a hash A (key=col1), then get each row from table B (the other server), and check if it exists/matches data in hash A. If so then data matches and you can delete the hash A key. If there was no match then you have hit an entry unique to one table.

    2) for larger tables you will need to rely on the database server to perform match operations for you. See if you can copy the content of the table in database A to a temporary table in database B. Next compare the content of the two tables in database B via a select statement that returns rows that do not match.

    3) if all that is not feasible then your last resort is to select all rows from table A, and foreach row in table A do a lookup in table B via a SELECT WHERE on the indexed key in table B to see if there is a corresponding row. When finished you will have to do the reverse as well to find rows in table B that do not have a matching row in table A.
    make sure to prepare the SELECT WHERE sql outside of the loop, because to reissue the same sql statement with a new bind value is relatively inexpensive, especially when searching over an indexed column.

    Update: a little trick to use MySQL's sorting rules within Perl would be to do a database call with no table (or the dummy table 'dual') with a select with two bind variables, e.g.:

    SELECT ? > ?;
    The result of this query will be true (1) or false (0)

    Now why didn't I think of that option before?

      Slight clarification: I don't know about other RDBMS products, but in MySQL's case, the sorting is only dependent on the column's collation, which is easy to find with SHOW FULL COLUMNS. If my tool discovers that the two columns have different collations, I think it will have to die with an error.

        Note that even in MySQL if not defined explicitly collation will be dependent upon collation settings for database and if that is default then even upon collation for server instance.

        BTW the handling of collation is vendor dependent so if you want to stick to vendor-neutral DBI then you don't want to count on it.

      Oooh, that's a great idea! It would still require a potentially large amount of network traffic, but there's an additional benefit: if I specify the collation explicitly when I create the $sth, I can let MySQL handle any coercion between collations if the columns in the two tables do have different collations. Then there's a much better shot at getting consistent behavior.

      Yep, I had thought about all these things too, but was trying my best to find a way to do it in Perl to be as network and time efficient as possible. *sigh* It may be an impossible dream. There are a lot of variables, even in options 2 and 3.

Re: locales, encodings, collations, charsets... how can I match a given MySQL collation?
by dk (Chaplain) on Apr 02, 2007 at 08:06 UTC
    Possibly you've already figured it out, but that's the effect of not doing use locale (perldoc locale) . Because indeed, perl -le 'print sort qw(e é f z)' prints e f z é, but env LANG=sv_SE.ISO8859-1 perl -Mlocale -le 'print sort qw(e é f z)' does it the right way, e é f z.

      I read in the man pages (but forget where) that locale was deprecated in favor of Unicode modules.

        I've never seen anything like that, but even if you are correct, I fail to see how that invalidates my answer.