jkeenan1 has asked for the wisdom of the Perl Monks concerning the following question:
This question concerns strings with UTF-8 characters represented by more than one byte, their representation in various formats, including XML, and their storage in a database.
Here is my string:

ABC»DEF abc»def
Note the two instances of RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK (http://www.utf8-chartable.de/). This is Unicode code point U+00BB. Expressed in hexadecimal notation, its UTF-8 encoding is:

c2 bb
So when I examine this string with, say, hexdump -C, I get:
$ echo 'ABC»DEF abc»def' | hexdump -C
00000000  41 42 43 c2 bb 44 45 46  20 61 62 63 c2 bb 64 65  |ABC..DEF abc..de|
00000010  66 0a                                             |f.|
00000012
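The same bytes can be confirmed from within Perl. This is only a minimal sketch using the core Encode module; the helper name utf8_hex is made up for illustration and is not part of our codebase.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 encoding of a character string
# as space-separated hex bytes.
sub utf8_hex {
    my ($chars) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', encode('UTF-8', $chars);
}

# U+00BB is written as an escape so the result does not depend on
# how this source file itself is encoded.
print utf8_hex("\x{BB}"), "\n";                       # c2 bb
print utf8_hex("ABC\x{BB}DEF abc\x{BB}def"), "\n";    # same bytes as the hexdump, minus the trailing 0a
```

Unlike the echo pipeline, no newline is appended here, so the output stops at 66.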
At $job we have a Catalyst- and REST-based web application which accepts user input and stores it in a PostgreSQL database whose encoding is UTF-8. I can verify that when I input the string above into a text or varchar field, it is correctly stored in the database -- "French" quotes and all.
In addition, in our Perl codebase we have a test suite in which we set up temporary PostgreSQL databases, make POST calls that write to those databases, and then make GET calls to confirm that the data has been correctly stored. The data is reported in XML format, so we use Test::XPath to walk the XML to get to the node whose content we wish to validate.
# $res: HTTP::Response object
# $tx:  Test::XPath object
$funny_name = 'ABC»DEF abc»def';
$tx->is('/result/entity/prop[@name="name"]/@value',
    $funny_name,
    "Got name '$funny_name'") or diag($res->content);
This test PASSes.
However, should I then use Test::More::diag() to dump the XML content directly:

diag($res->content);
... I get:
# <prop name="name" value="ABCÂ»DEF abcÂ»def" />
In the XML, a LATIN CAPITAL LETTER A WITH CIRCUMFLEX (Unicode code point U+00C2; UTF-8 hexadecimal 'c3 82') is being inserted before the RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK.
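For what it's worth, I can reproduce the same pair of characters outside the application by taking the UTF-8 bytes of U+00BB and decoding them as ISO-8859-1. This is only a sketch of one way the symptom can arise; I am assuming nothing about where (or whether) this happens in our stack, and the helper name misread_as_latin1 is made up.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: encode a character string as UTF-8, then (wrongly)
# decode those bytes as ISO-8859-1, so each byte becomes one character.
sub misread_as_latin1 {
    my ($chars) = @_;
    return decode('ISO-8859-1', encode('UTF-8', $chars));
}

my $misread = misread_as_latin1("\x{BB}");

# The two bytes c2 bb are now two separate characters.
printf "U+%04X\n", ord($_) for split //, $misread;    # U+00C2, then U+00BB

# Encoding the misread string as UTF-8 again gives four bytes.
my $reencoded = encode('UTF-8', $misread);
print join(' ', map { sprintf '%02x', $_ } unpack 'C*', $reencoded), "\n";    # c3 82 c2 bb
```

So the 'c3 82' sequence appears in front of 'c2 bb' whenever already-encoded bytes are treated as Latin-1 characters and encoded a second time.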
Can anyone explain why this is happening?