Database vs XML output representation of two-byte UTF-8 character
by jkeenan1 (Deacon) on Sep 06, 2014 at 12:31 UTC
jkeenan1 has asked for the wisdom of the Perl Monks concerning the following question:
This question concerns strings containing characters whose UTF-8 encoding takes more than one byte, how those characters are represented in various formats (including XML), and how they are stored in a database.
Here is my string:
Note the two instances of RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK (http://www.utf8-chartable.de/). This is Unicode code point U+00BB. Expressed in hexadecimal notation, its UTF-8 encoding is:
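The encoding can also be confirmed from Perl itself. This is a minimal sketch that assumes only the core Encode module; the variable names are illustrative and not taken from our codebase:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK as a Perl character string
my $char = "\x{BB}";

# Encode the character to UTF-8 octets
my $bytes = encode('UTF-8', $char);

# Render each octet as two lowercase hex digits
my $hex = join ' ', unpack('(H2)*', $bytes);

print "$hex\n";    # prints "c2 bb"
```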
So when I examine this string with, say, hexdump -C, I get:
At $job we have a Catalyst- and REST-based web application which accepts user input and stores it in a PostgreSQL database whose encoding is UTF-8. I can verify that when I input the string above into a text or varchar field, it is correctly stored in the database -- "French" quotes and all.
In addition, in our Perl codebase we have a test suite in which we set up temporary PostgreSQL databases, make POST calls to that database and then make GET calls to confirm that data has been correctly stored. The data is reported in XML format, so we use Test::XPath to walk the XML to get to the node whose content we wish to validate.
This test PASSes.
However, should I then use Test::More::diag() to dump the XML content directly:
... I get:
In the XML, a LATIN CAPITAL LETTER A WITH CIRCUMFLEX (Unicode code point U+00C2; UTF-8 hexadecimal 'c3 82') is being inserted before the RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK.
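To make the symptom concrete: the bytes 'c3 82 c2 bb' are exactly what results when the two UTF-8 octets of U+00BB are misread as Latin-1 characters and then encoded to UTF-8 a second time. The sketch below reproduces that sequence using the core Encode module. Whether this double encoding is what actually happens somewhere in our stack is an assumption I have not confirmed, and the variable names are illustrative only:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Correct UTF-8 octets for U+00BB: c2 bb
my $good = encode('UTF-8', "\x{BB}");

# Simulated mistake: each octet is read as if it were a Latin-1 character,
# giving the two characters U+00C2 and U+00BB ...
my $misread = decode('ISO-8859-1', $good);

# ... and the result is then encoded to UTF-8 a second time
my $double = encode('UTF-8', $misread);

my $hex = join ' ', unpack('(H2)*', $double);
print "$hex\n";    # prints "c3 82 c2 bb"
```

If something like this is the cause, the usual places to look would be an output handle with a UTF-8 encoding layer that is handed already-encoded octets, or data that was never decoded into characters on the way in.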
Can anyone explain why this is happening?
Note to self: This link may be helpful: http://www.psteiner.com/2010/09/how-to-fix-weird-characters-in-xml.html