This question concerns strings with UTF-8 characters represented by more than one byte, their representation in various formats, including XML, and their storage in a database.

Here is my string:

ABC»DEF abc»def

Note the two instances of RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK (»). This is Unicode code point U+00BB. Expressed in hexadecimal notation, its UTF-8 encoding is:

c2 bb

So when I examine this string with, say, hexdump -C, I get:

$ echo 'ABC»DEF abc»def' | hexdump -C
00000000  41 42 43 c2 bb 44 45 46  20 61 62 63 c2 bb 64 65  |ABC..DEF abc..de|
00000010  66 0a                                             |f.|
00000012
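The same byte sequence can be checked from Perl itself. This is only a minimal sketch using the core Encode module; the variable names are illustrative, and the string is built from code points so the result does not depend on how the script file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build the string from code points (U+00BB written as \x{BB}) so the
# example does not depend on the encoding of this source file.
my $string = "ABC\x{BB}DEF abc\x{BB}def";

# Encode the character string into UTF-8 bytes.
my $bytes = encode('UTF-8', $string);

# Render each byte as two lowercase hex digits, as hexdump -C does.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $bytes;

print "$hex\n";
# prints: 41 42 43 c2 bb 44 45 46 20 61 62 63 c2 bb 64 65 66
```

Unlike the echo pipeline, no trailing newline byte (0a) appears here, because the string itself does not end in one.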

At $job we have a Catalyst- and REST-based web application which accepts user input and stores it in a PostgreSQL database encoded in UTF-8. I can verify that when I input the string above into a text or varchar field, it is correctly stored in the database -- "French" quotes and all.

In addition, in our Perl codebase we have a test suite in which we set up temporary PostgreSQL databases, make POST calls to that database and then make GET calls to confirm that data has been correctly stored. The data is reported in XML format, so we use Test::XPath to walk the XML to get to the node whose content we wish to validate.

# $res: HTTP::Response object
# $tx: Test::XPath object
$funny_name = 'ABC»DEF abc»def';
$tx->is('/result/entity/prop[@name="name"]/@value',
    $funny_name,
    "Got name '$funny_name'") or diag($res->content);

This test PASSes.

However, should I then use Test::More::diag() to dump the XML content directly:


... I get:

# <prop name="name" value="ABCÂ»DEF abcÂ»def" />

In the XML, a LATIN CAPITAL LETTER A WITH CIRCUMFLEX (Unicode code point U+00C2; UTF-8 hexadecimal 'c3 82') is being inserted before the RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK.
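For reference, the byte pattern described above (c3 82 followed by c2 bb) is what results when the UTF-8 bytes for U+00BB are treated as two single-byte characters and then encoded to UTF-8 a second time. The sketch below only reproduces that symptom with the core Encode module. It is not a diagnosis of the application, and the assumption that something in the output path re-encodes already-encoded bytes is mine.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# UTF-8 bytes for U+00BB, as they would be stored.
my $utf8_bytes = "\xc2\xbb";

# Hypothetical mishap: each byte is taken to be one Latin-1 character,
# and the resulting two characters are encoded to UTF-8 again.
my $as_latin1_chars = decode('ISO-8859-1', $utf8_bytes);
my $double_encoded  = encode('UTF-8', $as_latin1_chars);

# Decoding the original bytes as UTF-8 instead yields a single character.
my $decoded = decode('UTF-8', $utf8_bytes);

my $hex = join ' ', map { sprintf '%02x', ord } split //, $double_encoded;
print "double-encoded: $hex\n";
# prints: double-encoded: c3 82 c2 bb
printf "decoded once:   U+%04X (%d character)\n", ord($decoded), length $decoded;
# prints: decoded once:   U+00BB (1 character)
```

If the dumped XML contains the four-byte sequence while the XPath comparison sees a single U+00BB, that would suggest the two code paths handle decoding differently, but the post does not give enough information to confirm it.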

Can anyone explain why this is happening?

Jim Keenan
Note to self: This link may be helpful:

In reply to Database vs XML output representation of two-byte UTF-8 character by jkeenan1
