http://www.perlmonks.org?node_id=1099752

jkeenan1 has asked for the wisdom of the Perl Monks concerning the following question:

This question concerns strings containing characters whose UTF-8 encoding occupies more than one byte, their representation in various formats (including XML), and their storage in a database.

Here is my string:

ABC»DEF abc»def

Note the two instances of RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK (http://www.utf8-chartable.de/). This is Unicode code point U+00BB. Expressed in hexadecimal notation, its UTF-8 encoding is:

c2 bb
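One quick way to double-check that from Perl itself (a sketch using the core Encode module; %v02x prints each byte of the result in hex):

$ perl -MEncode -e 'printf "%v02x\n", Encode::encode("UTF-8", "\x{BB}")'
c2.bb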

So when I examine this string with, say, hexdump -C, I get:

$ echo 'ABC»DEF abc»def' | hexdump -C
00000000  41 42 43 c2 bb 44 45 46  20 61 62 63 c2 bb 64 65  |ABC..DEF abc..de|
00000010  66 0a                                             |f.|
00000012

At $job we have a Catalyst- and REST-based web application which accepts user input and stores it in a PostgreSQL database denominated in UTF-8. I can verify that when I input the string above into a text or varchar field, it is correctly stored in the database -- "French" quotes and all.

In addition, in our Perl codebase we have a test suite in which we set up temporary PostgreSQL databases, make POST calls to the application and then make GET calls to confirm that the data has been correctly stored. The data is reported in XML format, so we use Test::XPath to walk the XML to get to the node whose content we wish to validate.

# $res: HTTP::Response object
# $tx:  Test::XPath object
$funny_name = 'ABC»DEF abc»def';
$tx->is('/result/entity/prop[@name="name"]/@value',
    $funny_name,
    "Got name '$funny_name'") or diag($res->content);

This test PASSes.

However, should I then use Test::More::diag() to dump the XML content directly:

diag($res->content);

... I get:

# <prop name="name" value="ABCÂ»DEF abcÂ»def" />

In the XML, a LATIN CAPITAL LETTER A WITH CIRCUMFLEX (Unicode code point U+00C2; UTF-8 hexadecimal 'c3 82') is being inserted before the RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK.

Can anyone explain why this is happening?

Jim Keenan
Note to self: This link may be helpful: http://www.psteiner.com/2010/09/how-to-fix-weird-characters-in-xml.html

Re: Database vs XML output representation of two-byte UTF-8 character
by ikegami (Pope) on Sep 07, 2014 at 03:18 UTC

    It's the result of double encoding. You encoded an output twice, or you encoded an output based on an input you forgot to decode.

    Looks like you are encoding your output using UTF-8, which would be sensible if you were outputting text to a UTF-8 terminal; but you are outputting a raw HTTP response containing an XML document that is already encoded in UTF-8.
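    A minimal sketch of that failure mode, using only the core Encode module (variable names are illustrative):

        use Encode qw(encode decode);

        my $text  = "\x{BB}";                # the character »
        my $once  = encode('UTF-8', $text);  # "\xc2\xbb" -- the correct wire bytes
        my $twice = encode('UTF-8', $once);  # "\xc3\x82\xc2\xbb" -- double encoded
        print $twice;                        # a UTF-8 terminal displays Â»

    Decoding the extra layer away (as in the follow-up below) restores the expected display.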

      ikegami, I believe your answer is correct. The terminal in question is the Mac OS X Terminal program with Preferences->Settings->Advanced->Character encoding: "Unicode (UTF-8)". I changed my debugging code to:

      diag(Encode::decode_utf8($res->content));
      ... and got the expected output in the Terminal (without affecting what was stored in or retrieved from the database).
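      An alternative that may be equivalent here is to let HTTP::Response decode the body itself, based on the charset declared in the Content-Type header (assuming the application sets one):

      diag($res->decoded_content);

      decoded_content() undoes any Content-Encoding and returns the body as a decoded character string.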

      Thank you very much.

      Jim Keenan
Re: Database vs XML output representation of two-byte UTF-8 character
by Anonymous Monk on Sep 07, 2014 at 10:59 UTC
    Can anyone explain why this is happening?
    1) Internally, Perl has two different kinds of strings. We'll call them 'binary' and 'unicode' strings.
    $ perl -MDevel::Peek -e 'Dump "Я"'
    FLAGS = (POK,READONLY,IsCOW,pPOK)
    This is a binary string.
    $ perl -MDevel::Peek -e 'use utf8; Dump "Я"'
    FLAGS = (POK,READONLY,IsCOW,pPOK,UTF8)
    This is a unicode string. It has the so-called 'UTF8 flag' turned on, while binary strings don't (internally, Perl 'unicode' strings are encoded in UTF-8).

    2) Perl pretends that it doesn't have two different types of strings. Whenever a binary string enters a 'Unicode context' (so to say), Perl converts the binary string to Unicode, with not entirely satisfactory results. Also, it sometimes tries to convert Unicode strings to binary, which also doesn't work very well.

    $ perl -E 'binmode STDOUT, ":encoding(UTF-8)"; say "Я"'
    Ð¯
    Binary string "Я" got mangled in Unicode context.
    $ perl -E 'use utf8; my $x = "Я"; no utf8; my $y = "Я"; say $x . $y'
    Wide character in say at -e line 1.
    ЯÐ¯
    Unicode string $x was concatenated with binary string $y, and $y was 'upgraded' to Unicode. At least, we got a warning...
    $ perl -wE 'use utf8; my $x = "ç"; no utf8; my $y = "ç"; say $x . $y'
    �ç
    Where's my warning, Perl?... And what happened with $x???

    3) Perl thinks that all binary strings (those without UTF-8 flag) are encoded in Latin-1. Whenever it sees fit, it converts them to Unicode. And vice versa.

    c2 bb becomes U+00C2 U+00BB. That is, "»" becomes "Â»" (from Latin-1 to Unicode). "ç" becomes "�" (from Unicode to Latin-1, which cannot be displayed on my terminal).
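    A sketch of that byte-to-code-point mapping (illustrative only):

        use Encode qw(encode);

        my $bytes = "\xc2\xbb";    # the two UTF-8 bytes of », in an unflagged string
        # Perl reads the unflagged string as Latin-1: byte c2 is the character
        # U+00C2 (Â), byte bb is U+00BB (»)...
        printf "U+%04X ", ord $_ for split //, $bytes;    # U+00C2 U+00BB
        # ...so encoding it "again" produces four bytes, which display as Â»:
        printf "\n%v02x\n", encode('UTF-8', $bytes);      # c3.82.c2.bb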

    4) To make things more interesting, Perl doesn't always turn the UTF-8 flag on

    $ perl -MDevel::Peek -E 'use utf8; Dump "This is America!"'
    FLAGS = (POK,READONLY,IsCOW,pPOK)

    Conclusion: to use Perl, you must either be an American, or an expert in Unicode and Perl internals. Well, you seem to be an American, Jim.

      Wow, this is completely wrong.

      Perl has two different kinds of strings. We'll call them 'binary' and 'unicode' strings.

      Awful names, and they have nothing to do with binary or Unicode.

      They are respectively strings of 8-bit chars and strings of 72-bit chars.

      it sometimes tries to convert Unicode strings to binary, which also doesn't work very well

      No, the problem is that you told it to encode text that was already encoded. It has nothing to do with the internal string formats.

      Binary string "Я" got mangled in Unicode context.

      No, you created garbage by concatenating UTF-8 and text. It has nothing to do with the internal string formats.

      When you do $number + $letters, Perl doesn't mangle anything; you did.
      When you do $text + $utf8, Perl doesn't mangle anything; you did.

      Conclusion: to use Perl, you must either be an American, or an expert in Unicode and Perl internals.

      Just like you wouldn't insert text into SQL without conversion, insert text into HTML without conversion, or insert text into a command line without conversion; all you have to do is not insert text into UTF-8 (or vice-versa) without conversion.

      It doesn't take an American to understand that 4 + apple is going to be garbage. Decode inputs. Encode outputs. That's it.
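      A minimal sketch of that discipline at the I/O boundary (the UTF-8 choice and the substitution are illustrative):

          use Encode qw(decode encode);

          my $raw  = do { local $/; <STDIN> };     # bytes arriving from outside
          my $text = decode('UTF-8', $raw);        # decode input once: bytes -> characters
          $text =~ s/\x{BB}/>>/g;                  # work on characters, never on bytes
          print encode('UTF-8', $text);            # encode output once: characters -> bytes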

        Wow, this is completely wrong.
        No, not completely. More importantly, it's a useful way to think about the problem.
        Awful names, and they have nothing to do with binary or Unicode. They are respectively strings of 8-bit chars and strings of 72-bit chars.
        Why, 'Unicode' is not an awful name. It's irrelevant that Perl's UTF-8 allows bigger codepoints than the Unicode Consortium defines. 'Binary' is maybe an awful name, but what's more awful is silent conversion from '8-bit chars' to UTF-8, or back.
        No, the problem is that you told it to encode text that was already encoded. It has nothing to do with the internal string formats.
        No, the problem is that mister Keenan, who is an experienced Perl programmer with quite a few modules on CPAN (pardon me if I got that wrong), appears to be confused about Perl's behaviour. It has everything to do with the way Perl works.
        No, you created garbage by concatenating UTF-8 and text. It has nothing to do with the internal string formats.
        No, perl the computer program created garbage, because of the way it works. What does that even mean 'concatenating UTF-8 and text'? Why doesn't that actually work? (you know why). Why can't Perl warn me that I'm doing something stupid? (you know why)
        When you do $number + $letters, Perl doesn't mangle anything; you did. When you do $text + $utf8, Perl doesn't mangle anything; you did.
        But when I did that unreasonable thing Perl didn't try to help me (like it tries to help when I do something like "1 + 'x'" ("argument isn't numeric...")). Yet here we have no warnings, no nothing. So it's not an error in Perl to do something stupid like $text + $utf8, IT'S SUPPOSED TO WORK LIKE THAT. And you know it. So yes, I can say that Perl mangled the strings, because this is the way it's intended to work.
        Just like you wouldn't insert text into SQL without conversion, insert text into HTML without conversion, or insert text into a command line without conversion; all you have to do is not insert text into UTF-8 (or vice-versa) without conversion.
        You know, Ikegami, it's true and not true. I actually know how to use Perl. But Perl provides absolutely no guidance towards that. And...
        Decode inputs. Encode outputs.
        Yes, yes. And how many Perl programs in the wild (or even on CPAN) actually do that? I'd say very few. Do you disagree? I'd even say most Perl programmers actually rarely need to do any encoding/decoding. Do you disagree?
        It doesn't take an American
        Perl works just fine when all you have is ASCII (or Latin-1). If you don't have ASCII/Latin-1... are names of files and directories binary or Unicode? (call it what you will). What about command-line parameters? Do I have to decode them? (yes). OK, why does "...or die $!;" produce garbage? Oh right, strerror returned something that is not ASCII/Latin-1 (and I heard some of the porters want to make Perl speak only English, arguing that English is better than mojibake). I'd say it's pretty confusing for your average Perl programmer. Let's keep things in perspective: Perl was never supposed to be something hardcore like C++.
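        For what it's worth, a sketch of the @ARGV case above, assuming the arguments arrive as UTF-8 bytes (Encode::FB_CROAK makes malformed input die loudly instead of silently turning into garbage):

            use Encode qw(decode);
            @ARGV = map { decode('UTF-8', $_, Encode::FB_CROAK) } @ARGV;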
      (this f site also doesn't like Anti-American letters and stuff)
Re: Database vs XML output representation of two-byte UTF-8 character
by Anonymous Monk on Sep 07, 2014 at 00:34 UTC

    Is the only problem you're having with Test::More::diag(), or is there a more serious place where the problem occurs?

    $tx->is('/result/entity/prop[@name="name"]/@value',&#8232;    $funny_name,&#8232; ...

    Could you clarify this bit, since it's not valid Perl?

    Have you tried other forms of output, such as writing things to a file (opened with the proper encoding)?
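    For instance (a sketch; assumes $xml already holds a decoded character string):

        open my $fh, '>:encoding(UTF-8)', 'dump.xml' or die "open failed: $!";
        print {$fh} $xml;    # the layer encodes characters to UTF-8 bytes on the way out
        close $fh or die "close failed: $!";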

Re: Database vs XML output representation of two-byte UTF-8 character
by Anonymous Monk on Sep 06, 2014 at 23:26 UTC
    Open that file in a hexadecimal editor so that you can see, byte for byte, exactly what those bytes actually are. I would not try to dump the content to the terminal and expect to visually inspect it. There are just too many sources of weirdness that could come into play. Look at the data.
      Open that file

      What file?

      jkeenan1 is working with Catalyst + PostgreSQL ...

        That was my post, which I lost control of, so let me clarify what I was suggesting. I read the OP to mean that the output of a diag() call (within a test suite) was producing results that, say, when printed to the console terminal, seemed to be munged up. So, what I was suggesting was that you should divert that output instead to a disk file, then use a hex editor to examine byte-by-byte exactly what is in between value=" and the subsequent ".

        We see in a later post that, indeed, the garble was being caused by the character encoding of the terminal window. I anticipated that this could be the case, because there are just so-o-ooo many places where encoding/decoding can happen in both directions along that particular food chain.

        It is also possible, e.g. in MySQL, to dump the contents of a field in hexadecimal form, and once again this is the strategy that I recommend. Get some view that will show you what the bytes are, making zero attempt to decode them as anything. Only then can you really know.
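        For example, a small sketch of that diversion (the file name is illustrative):

            # Dump the raw, undecoded response body to disk...
            open my $fh, '>:raw', 'response.bin' or die "open failed: $!";
            print {$fh} $res->content;    # content() returns the body as bytes
            close $fh or die "close failed: $!";
            # ...then inspect it byte-by-byte:  hexdump -C response.bin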

Re: Database vs XML output representation of two-byte UTF-8 character
by mje (Curate) on Sep 08, 2014 at 10:09 UTC

    use Test::More::UTF8
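    Roughly speaking, that module puts a UTF-8 layer on Test::Builder's output handles, so diag() and friends emit properly encoded bytes. A hand-rolled sketch of the same idea (note that with such a layer in place you must pass decoded character strings to diag(), not raw bytes, or you get exactly the double encoding seen above):

        use Test::Builder;

        my $builder = Test::Builder->new;
        binmode $builder->output,         ':encoding(UTF-8)';   # normal test output
        binmode $builder->failure_output, ':encoding(UTF-8)';   # where diag() writes
        binmode $builder->todo_output,    ':encoding(UTF-8)';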