PerlMonks  

Re^2: UTF-8 and XML::LibXML

by davies (Prior)
on Nov 26, 2019 at 12:17 UTC


in reply to Re: UTF-8 and XML::LibXML
in thread UTF-8 and XML::LibXML

I have read that, but I am obviously missing the point somewhere. The point where it says "most functions of XML::LibXML that work with in-memory trees accept and return data as character strings (i.e. UTF-8 encoded with the UTF8 flag on)" made me think I would get the same encoded data out as I put in. This is repeated in the second of the basic rules and principles. I'm afraid I can't see anything indicating how to avoid the behaviour I see.

Regards,

John Davies

Re^3: UTF-8 and XML::LibXML
by choroba (Cardinal) on Nov 26, 2019 at 12:31 UTC
    The "UTF-8 flag" is a misnomer. Strings with this flag are Perl-internal Unicode character strings, not UTF-8 encoded bytes. When creating or loading XML, use bytes. When supplying values to methods, use Unicode strings (e.g. $element->appendText($unicode);).
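
A minimal sketch of that rule, assuming XML::LibXML is installed. The sample document and the variable names are made up for illustration and are not from the original question. The parser is handed UTF-8 encoded bytes, while the node methods return and accept character strings.

```perl
use strict;
use warnings;
use XML::LibXML;

# The XML source as bytes: U+00DA is the two UTF-8 bytes C3 9A.
my $xml_bytes = qq{<?xml version="1.0" encoding="UTF-8"?>\n}
              . qq{<container><node>\xC3\x9A</node></container>};

# Loading: give the parser bytes.
my $dom = XML::LibXML->load_xml(string => $xml_bytes);

# Reading: methods return character strings, so this is one character.
my ($node) = $dom->findnodes('/container/node');
my $text   = $node->textContent;
my $length = length $text;    # counts characters, not bytes
my $ord    = ord $text;       # code point of the first character

# Writing: methods take character strings too.
$node->appendText("\N{U+00DA}");
my $new_length = length $node->textContent;

printf "length=%d ord=U+%04X new_length=%d\n", $length, $ord, $new_length;
```

The two input bytes come back as a single character, and appending one more character gives a length of two, which is the behaviour described in the question.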

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      I don't think I'm getting confused by the flag as I'm not trying to read or write it. It's the text that talks about character strings that are UTF-8 encoded that may be confusing me, since the output is decoded. I thought I was creating the XML using bytes in the 6th line of my code, but if I'm getting that wrong, I would be interested. But that's not the real problem as I'm getting the same two bytes in my code and in the real files. The only method to which I believe I'm supplying values is the parser. I believe that I am putting encoded data in and getting decoded data back. That is the problem I am trying to solve - I can't see from the docs how to get encoded data back.

      Regards,

      John Davies

        I can't see from the docs how to get encoded data back

        You might not be able to. The whole point of the parser is to extract the information represented by the XML document, no matter how it's encoded using XML.

        You shouldn't have to care whether "Ú" is stored as a character reference (e.g. &#xDA;), as bytes C3 9A (in an XML document that uses UTF-8), or as byte DA (in an XML document that uses cp1252). Nor should you want to know.
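
A small sketch of that point, assuming XML::LibXML is installed. It uses ISO-8859-1 rather than cp1252 for the single-byte case, since byte DA means the same character in both and libxml2 handles ISO-8859-1 without extra converters. The element name and hash keys are illustrative only.

```perl
use strict;
use warnings;
use XML::LibXML;

# Three byte-level spellings of the same one-character document content.
my %source = (
    'character reference' => qq{<?xml version="1.0"?><n>&#xDA;</n>},
    'UTF-8 bytes'         => qq{<?xml version="1.0" encoding="UTF-8"?><n>\xC3\x9A</n>},
    'ISO-8859-1 byte'     => qq{<?xml version="1.0" encoding="ISO-8859-1"?><n>\xDA</n>},
);

# After parsing, each one yields the same code point.
my %code_point;
for my $name (sort keys %source) {
    my $dom = XML::LibXML->load_xml(string => $source{$name});
    $code_point{$name} = ord $dom->documentElement->textContent;
    printf "%-20s U+%04X\n", $name, $code_point{$name};
}
```

Once the documents are parsed, nothing in the tree distinguishes the three inputs.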

        You're not putting any data in. You're creating the XML using bytes, and you're getting decoded data back from a method, exactly as documented.

        This is what putting decoded data in means:

        my ($container) = $dom->findnodes('/container');
        my $n2 = $container->appendChild('XML::LibXML::Element'->new('node2'));
        $n2->appendText("\N{LATIN CAPITAL LETTER U WITH ACUTE}");
        binmode *STDOUT, ':encoding(UTF-8)';
        print $dom;

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        I can't see from the docs how to get encoded data back

        I didn't find a way either, but this is probably intentional, because you shouldn't: it is bad practice. As soon as Perl has parsed your document into a tree, it is entitled to forget which encoding the document was delivered in.

        If you want encoded data back, then you get to choose the encoding and do the encoding yourself.
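
A hedged sketch of "encode by yourself", using only the core Encode module. The character string here is a stand-in for whatever a node method returned; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string, such as a node's text content: one character, U+00DA.
my $chars = "\N{U+00DA}";

# Choose an encoding and encode explicitly to get bytes.
my $utf8_bytes   = encode('UTF-8',      $chars);   # two bytes
my $latin1_bytes = encode('ISO-8859-1', $chars);   # one byte

# Hex dumps of the resulting byte strings.
my $utf8_hex   = unpack 'H*', $utf8_bytes;
my $latin1_hex = unpack 'H*', $latin1_bytes;

print "UTF-8: $utf8_hex, ISO-8859-1: $latin1_hex\n";
```

The same character becomes different bytes depending on the encoding chosen, which is why the choice has to be made by the caller rather than remembered by the parser.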

        I also think that lots of Perl module documentation should be revisited with regard to the ominous "UTF-8 flag". The parenthesis "(UTF-8 encoded with UTF8 flag on)" is at least misleading and should best be eradicated: the relevant thing is "character string", as opposed to "binary" string ("bytes" and "encoded" strings are binary for that purpose). For the user of any module it isn't relevant in which encoding Perl stores character strings internally.
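
To illustrate why the internal storage should not matter to a module's user, here is a small core-Perl sketch (the variable names are made up). The same one-character string is held once without and once with the UTF8 flag, and Perl treats the two as equal.

```perl
use strict;
use warnings;

my $native   = "\xDA";      # one character, stored internally as a single byte
my $upgraded = "\xDA";
utf8::upgrade($upgraded);   # same character, now stored internally as UTF-8

# The internal flag differs between the two copies...
my $native_flag   = utf8::is_utf8($native)   ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;

# ...but as character strings they are indistinguishable.
my $same_string = $native eq $upgraded                 ? 1 : 0;
my $same_length = length($native) == length($upgraded) ? 1 : 0;

print "flags=$native_flag/$upgraded_flag equal=$same_string same_length=$same_length\n";
```

Only the flag values differ; comparison and length give the same answers for both copies.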

Re^3: UTF-8 and XML::LibXML
by ikegami (Patriarch) on Nov 26, 2019 at 20:30 UTC

    "UTF-8 encoded with the UTF8 flag on" means decoded strings (strings of Unicode Code Points), not strings encoded using UTF-8 (strings of bytes). This is the right thing to do.

    As such, $node contains decode("UTF-8", chr(195) . chr(154)), which is chr(218) ("\N{U+00DA}").
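
That identity can be checked directly with the core Encode module. This is a minimal sketch that needs no XML at all; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = chr(195) . chr(154);        # the two bytes C3 9A
my $chars = decode('UTF-8', $bytes);    # the decoded character string

# The decoded string is the single character chr(218), i.e. U+00DA.
my $matches = $chars eq chr(218) ? 1 : 0;

printf "bytes=%d chars=%d first=U+%04X matches=%d\n",
    length($bytes), length($chars), ord($chars), $matches;
```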
