in reply to Re: UTF-8 and XML::LibXML in thread UTF-8 and XML::LibXML
I have read that, but I am obviously missing the point somewhere. The point where it says "most functions of XML::LibXML that work with in-memory trees accept and return data as character strings (i.e. UTF-8 encoded with the UTF8 flag on)" made me think I would get the same encoded data out as I put in. This is repeated in the second of the basic rules and principles. I'm afraid I can't see anything indicating how to avoid the behaviour I see.
Regards,
John Davies
Re^3: UTF-8 and XML::LibXML
by choroba (Cardinal) on Nov 26, 2019 at 12:31 UTC
"UTF-8 flag" is a misnomer. Strings with this flag are Perl's internal Unicode strings (decoded characters), not UTF-8 encoded bytes. When creating or loading XML, use bytes. When supplying values to methods, use Unicode strings (e.g. $element->appendText( $unicode );).
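A minimal sketch of that division of labour, assuming a small UTF-8 document invented for illustration (the element names node1 and node2 and the variable names are not from the original question): bytes go in when loading, a character string goes in when calling a tree method, and serializing the document gives bytes back.

```perl
use strict;
use warnings;
use XML::LibXML;

# Loading: the document arrives as bytes (UTF-8 here, as declared in the prolog).
my $xml_bytes = qq{<?xml version="1.0" encoding="UTF-8"?><container><node1>\xC3\x9A</node1></container>};
my $dom = XML::LibXML->load_xml(string => $xml_bytes);

# Supplying values: pass a character string, not encoded bytes.
my $node2 = $dom->documentElement->appendChild($dom->createElement('node2'));
$node2->appendText("\N{U+00DA}");

# Serializing the document yields bytes again, in the document's encoding.
my $out_bytes = $dom->toString;
my $count = () = $out_bytes =~ /\xC3\x9A/g;
print "C3 9A occurs $count times in the serialized bytes\n";
```

Both nodes should end up with the same two bytes in the output, even though one was supplied as bytes at load time and the other as a single character through appendText.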
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
|
I don't think I'm getting confused by the flag as I'm not trying to read or write it. It's the text that talks about character strings that are UTF-8 encoded that may be confusing me, since the output is decoded. I thought I was creating the XML using bytes in the 6th line of my code, but if I'm getting that wrong, I would be interested. But that's not the real problem as I'm getting the same two bytes in my code and in the real files. The only method to which I believe I'm supplying values is the parser. I believe that I am putting encoded data in and getting decoded data back. That is the problem I am trying to solve - I can't see from the docs how to get encoded data back.
Regards,
John Davies
|
I can't see from the docs how to get encoded data back
You might not be able to. The whole point of the parser is to extract the information represented by the XML document, no matter how it's encoded using XML.
You shouldn't have to care whether "Ú" is stored as a character reference such as &#xDA;, as bytes C3 9A (in an XML document that uses UTF-8), or as byte DA (in an XML document that uses cp1252). Nor should you want to know.
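To make this concrete, here is a sketch that parses three byte-level spellings of the same content and compares what the tree hands back. The documents are invented for illustration. ISO-8859-1 stands in for cp1252 (byte DA means U+00DA in both), and the numeric character reference variant is an extra case added as an assumption.

```perl
use strict;
use warnings;
use XML::LibXML;

# Three different byte sequences that all describe the same character, U+00DA.
my %xml_bytes = (
    'UTF-8 bytes C3 9A'   => qq{<?xml version="1.0" encoding="UTF-8"?><container><node>\xC3\x9A</node></container>},
    'Latin-1 byte DA'     => qq{<?xml version="1.0" encoding="ISO-8859-1"?><container><node>\xDA</node></container>},
    'character reference' => qq{<?xml version="1.0"?><container><node>&#xDA;</node></container>},
);

my %text_for;
for my $label (sort keys %xml_bytes) {
    my $dom = XML::LibXML->load_xml(string => $xml_bytes{$label});
    my ($node) = $dom->findnodes('/container/node');
    $text_for{$label} = $node->textContent;    # a character string, not bytes
    printf "%-20s -> U+%04X (length %d)\n",
        $label, ord($text_for{$label}), length($text_for{$label});
}
```

Whatever the input looked like on disk, textContent should report the same single code point each time, which is why the tree has no reason to remember the original bytes.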
|
my ($container) = $dom->findnodes('/container');
my $n2 = $container->appendChild('XML::LibXML::Element'->new('node2'));
$n2->appendText("\N{LATIN CAPITAL LETTER U WITH ACUTE}");
binmode *STDOUT, ':encoding(UTF-8)';
print $dom;
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
|
I can't see from the docs how to get encoded data back
I didn't find a way either, but this is probably intentional, because you shouldn't. It is bad practice. As soon as Perl has parsed your document into a tree, it is entitled to forget which encoding it was delivered in.
If you want encoded data back, then you get to choose the encoding and encode it yourself.
I also think that a lot of Perl module documentation should be revisited with regard to the ominous "UTF-8 flag". The parenthetical "(UTF-8 encoded with UTF8 flag on)" is misleading at best and should be removed: the relevant thing is "character string", as opposed to "binary" string ("bytes" and "encoded" strings are binary for that purpose). For the user of any module, it isn't relevant which encoding Perl uses to store character strings internally.
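As a sketch of "choose the encoding and encode it yourself", using only the core Encode module: the variable $chars below is a stand-in (an assumption for illustration) for any character string a tree method such as textContent would return.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for a character string obtained from the parsed tree.
my $chars = "\N{U+00DA}";

# The caller picks the target encoding; the character string itself has none.
my $utf8_bytes   = encode('UTF-8',  $chars);
my $cp1252_bytes = encode('cp1252', $chars);

printf "characters: %d, UTF-8 bytes: %s, cp1252 bytes: %s\n",
    length($chars),
    uc unpack('H*', $utf8_bytes),
    uc unpack('H*', $cp1252_bytes);
```

The same character string produces different binary strings depending on the encoding requested, which is the distinction between "character string" and "binary string" drawn above.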
|
Re^3: UTF-8 and XML::LibXML
by ikegami (Patriarch) on Nov 26, 2019 at 20:30 UTC
"UTF-8 encoded with the UTF8 flag on" means decoded strings (strings of Unicode Code Points), not strings encoded using UTF-8 (strings of bytes). This is the right thing to do.
As such, $node contains decode("UTF-8", chr(195) . chr(154)), which is chr(218) ("\N{U+00DA}").
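That equivalence can be checked directly with the core Encode module. This is only a sketch of the decode step; the variable names are chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = chr(195) . chr(154);        # the two UTF-8 bytes C3 9A
my $chars = decode('UTF-8', $bytes);    # the decoded string of code points

printf "%d bytes in, %d character out: U+%04X\n",
    length($bytes), length($chars), ord($chars);
```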