PerlMonks  

XML::DOM and Accented Characters

by freeflyer (Novice)
on Aug 06, 2010 at 09:17 UTC ( [id://853358]=perlquestion )

freeflyer has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I'm fairly new to Perl and have come up against the following issue. We have (at work) an XML file that contains accented characters. These accented characters do not display correctly after the file is parsed and saved back out to a new file using XML::DOM.

I've created a small test XML (accentTest.xml) to demonstrate what I am seeing:

<?xml version="1.0" encoding="UTF-8"?>
<TEST> é </TEST>

And below is the perl code that reads this in and saves it back out as accentTestOutPut.xml.

use XML::DOM;

my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ("c:\\accentTest.xml");

# Print doc file
$doc->printToFile ("c:\\accentTestOutPut.xml");

# Print to string
print $doc->toString;

# cleanup
$doc->dispose;

I've used TextPad to view the files in binary format.

Prior to parsing, the hex code used for the e-acute in 'accentTest.xml' is 'C3 A9', which is correct according to the UTF-8 encoding table at http://www.utf8-chartable.de/. The file is also saved as UTF-8 (according to Notepad).

After saving with $doc->printToFile ("c:\\accentTestOutPut.xml") and viewing the result in TextPad, the hex code used for the e-acute is 'E9', which does not seem to be a valid UTF-8 hex code; the file itself is saved as ANSI (according to Notepad, anyway). If I view this file in PSPad I can see the e-acute, whereas in Notepad++ I cannot. I am far from an expert, but it seems to have something to do with encoding?

If I manually resave "c:\\accentTestOutPut.xml" (using notepad) as UTF-8 I can see my e-acute again in both PSPad and NotePad++.
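
For reference, the two hex values described above are the same character in two different encodings. A minimal sketch using the core Encode module (the helper name hex_bytes is illustrative only, and cp1252 is assumed to be what Notepad means by "ANSI" here):

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00E9 LATIN SMALL LETTER E WITH ACUTE as a Perl character string
my $e_acute = "\x{E9}";

# Illustrative helper: render a byte string as space-separated hex
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }

my $utf8   = hex_bytes(encode('UTF-8',  $e_acute));   # two bytes
my $cp1252 = hex_bytes(encode('cp1252', $e_acute));   # one byte

print "UTF-8:  $utf8\n";
print "cp1252: $cp1252\n";
```

A lone E9 byte is therefore what a Latin-1 or cp1252 file would contain, which is consistent with Notepad reporting the output as ANSI.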

Does anyone have any idea what is going on? Hopefully I've explained the issue clearly.

Using XML::LibXML I do not experience the same issue but I have been asked not to use this if possible.

Replies are listed 'Best First'.
Re: XML::DOM and Accented Characters
by almut (Canon) on Aug 06, 2010 at 10:27 UTC

    It seems the printToFile method has not been written with Unicode in mind [1]... but you could create a properly UTF-8-encoding file handle yourself, and then either use ->printToFileHandle, or simply ->print

    ...
    open my $fh, ">:utf8", "accentTestOutPut.xml" or die $!;

    # Print doc file
    $doc->printToFileHandle($fh);
    # or
    $doc->print($fh);

    $ hexdump -C accentTestOutPut.xml
    00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
    00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0" encoding="UT|
    00000020  46 2d 38 22 3f 3e 0a 3c  54 45 53 54 3e 20 c3 a9  |F-8"?>.<TEST> ..|
    00000030  20 3c 2f 54 45 53 54 3e  0a                       | </TEST>.|
                                                         ^^^^^

    ___

    [1] current implementation:

    sub printToFile
    {
        my ($self, $fileName) = @_;
        my $fh = new FileHandle ($fileName, "w") ||
            croak "printToFile - can't open output file $fileName";

        $self->print ($fh);
        $fh->close;
    }
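
The effect of the missing encoding layer can be reproduced without XML::DOM at all. The following is only a sketch: it uses in-memory file handles in place of real files, and a hand-built string standing in for what the parser returns. On a handle with no layer, Perl writes each character below U+0100 as a single byte, whereas the :utf8 layer encodes it.

```perl
use strict;
use warnings;

# Stand-in for the parsed document text (an assumption for this sketch)
my $text = "<TEST> \x{E9} </TEST>\n";
utf8::upgrade($text);    # mimic a parser that returns UTF-8-flagged character strings

# No encoding layer, as in printToFile
open my $raw, '>', \my $raw_bytes or die $!;
print {$raw} $text;
close $raw;

# Explicit UTF-8 layer, as in the suggested fix
open my $enc, '>:utf8', \my $utf8_bytes or die $!;
print {$enc} $text;
close $enc;

printf "no layer: %s\n", uc unpack 'H*', $raw_bytes;
printf ":utf8:    %s\n", uc unpack 'H*', $utf8_bytes;
```

The first line of output contains a single E9 for the e-acute and the second contains C3 A9, matching the two files described in the question.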

      Thanks almut

      I've got that working perfectly on a Unix machine but can't for the life of me get it to work under Windows. Unfortunately, the machine this has to run on is Windows.

      Any ideas?

        What exactly is the problem, does it still write e9 instead of c3 a9? (hard to believe that this would be different on Windows...)

        Maybe you need to add a BOM?

        open my $fh, ">:utf8", "accentTestOutPut.xml" or die $!;
        print $fh "\x{feff}"; # BOM
        $doc->print($fh);

        I think almut hit the mark: MS-Windows apps like wordpad, notepad, etc, all depend on having a file-initial byte-order-mark, expressed as the 3-byte utf8 rendering of the code point "U+FEFF", to serve as a sort of "magic number" so that the app "knows" the file contains utf8 data.
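
The "3-byte utf8 rendering" of U+FEFF mentioned above can be checked directly. This is a small sketch using the core Encode module; has_utf8_bom is a hypothetical helper written for this example, not part of any library.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+FEFF encoded as UTF-8 is the three-byte signature EF BB BF
my $bom = encode('UTF-8', "\x{FEFF}");
printf "BOM bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $bom;

# Hypothetical helper: does a byte string start with the UTF-8 BOM?
sub has_utf8_bom { return substr($_[0], 0, 3) eq "\xEF\xBB\xBF" }

print has_utf8_bom($bom . '<?xml version="1.0"?>') ? "BOM found\n" : "no BOM\n";
```

To test a real file, read it through a ":raw" handle so that the bytes reach has_utf8_bom unmodified.
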
Re: XML::DOM and Accented Characters
by ikegami (Patriarch) on Aug 07, 2010 at 16:33 UTC
    On Windows,
    use strict;
    use warnings;

    use XML::DOM;

    my $xml = <<"__EOI__";
    <?xml version="1.0" encoding="UTF-8"?>
    <TEST> \xC3\xA9 </TEST>
    __EOI__

    my $parser = new XML::DOM::Parser;
    my $doc = $parser->parse($xml);
    $doc->printToFile("test.xml");

    >perl a.pl

    >perl -e"$/=\16; while (<>) { my $s=uc unpack 'H*', $_; $s=~s/..\K/ /g; print qq{$s\n}; }" test.xml
    3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31
    2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54
    46 2D 38 22 3F 3E 0A 3C 54 45 53 54 3E 20 E9 20
    3C 2F 54 45 53 54 3E 0A

    As previously shown, XML::DOM doesn't encode the output for you (though it should). So let's try with the previously mentioned fix:

    use strict;
    use warnings;

    use XML::DOM;

    my $xml = <<"__EOI__";
    <?xml version="1.0" encoding="UTF-8"?>
    <TEST> \xC3\xA9 </TEST>
    __EOI__

    my $parser = new XML::DOM::Parser;
    my $doc = $parser->parse($xml);
    open my $fh, ">:utf8", "test.xml" or die $!;
    $doc->printToFileHandle($fh);

    >perl a.pl

    >perl -e"$/=\16; while (<>) { my $s=uc unpack 'H*', $_; $s=~s/..\K/ /g; print qq{$s\n}; }" test.xml
    3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31
    2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54
    46 2D 38 22 3F 3E 0A 3C 54 45 53 54 3E 20 C3 A9
    20 3C 2F 54 45 53 54 3E 0A

    Perl did its thing correctly, so you have a problem with your editor. There are some solutions:

    • Tell your editor the file is encoded using UTF-8 through its menus.

    • Add a BOM. Most editors use this as a signal that the file is encoded using UTF-8.

      open my $fh, ">:utf8", "test.xml" or die $!;
      print($fh "\x{FEFF}");
      $doc->printToFileHandle($fh);

    • Use the encoding the editor expects (cp1252?)

      ...Fix the <?xml?> line...
      open my $fh, ">:encoding(cp1252)", "test.xml" or die $!;
      $doc->printToFileHandle($fh);

    You might want to check (using the above command) to make sure your input contains what you think it contains.
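
Alongside a hex dump, one way to check what a file contains is to test whether its bytes decode as strict UTF-8. This is a rough sketch using the core Encode module; is_valid_utf8 is a hypothetical helper, and the two sample strings stand in for file contents read through a ":raw" handle.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the byte string decodes as strict UTF-8
sub is_valid_utf8 {
    my ($bytes) = @_;    # work on a copy; decode may modify its argument when CHECK is set
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# e-acute as UTF-8 (C3 A9), then as a lone E9 byte
print is_valid_utf8("<TEST> \xC3\xA9 </TEST>") ? "valid UTF-8\n" : "not UTF-8\n";
print is_valid_utf8("<TEST> \xE9 </TEST>")     ? "valid UTF-8\n" : "not UTF-8\n";
```

A file that fails this check while declaring encoding="UTF-8" in its first line would explain why editors and parsers reject it.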

      Hi ikegami

      I've tried selecting UTF-8 in the editor menus and inserting a BOM, but neither has seemingly worked. I think my files are coming out windows-1252 encoded because, without running the code you provided, just changing the first line to

      <?xml version="1.0" encoding="windows-1252"?>

      results in me being able to open the file OK

      What is confusing me is that running the code below

      #!/bin/perl -w

      use XML::DOM;
      use PerlIO::encoding;

      my $parser = new XML::DOM::Parser;
      my $doc = $parser->parsefile ("c:\\accentTest.xml", ProtocolEncoding => 'UTF-8');

      # Print doc file
      $doc->printToFile ("c:\\accentTestOutPut.xml");

      open my $fh, ">:encoding(UTF-8)", "accentTestOutPut.xml" or die $!;
      $doc->print($fh);

      $doc->dispose;

      on Windows results in a file that (to my untrained eye) appears not to be UTF-8 encoded (in hex I do not see the C3 A9 for the e-acute) and will not open without the above-mentioned first-line change. However, if I run the same code on Unix and open the resulting file on Windows, it's all fine: it appears properly UTF-8 encoded, and viewing the file in hex shows the C3 A9 expected for the e-acute.

      At the moment I don't understand why the problem is with the editors (not saying it isn't, I just don't understand why yet). What's confusing me is that the file created using the same code on Unix opens without issue.

        I think my files are coming out windows-1252 encoded

        You think? I gave you a tool to check. I also asked that you check your input file.
