http://www.perlmonks.org?node_id=853371


in reply to XML:: DOM and Accented Characters

It seems the printToFile method has not been written with Unicode in mind [1]... but you could create a properly UTF-8-encoding file handle yourself, and then use either ->printToFileHandle or simply ->print:

    ...
    open my $fh, ">:utf8", "accentTestOutPut.xml" or die $!;

    # Print doc file
    $doc->printToFileHandle($fh);
    # or $doc->print($fh);

    $ hexdump -C accentTestOutPut.xml
    00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
    00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0" encoding="UT|
    00000020  46 2d 38 22 3f 3e 0a 3c  54 45 53 54 3e 20 c3 a9  |F-8"?>.<TEST> ..|
    00000030  20 3c 2f 54 45 53 54 3e  0a                       | </TEST>.|
                                                             ^^^^^
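
To see the effect of the layer in isolation, here is a minimal sketch that writes a single e-acute (U+00E9) to an in-memory file handle, once without any layer and once through ":utf8". The helper name bytes_written is made up for this illustration and is not part of XML::DOM; on a typical Perl the first call should yield the single byte e9 and the second the two bytes c3 a9.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::DOM): write $text to an in-memory
# file through the given PerlIO layer and return the bytes as a hex string.
sub bytes_written {
    my ($layer, $text) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh or die "Cannot close in-memory handle: $!";
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $buffer;
}

my $e_acute = "\x{e9}";    # U+00E9, LATIN SMALL LETTER E WITH ACUTE

print "no layer: ", bytes_written('',      $e_acute), "\n";
print ":utf8:    ", bytes_written(':utf8', $e_acute), "\n";
```
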

___

[1] Current implementation:

    sub printToFile {
        my ($self, $fileName) = @_;
        my $fh = new FileHandle ($fileName, "w") ||
            croak "printToFile - can't open output file $fileName";
        $self->print ($fh);
        $fh->close;
    }

Re^2: XML:: DOM and Accented Characters
by freeflyer (Novice) on Aug 06, 2010 at 14:12 UTC

    Thanks almut

    I've got that working perfectly on a Unix machine but can't for the life of me get it to work under Windows. Unfortunately, the machine this has to run on is a Windows machine.

    Any ideas?

      What exactly is the problem? Does it still write e9 instead of c3 a9? (Hard to believe that this would be different on Windows...)

      Maybe you need to add a BOM?

      open my $fh, ">:utf8", "accentTestOutPut.xml" or die $!;
      print $fh "\x{feff}"; # BOM
      $doc->print($fh);

      I think almut hit the mark: MS Windows apps like WordPad, Notepad, etc. all depend on having a file-initial byte-order mark, expressed as the 3-byte UTF-8 rendering of the code point "U+FEFF", to serve as a sort of "magic number" so that the app "knows" the file contains UTF-8 data.
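
As a small illustration of that 3-byte rendering, the sketch below encodes U+FEFF with the core Encode module and checks whether a byte string starts with the result. The helper name starts_with_utf8_bom is invented for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The BOM is the code point U+FEFF; encoded as UTF-8 it becomes three bytes.
my $bom_bytes = encode('UTF-8', "\x{feff}");

# Hypothetical helper: true if the given byte string begins with a UTF-8 BOM.
sub starts_with_utf8_bom {
    my ($bytes) = @_;
    return substr($bytes, 0, length($bom_bytes)) eq $bom_bytes;
}

printf "BOM bytes: %s\n",
    join ' ', map { sprintf('%02x', ord($_)) } split //, $bom_bytes;
```

To check a real file with it, the file would have to be read as raw bytes (for example via a handle opened with "<:raw"), so that no decoding happens before the comparison.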

        Thanks for the help, but I'm still unable to get it to work even after adding the BOM, although I am learning along the way.

        I'm now using both TextPad and NotePad++ (with plugin) to view the hex codes for the output file (accentTestOutPut.xml). I've also run it on both my work and home PCs, both running Windows.

        After running the code provided by almut, I'm still not seeing C3 A9 as the hex code for the e-acute. TextPad is displaying an E9 code and NotePad++ EF BF BD. It also looks as if the BOM is not there: I am unable to see the code EF BB BF at the start of the file (which is what I should see, right?).

        Using the UTF8BOM package to insert the BOM, I can see that the BOM is there in both cases (TextPad and NotePad++), since EF BB BF appears at the start of the file. However, both programs now display E9 as the code for the e-acute, not the C3 A9 I'm looking for.

        Incidentally, at no point have I been able to open the output file in Internet Explorer; it complains of an invalid character at the point of the e-acute.

        Here's the output after trying to insert the BOM using

         print $fh "\x{feff}";

        TextPad

         0: 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 <?xml version="1
        10: 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54 .0" encoding="UT
        20: 46 2D 38 22 3F 3E 0D 0A 3C 54 45 53 54 3E 20 E9 F-8"?>..<TEST> é
        30: 20 3C 2F 54 45 53 54 3E 0D 0A                    </TEST>..

        NotePad++

        3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 46 2d 38 22 3f 3e 0d 0a 3c 54 45 53 54 3e 20 ef bf bd 20 3c 2f 54 45 53 54 3e 0d 0a

        Here's the output after inserting the BOM with the UTF8BOM Perl package, using

        UTF8BOM->insert_into_file('c:\\accentTestOutPut.xml');

        You can see the BOM code at the beginning of the file:

        TextPad

         0: EF BB BF 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E <?xml version
        10: 3D 22 31 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D ="1.0" encoding=
        20: 22 55 54 46 2D 38 22 3F 3E 0D 0A 3C 54 45 53 54 "UTF-8"?>..<TEST
        30: 3E 20 E9 20 3C 2F 54 45 53 54 3E 0D 0A          > é </TEST>..

        NotePad++

        ef bb bf 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 46 2d 38 22 3f 3e 0d 0a 3c 54 45 53 54 3e 20 e9 20 3c 2f 54 45 53 54 3e 0d 0a

        I'm at the edge of what I know, so I don't really know where to go from here. I appreciate the help you've given. Any other ideas? If I've left out some info that may be useful, let me know.
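
Since the two editors report different bytes for the same character, it may help to dump the file with Perl itself, on the assumption that an editor's hex view can reflect its own interpretation of the file rather than the bytes on disk. The following is only a sketch: hex_lines is a made-up helper that formats a byte string roughly like hexdump -C, and the script reads the file named on the command line through the ":raw" layer so that nothing is decoded or translated on the way in.

```perl
use strict;
use warnings;

# Hypothetical helper: format a byte string as hexdump-style lines,
# 16 bytes per line, with non-printable bytes shown as "." on the right.
sub hex_lines {
    my ($bytes) = @_;
    my @lines;
    for (my $offset = 0; $offset < length($bytes); $offset += 16) {
        my $chunk = substr($bytes, $offset, 16);
        my $hex   = join ' ', map { sprintf('%02x', ord($_)) } split //, $chunk;
        (my $text = $chunk) =~ tr/\x20-\x7e/./c;
        push @lines, sprintf('%08x  %-47s  |%s|', $offset, $hex, $text);
    }
    return @lines;
}

# Read the file named on the command line as raw bytes and dump it.
if (@ARGV) {
    my $path = shift @ARGV;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    print "$_\n" for hex_lines($bytes);
}
```

Saved under a name of your choosing (say dumpbytes.pl, also an assumption), it could be run as perl dumpbytes.pl accentTestOutPut.xml. A correctly encoded file would show c3 a9 for the e-acute, and ef bb bf at the very start if a BOM was written.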