Re^2: XML:: DOM and Accented Characters

Re^3: XML:: DOM and Accented Characters by almut (Canon) on Aug 06, 2010 at 14:36 UTC
What exactly is the problem? Does it still write `e9` instead of `c3 a9`? (Hard to believe that this would be different on Windows...) Maybe you need to add a BOM?
`open my $fh, ">:utf8", "accentTestOutPut.xml" or die $!;`
`print $fh "\x{feff}"; # BOM`
`$doc->print($fh);`
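A minimal sketch of how the suggestion above could be checked without relying on an editor's hex view. The XML::DOM document from the thread is not shown, so a literal string stands in for `$doc->print($fh)`, and the file name `bom_sketch.xml` is hypothetical. It writes through an encoding layer, then reads the file back as raw bytes.

```perl
use strict;
use warnings;

# Hypothetical file name; the thread's real output file is accentTestOutPut.xml.
my $file = "bom_sketch.xml";

# Write through an encoding layer so characters become UTF-8 bytes on disk.
open my $out, ">:encoding(UTF-8)", $file or die "Cannot write $file: $!";
print $out "\x{feff}";                   # BOM
print $out "<TEST> \x{e9} </TEST>\n";    # stand-in for $doc->print($out)
close $out or die "Cannot close $file: $!";

# Read the file back as raw bytes, with no decoding at all.
open my $in, "<:raw", $file or die "Cannot read $file: $!";
my $bytes = do { local $/; <$in> };
close $in;

# Print every byte as two hex digits.
my $hex = join " ", map { sprintf "%02x", ord } split //, $bytes;
print "$hex\n";
```

If the layer is applied, the dump should start with `ef bb bf` and contain `c3 a9` for the e-acute. A bare `e9` would suggest the character went out through a handle without the layer.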
Re^3: XML:: DOM and Accented Characters by graff (Chancellor) on Aug 06, 2010 at 17:03 UTC
I think almut hit the mark: MS-Windows apps like wordpad, notepad, etc. all depend on having a file-initial byte-order-mark, expressed as the 3-byte utf8 rendering of the code point "U+FEFF", to serve as a sort of "magic number" so that the app "knows" the file contains utf8 data.
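The "3-byte utf8 rendering" of U+FEFF can be seen directly with the core Encode module. This is only an illustration of the byte values; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode the single character U+FEFF into UTF-8 bytes.
my $bom_bytes = encode("UTF-8", "\x{feff}");

# Show each byte as two hex digits.
my $bom_hex = join " ", map { sprintf "%02X", ord } split //, $bom_bytes;
print "$bom_hex\n";    # EF BB BF
```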
Re^4: XML:: DOM and Accented Characters by freeflyer (Novice) on Aug 07, 2010 at 10:09 UTC
Thanks for the help, but I'm still unable to get it to work even after adding the BOM, although I am learning along the way. I'm now using both TextPad and NotePad++ (with plugin) to view the codes for the output file (accentTestOutput.xml). I've also run it on both my work and home PCs, both running Windows.
After running the code provided by almut I'm still not seeing C3 A9 as the hex code for the e-acute. TextPad is displaying an E9 code and NotePad++ EF BF BD. It also looks as if the BOM is not there: I am unable to see the code EF BB BF at the start of the file (which is what I should see, right?).
Using the package UTF8BOM to insert the BOM, I can see the BOM is there in both cases (TextPad and NotePad++), since EF BB BF appears at the start of the file. However, both programs now display E9 as the code for the e-acute, not the C3 A9 I'm looking for.
Incidentally, at no point have I been able to open the output file in Internet Explorer. It complains of an invalid character at the point of the e-acute.
Here's the output after trying to insert the BOM using `print $fh "\x{feff}";`
TextPad
`0: 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 <?xml version="1`
`10: 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54 .0" encoding="UT`
`20: 46 2D 38 22 3F 3E 0D 0A 3C 54 45 53 54 3E 20 E9 F-8"?>..<TEST> é`
`30: 20 3C 2F 54 45 53 54 3E 0D 0A </TEST>..`
NotePad++
`3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 46 2d 38 22 3f 3e 0d 0a 3c 54 45 53 54 3e 20 ef bf bd 20 3c 2f 54 45 53 54 3e 0d 0a`
Here's the output after trying to insert the BOM using the UTF8BOM Perl package, with `UTF8BOM->insert_into_file('c:\\accentTestOutPut.xml');` You can see the BOM code at the beginning of the file.
TextPad
`0: EF BB BF 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E ï»¿<?xml version`
`10: 3D 22 31 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D ="1.0" encoding=`
`20: 22 55 54 46 2D 38 22 3F 3E 0D 0A 3C 54 45 53 54 "UTF-8"?>..<TEST`
`30: 3E 20 E9 20 3C 2F 54 45 53 54 3E 0D 0A > é </TEST>..`
NotePad++
`ef bb bf 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 46 2d 38 22 3f 3e 0d 0a 3c 54 45 53 54 3e 20 e9 20 3c 2f 54 45 53 54 3e 0d 0a`
I'm at the edge of what I know, so I don't really know where to go from here. I appreciate the help you've given. Any other ideas? If I've missed out some info that may be useful, let me know.
Re^5: XML:: DOM and Accented Characters by Pickwick (Beadle) on Aug 07, 2010 at 15:26 UTC
"After running the code provided by almut I'm still not seeing C3 A9 as the hex code for the e-acute. TextPad is displaying an E9 code and NotePad++ EF BF BD."
E9 for your character is windows-1252 according to Wikipedia, which would mean that the Perl I/O layer converts your parsed UTF-8 string into windows-1252 and ignores the `>:utf8`. Maybe you should post your complete code where you parse and save the XML.
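To make the byte values in this exchange concrete, here is a small sketch using the core Encode module. It encodes the e-acute (U+00E9) three ways, then decodes a lone E9 byte as if it were UTF-8. The second part is only a guess at what NotePad++ is doing when it shows EF BF BD; the variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $e_acute = "\x{e9}";    # the character U+00E9, e-acute

# Encode the same character under three encodings and record the bytes.
my %hex_for;
for my $enc ("UTF-8", "cp1252", "iso-8859-1") {
    my $bytes = encode($enc, $e_acute);
    $hex_for{$enc} = join " ", map { sprintf "%02X", ord } split //, $bytes;
    print "$enc: $hex_for{$enc}\n";
}

# A lone E9 byte is not valid UTF-8. With Encode's default handling,
# the malformed byte is replaced by U+FFFD (REPLACEMENT CHARACTER).
my $decoded = decode("UTF-8", "\xE9");
my $replacement_hex = join " ", map { sprintf "%02X", ord } split //, encode("UTF-8", $decoded);
print "lone E9 read as UTF-8, then re-encoded: $replacement_hex\n";
```

UTF-8 gives C3 A9, while both cp1252 and iso-8859-1 give E9, so a single E9 in the file does not by itself say which of the two single-byte encodings was used. The last line prints EF BF BD, the UTF-8 form of U+FFFD.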
Re^6: XML:: DOM and Accented Characters by freeflyer (Novice) on Aug 07, 2010 at 18:07 UTC
Re^7: XML:: DOM and Accented Characters by Pickwick (Beadle) on Aug 08, 2010 at 12:59 UTC
Some notes below your chosen depth have not been shown here
Re^5: XML:: DOM and Accented Characters by Anonymous Monk on Aug 07, 2010 at 11:41 UTC
"not the C3 A9 I'm looking for"
Then you're not looking for UTF-8!!!!!
`$ perl -e"print qq!\x{C3A9}!"`
`Wide character in print at -e line 1.`
`∞Ä⌐`
`$ perl -Mopen=:std,:encoding(UTF-8) -e"print qq!\x{C3A9}!" |hexdump`
`00000000: EC 8E A9 - | |`
`00000003;`
`$ perl -Mopen=:std,:encoding(UTF-16LE) -e"print qq!\x{C3A9}!" |hexdump`
`00000000: A9 C3 - | |`
`00000002;`
`$ perl -Mopen=:std,:encoding(UTF-16BE) -e"print qq!\x{C3A9}!" |hexdump`
`00000000: C3 A9 - | |`
`00000002;`
`$`
UTF-16BE shows C3A9, and it is not UTF-8 as `encoding="UTF-8"?` claims.
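When reading the session above, it may help to keep code points and byte sequences apart. In Perl, `\x{C3A9}` names the single character U+C3A9, not the two bytes C3 A9, whereas the e-acute discussed earlier in the thread is U+00E9. The sketch below encodes both for comparison; `hex_bytes` is a helper made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join " ", map { sprintf "%02X", ord } split //, $bytes;
}

my $utf8_of_e9    = hex_bytes(encode("UTF-8",    "\x{e9}"));      # e-acute, U+00E9
my $utf8_of_c3a9  = hex_bytes(encode("UTF-8",    "\x{c3a9}"));    # a different character, U+C3A9
my $utf16_of_c3a9 = hex_bytes(encode("UTF-16BE", "\x{c3a9}"));

print "U+00E9 as UTF-8:    $utf8_of_e9\n";       # C3 A9
print "U+C3A9 as UTF-8:    $utf8_of_c3a9\n";     # EC 8E A9
print "U+C3A9 as UTF-16BE: $utf16_of_c3a9\n";    # C3 A9
```

The bytes C3 A9 appear in two places: as the UTF-8 form of U+00E9 and as the UTF-16BE form of U+C3A9. The hexdumps in the post correspond to the second and third lines here.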
Re^6: XML:: DOM and Accented Characters by freeflyer (Novice) on Aug 07, 2010 at 12:14 UTC
Re^7: XML:: DOM and Accented Characters by Anonymous Monk on Aug 07, 2010 at 12:25 UTC
Some notes below your chosen depth have not been shown here


	PerlMonks