PerlMonks  

Re^2: XML::DOM and Accented Characters

by freeflyer (Novice)
on Aug 06, 2010 at 14:12 UTC ( [id://853417]=note )


in reply to Re: XML::DOM and Accented Characters
in thread XML::DOM and Accented Characters

Thanks almut

I've got that working perfectly on a Unix machine, but I can't for the life of me get it to work under Windows. Unfortunately, the machine where this has to run is a Windows machine.

Any ideas?

Re^3: XML::DOM and Accented Characters
by almut (Canon) on Aug 06, 2010 at 14:36 UTC

    What exactly is the problem? Does it still write e9 instead of c3 a9? (It's hard to believe that this would be different on Windows...)

    Maybe you need to add a BOM?

    open my $fh, ">:utf8", "accentTestOutPut.xml" or die $!;
    print $fh "\x{feff}"; # BOM
    $doc->print($fh);

Re^3: XML::DOM and Accented Characters
by graff (Chancellor) on Aug 06, 2010 at 17:03 UTC
    I think almut hit the mark: MS-Windows apps like wordpad, notepad, etc, all depend on having a file-initial byte-order-mark, expressed as the 3-byte utf8 rendering of the code point "U+FEFF", to serve as a sort of "magic number" so that the app "knows" the file contains utf8 data.
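
To make the "3-byte utf8 rendering" concrete, here is a minimal sketch, assuming only the core Encode module. The helper name hex_bytes is made up for illustration. It encodes U+FEFF as UTF-8 and prints the resulting bytes in hex:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# U+FEFF encoded as UTF-8 is the three-byte BOM.
my $bom = encode('UTF-8', "\x{FEFF}");
print hex_bytes($bom), "\n";   # prints EF BB BF
```

Those are the three bytes a hex viewer should show at offset 0 of a file that starts with a UTF-8 BOM.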

      Thanks for the help, but I'm still unable to get it to work even after adding the BOM, although I am learning along the way.

      I'm now using both TextPad and NotePad++ (with plugin) to view the hex codes of the output file (accentTestOutput.xml). I've also run it on both my work and home PCs, both running Windows.

      After running the code provided by almut I'm still not seeing C3 A9 as the hex code for the e-acute. TextPad displays E9 and NotePad++ displays EF BF BD. It also looks as if the BOM is not there: I am unable to see EF BB BF at the start of the file (which is what I should see, right?).

      Using the UTF8BOM package to insert the BOM, I can see that the BOM is there in both cases (TextPad and NotePad++), because EF BB BF appears at the start of the file. However, both programs now display E9 as the code for the e-acute, not the C3 A9 I'm looking for.

      Incidentally, at no point have I been able to open the output file in Internet Explorer. It complains of an invalid character at the point of the e-acute.

      Here's the output after trying to insert the BOM using

       print $fh "\x{feff}";

      TextPad

       0: 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31  <?xml version="1
      10: 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54  .0" encoding="UT
      20: 46 2D 38 22 3F 3E 0D 0A 3C 54 45 53 54 3E 20 E9  F-8"?>..<TEST> é
      30: 20 3C 2F 54 45 53 54 3E 0D 0A                     </TEST>..

      NotePad++

      3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 46 2d 38 22 3f 3e 0d 0a 3c 54 45 53 54 3e 20 ef bf bd 20 3c 2f 54 45 53 54 3e 0d 0a

      Here's the output after inserting the BOM with the UTF8BOM Perl package, using

      UTF8BOM->insert_into_file('c:\\accentTestOutPut.xml');

      You can see the BOM code at the beginning of the file

      TextPad

       0: EF BB BF 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E  <?xml version
      10: 3D 22 31 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D  ="1.0" encoding=
      20: 22 55 54 46 2D 38 22 3F 3E 0D 0A 3C 54 45 53 54  "UTF-8"?>..<TEST
      30: 3E 20 E9 20 3C 2F 54 45 53 54 3E 0D 0A           > é </TEST>..

      NotePad++

      ef bb bf 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 46 2d 38 22 3f 3e 0d 0a 3c 54 45 53 54 3e 20 e9 20 3c 2f 54 45 53 54 3e 0d 0a

      I'm at the edge of what I know, so I don't really know where to go from here. I appreciate the help you've given. Any other ideas? If I've left out some info that may be useful, let me know.
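
Since the two editors report different bytes for the same file (EF BF BD is the UTF-8 encoding of U+FFFD, the replacement character, so one of them may be showing a converted buffer rather than the raw file), one way to take the editors out of the picture is to dump the bytes from Perl itself. The following is only a sketch: the helper name file_hex is hypothetical, and a temporary file stands in for accentTestOutPut.xml so that the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a file with no decoding layer and
# return its bytes as space-separated lowercase hex pairs.
sub file_hex {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                      # slurp the whole file
    my $bytes = <$in>;
    close $in;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# Stand-in for the real output file: a BOM followed by "<TEST> é </TEST>",
# written through an explicit UTF-8 encoding layer.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':encoding(UTF-8)';
print {$tmp} "\x{FEFF}<TEST> \x{E9} </TEST>";
close $tmp;

print file_hex($path), "\n";
# prints ef bb bf 3c 54 45 53 54 3e 20 c3 a9 20 3c 2f 54 45 53 54 3e
```

Calling file_hex on the real output file would show whether the e-acute is stored as e9 or as c3 a9, independently of what either editor displays.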

        After running the code provided by almut I'm still not seeing C3 A9 as the hex code for the e-acute. TextPad is displaying an E9 code and NotePad++ EF BF BD.

        E9 is the windows-1252 encoding of your character according to Wikipedia (it is the same byte in Latin-1), which would mean that the Perl I/O layer is converting your parsed UTF-8 string to windows-1252 and ignoring the >:utf8. Maybe you should post your complete code, including where you parse and save the XML.
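
To see which bytes each I/O layer produces for é, here is a small sketch that writes the character to in-memory files opened with different layers. It says nothing about what XML::DOM does internally. The helper name bytes_written is hypothetical, and the plain '>' case assumes that no default layers are set through PERL_UNICODE or the open pragma.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text to an in-memory file opened with $mode
# and return the bytes that ended up in the buffer, as hex pairs.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open my $out, $mode, \$buffer or die "Cannot open in-memory file: $!";
    print {$out} $text;
    close $out;                    # flush the layer before inspecting the buffer
    return join ' ', map { sprintf '%02X', ord } split //, $buffer;
}

my $e_acute = "\x{E9}";            # the single character é (U+00E9)

for my $mode ('>', '>:encoding(UTF-8)', '>:encoding(cp1252)') {
    printf "%-20s %s\n", $mode, bytes_written($mode, $e_acute);
}
# >                    E9
# >:encoding(UTF-8)    C3 A9
# >:encoding(cp1252)   E9
```

So an E9 in the file is consistent with either a windows-1252 layer or no encoding layer at all on the handle that actually received the data.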

        not the C3 A9 I'm looking for

        Then you're not looking for UTF-8!!!!!

        $ perl -e"print qq!\x{C3A9}!"
        Wide character in print at -e line 1.
        ∞Ä⌐
        $ perl -Mopen=:std,:encoding(UTF-8) -e"print qq!\x{C3A9}!" |hexdump
        00000000: EC 8E A9 - | |
        00000003;
        $ perl -Mopen=:std,:encoding(UTF-16LE) -e"print qq!\x{C3A9}!" |hexdump
        00000000: A9 C3 - | |
        00000002;
        $ perl -Mopen=:std,:encoding(UTF-16BE) -e"print qq!\x{C3A9}!" |hexdump
        00000000: C3 A9 - | |
        00000002;
        $

        UTF-16BE shows C3 A9, and that is not UTF-8 as encoding="UTF-8" claims.
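
One thing to keep in mind when reading the one-liners above: \x{C3A9} is the single code point U+C3A9, whereas the e-acute is U+00E9. As a cross-check, here is a minimal sketch, assuming only the core Encode module (the helper name utf8_hex is made up for illustration), that compares the UTF-8 bytes of the two code points:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: UTF-8 encode a character string and
# return the resulting bytes as space-separated hex pairs.
sub utf8_hex {
    my ($chars) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, encode('UTF-8', $chars);
}

print 'U+00E9 -> ', utf8_hex("\x{E9}"),   "\n";   # prints U+00E9 -> C3 A9
print 'U+C3A9 -> ', utf8_hex("\x{C3A9}"), "\n";   # prints U+C3A9 -> EC 8E A9
```

By this check, C3 A9 is the byte pair a UTF-8 file should contain for é, while EC 8E A9 belongs to a different character.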
