PerlMonks  

Re^3: XML::Simple parser error : Input is not proper UTF-8, indicate encoding

by BrowserUk (Pope)
on Aug 10, 2012 at 13:46 UTC (#986749)


in reply to Re^2: XML::Simple parser error : Input is not proper UTF-8, indicate encoding
in thread XML::Simple parser error : Input is not proper UTF-8, indicate encoding

Hm... maybe you need to update your copy of xmllint?

"XML 1.1 extends the set of allowed characters to include all the above, plus the remaining characters in the range U+0001–U+001F. At the same time, however, it restricts the use of C0 and C1 control characters other than U+0009, U+000A, U+000D, and U+0085 by requiring them to be written in escaped form (for example U+0001 must be written as &#x01; or its equivalent). In the case of C1 characters, this restriction is a backwards incompatibility; it was introduced to allow common encoding errors to be detected."

From what I can make out, an encoding header is both obligatory and needed to make sense of how entities should be interpreted.
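As a rough illustration of the rule quoted above, here is a minimal sketch in core Perl. The name `escape_xml11_restricted` is made up for this example and is not part of XML::Simple or any other module. It assumes the input is already a decoded character string, and it rewrites the control characters that XML 1.1 restricts as numeric character references while leaving U+0009, U+000A, U+000D and U+0085 alone. It does nothing about U+0000, which cannot be written in any form.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any CPAN module): escape the C0/C1
# control characters that XML 1.1 only accepts in escaped form.
# TAB (U+0009), LF (U+000A), CR (U+000D) and NEL (U+0085) are left as-is.
sub escape_xml11_restricted {
    my ($text) = @_;
    $text =~ s/([\x{01}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{7F}-\x{84}\x{86}-\x{9F}])/sprintf '&#x%02X;', ord $1/ge;
    return $text;
}

# U+000E is escaped, the tab is kept literally.
print escape_xml11_restricted("a\x{0E}b\tc"), "\n";
```

Whether a given parser then accepts the result is a separate question; the sketch only shows what "written in escaped form" means.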


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^4: XML::Simple parser error : Input is not proper UTF-8, indicate encoding
by daxim (Chaplain) on Aug 10, 2012 at 13:49 UTC
    The OP, like the rest of the world, is using XML 1.0. XML 1.1 made too little progress and gained no adoption.

    Edited to add: nah, I'm good.

    $ rpm -qf `which xmllint`
    libxml2-2.7.8+git20110708-3.8.1.x86_64
      The OP, ..., is using XML 1.0.

      If we're being pedantic, the OP's problem is that he isn't using any form of XML!

      But if he decides to do so, he can make up his own mind about which standard he chooses, because -- despite what the "rest of the world" is using -- the tools support it (even without a header!):

      #! perl -slw
      use strict;
      use Data::Dump qw[ pp ];
      use XML::Simple;

      my $xml = XMLin( \*DATA );
      pp $xml;

      __DATA__
      <EVENT>
      <CALLDETAILS>
      <STATIONID>01</STATIONID>
      <CALLSESSIONID>00000000020712130852059</CALLSESSIONID>
      <EXTENSIONNO>8143</EXTENSIONNO>
      <ZIVAHCHANNELID>172.16.39.88</ZIVAHCHANNELID>
      <SUBCHANNELID>0</SUBCHANNELID>
      <AGENTID>NULL</AGENTID>
      <CALLERID>&#xA0;jW&#xB7;h&#xAE;&#xF5;&#xBF;&#x8A;7a&#xB7;&#xD8;T&#xD9;^N</CALLERID>
      <CALLEEID>NULL</CALLEEID>
      <CALLTYPE>IN</CALLTYPE>
      <RINGCOUNT>1</RINGCOUNT>
      <CALLTERMSTATUS>NO_CTI_DATA</CALLTERMSTATUS>
      </CALLDETAILS>
      </EVENT>

      Produces:

      [14:58:34.75] C:\test>xmlent.pl
      {
        CALLDETAILS => {
          AGENTID => "NULL",
          CALLEEID => "NULL",
          CALLERID => pack("H*","a06a57b768aef5bf8a3761b7d854d95e4e"),
          CALLSESSIONID => "00000000020712130852059",
          CALLTERMSTATUS => "NO_CTI_DATA",
          CALLTYPE => "IN",
          EXTENSIONNO => 8143,
          RINGCOUNT => 1,
          STATIONID => "01",
          SUBCHANNELID => 0,
          ZIVAHCHANNELID => "172.16.39.88",
        },
      }


        nvivek arrived at this odd notation (angle brackets, and ^ followed by a letter) by displaying the file in vi. Your example program is wrong: it contains a literal ^ and N, because you neglected to replace that notation with the original character. When this is corrected, the program predictably bombs out with:
        $ perl pm986755.pl
        Entity: line 9: parser error : PCDATA invalid Char value 14
        <CALLERID>&#xA0;jW&#xB7;h&#xAE;&#xF5;&#xBF;&#x8A;7a&#xB7;&#xD8;T&#xD9;<
                                                                              ^
        When the character is substituted with the character reference &#x0e;, it also bombs out:
        $ perl pm986755.pl
        Entity: line 9: parser error : xmlParseCharRef: invalid xmlChar value 14
        <CALLERID>&#xA0;jW&#xB7;h&#xAE;&#xF5;&#xBF;&#x8A;7a&#xB7;&#xD8;T&#xD9;&#x0e;
                                                                                   ^
        Upgrading the version in the PI to 1.1 does not help. XML::Simple, or rather its underlying modules XML::Parser/expat and XML::LibXML/libxml2, cannot deal with XML 1.1!

        Your advice was flawed from the beginning; it simply cannot work in the general case. Whatever puts control characters there is apt to also put a chr(0) character. Whether written as a plain character or as a character reference, it's illegal in all versions of XML.
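To make the point above concrete, here is a small sketch in core Perl that checks code points against the Char productions of XML 1.0 and XML 1.1. The function names are invented for this example and belong to no module, and the regex only recognises numeric character references, not named entities. It reports value 14 whether it appears raw or as &#x0e;, and it shows that chr(0) fails under both versions.

```perl
use strict;
use warnings;

# XML 1.0 Char: TAB, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
sub is_xml10_char {
    my ($cp) = @_;
    return 1 if $cp == 0x09 || $cp == 0x0A || $cp == 0x0D;
    return 1 if $cp >= 0x20    && $cp <= 0xD7FF;
    return 1 if $cp >= 0xE000  && $cp <= 0xFFFD;
    return 1 if $cp >= 0x10000 && $cp <= 0x10FFFF;
    return 0;
}

# XML 1.1 Char: U+0001-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
# U+0000 is excluded here as well.
sub is_xml11_char {
    my ($cp) = @_;
    return 1 if $cp >= 0x01    && $cp <= 0xD7FF;
    return 1 if $cp >= 0xE000  && $cp <= 0xFFFD;
    return 1 if $cp >= 0x10000 && $cp <= 0x10FFFF;
    return 0;
}

# Hypothetical helper: return the code points XML 1.0 rejects, looking at
# both raw characters and numeric character references.
sub xml10_offenders {
    my ($text) = @_;
    my @bad;
    while ( $text =~ /&#x([0-9A-Fa-f]+);|&#([0-9]+);|(.)/gs ) {
        my $cp = defined $1 ? hex $1
               : defined $2 ? $2 + 0
               :              ord $3;
        push @bad, $cp unless is_xml10_char($cp);
    }
    return @bad;
}

# Raw U+000E, then the same value as a reference, then a raw NUL.
print join( ',', xml10_offenders("T&#xD9;\x{0E}&#x0e;\x{00}") ), "\n";   # prints 14,14,0
```

A pre-check like this only detects the problem; it does not make such data legal XML 1.0.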
