PerlMonks  

Problem upgrading XML::Fast from 0.11 to 0.17

by mje (Curate)
on Sep 21, 2017 at 09:33 UTC

mje has asked for the wisdom of the Perl Monks concerning the following question:

I've been using XML::Fast to process XML files successfully for some time. However, the code was moved to a newer machine and has stopped working in some circumstances. One difference between the machines is the XML::Fast version: 0.11 on the original machine (working) and 0.17 on the new machine (not working). When I upgrade to 0.17 on the old machine and change nothing else, it stops working there too.

The error I'm getting is:

    Failed to encode 2017-9-21T08-49-17.XML to JSON for indexing - malformed or illegal unicode character in string [�ndby IF], cannot convert to JSON at xx.pm line 1827.

The XML file comes from a 3rd party and is ISO-8859-1 encoded. The bit it is complaining about is <Value>Br<F8>ndby IF</Value>. A cut down version of the XML which fails is:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <xx feedtype="delta"><Timestamp CreatedTime="2017-09-21T06:49:17" TimeZone="GMT"/><Value>Brøndby IF</Value></xx>

The code which is now failing is:

    use Cpanel::JSON::XS;
    use XML::Fast;

    sub esIndexFile2 {
        my ($self, $file) = @_;

        my $xml = do {
            local $/ = undef;
            open (my $fh, "<:encoding(ISO-8859-1)", $file) or die "Failed to open $file - $!";
            <$fh>;
        };
        $xml =~ s/^(?:.*\n)//;    # remove first line - the encoding line

        my $hash;
        eval {
            $hash = xml2hash $xml;
        };
        if (my $ev = $@) {
            warn("Failed to parse file $file for indexing - $@ - SKIPPING");
            return;
        }

        my $json = eval {
            encode_json($hash);    # <------------ fails here
        };
        if (my $ev = $@) {
            $self->logwarn("Failed to encode $file to JSON for indexing - $@ - SKIPPING");
            return;
        }
        return 1;
    }

The changes file for XML::Fast is not too helpful. I have discovered that adding utf8decode => 1 to the xml2hash call makes it work now, but I don't really understand why. Am I doing anything wrong here? What might have changed in XML::Fast to cause this to happen?

Replies are listed 'Best First'.
Re: Problem upgrading XML::Fast from 0.11 to 0.17
by ablanke (Monsignor) on Sep 22, 2017 at 08:50 UTC
    Hi mje,

    maybe this diff is helpful.

    there is a change in the xs file:

        -SV*
        -_xml2hash(xml,conf)
        -        char *xml;
        +void
        +_xml2hash(xml_sv,conf)
        +        SV *xml_sv;
                 HV *conf;
             PROTOTYPE: $$
        -    CODE:
        +    PPCODE:
        +        SvGETMAGIC(xml_sv);
        +        char *xml = SvPVbyte_nolen(xml_sv);

    After all, without the utf8decode option, xml2hash marks your hash value as utf8.

      SvPVbyte_nolen already handles magic, so the SvGETMAGIC(xml_sv) is superfluous and should be eliminated. Otherwise, this change does indeed fix a bug in the module.

        LOL! Just noticed I submitted the change, including the superfluous SvGETMAGIC(xml_sv). I'll let the maintainer know.

      Thanks for that ablanke.

      I don't understand this change. I've read in a correctly ISO-8859-1 encoded file and used the same encoding layer when opening it. I then pass it to xml2hash and get rubbish out unless I set this ambiguous utf8decode option documented as "Force decoding of utf8 sequences, instead of just upgrading them (may be useful for broken xml)". My XML is not broken.

        $ od -c xx.XML
        0000000   <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
        0000020   .   0   "       e   n   c   o   d   i   n   g   =   "   I   S
        0000040   O   -   8   8   5   9   -   1   "   ?   >  \n   <   x   x
        0000060   f   e   e   d   t   y   p   e   =   "   d   e   l   t   a   "
        0000100   >   <   T   i   m   e   s   t   a   m   p       C   r   e   a
        0000120   t   e   d   T   i   m   e   =   "   2   0   1   7   -   0   9
        0000140   -   2   1   T   0   6   :   4   9   :   1   7   "       T   i
        0000160   m   e   Z   o   n   e   =   "   G   M   T   "   /   >   <   V
        0000200   a   l   u   e   >   B   r 370   n   d   b   y       I   F   <
        0000220   /   V   a   l   u   e   >   <   /   x   x   >  \n
        0000235

      Note the 370 octal which is F8 and "ø". This also looks correct to me:

        $ perl -le 'use 5.016;use Devel::Peek; my $xml = do {local $/ = undef; open(my $fh, "<:encoding(ISO-8859-1)", "xx.XML"); <$fh>;}; say Dump($xml);'
        SV = PV(0x1a25da0) at 0x1a3a010
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK,UTF8)
          PV = 0x1c22ea0 "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<xx feedtype=\"delta\"><Timestamp CreatedTime=\"2017-09-21T06:49:17\" TimeZone=\"GMT\"/><Value>Br\303\270ndby IF</Value></xx>\n"\0 [UTF8 "<?xml version="1.0" encoding="ISO-8859-1"?>\n<xx feedtype="delta"><Timestamp CreatedTime="2017-09-21T06:49:17" TimeZone="GMT"/><Value>Br\x{f8}ndby IF</Value></xx>\n"]
          CUR = 158
          LEN = 160
          COW_REFCNT = 1
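The Dump output above shows "ø" stored as the two octets \303\270 with the UTF8 flag set, while the XS change quoted earlier reads the string through SvPVbyte, which asks Perl for the downgraded (one octet per character) form. A minimal sketch of those two internal forms, using only Perl's built-in utf8:: functions; the variable names are illustrative and nothing here calls XML::Fast:

```perl
use strict;
use warnings;

# The same character string held in Perl's two internal forms.
my $upgraded = "Br\xF8ndby IF";
utf8::upgrade($upgraded);          # stored internally as UTF-8, UTF8 flag on

my $downgraded = $upgraded;
utf8::downgrade($downgraded);      # stored as one octet per character, flag off

# Octets of each internal form, shown as hex.
my $utf8_octets = $upgraded;
utf8::encode($utf8_octets);        # ø becomes C3 B8, matching \303\270 in the dump
my $utf8_hex   = unpack 'H*', $utf8_octets;
my $latin1_hex = unpack 'H*', $downgraded;   # ø is the single octet F8

printf "flag on:  %s\n", $utf8_hex;
printf "flag off: %s\n", $latin1_hex;
printf "equal as Perl strings: %s\n", ($upgraded eq $downgraded ? 'yes' : 'no');
```

Both variables compare equal in Perl; only code that looks at the raw buffer, such as an XS parser, sees a difference between them.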
Re: Problem upgrading XML::Fast from 0.11 to 0.17 (updated)
by haukex (Archbishop) on Sep 22, 2017 at 08:58 UTC
    I have discovered that adding utf8decode => 1 to the xml2hash call makes it work now, but I don't really understand why. Am I doing anything wrong here? What might have changed in XML::Fast to cause this to happen?

    The changelog isn't really verbose, but I suspect it might be #71532 which was marked as fixed recently. There is also #71533 which is still open, and which indicates that the module might have issues with Unicode data in general. I haven't investigated deeper than that*, but if you have verified with a hex dump that your input file is indeed ISO-8859-1, where "ø" should be represented by 0xF8, and you have verified with Devel::Peek that the Perl variable contains the correct data² ("ø" is Unicode U+00F8), then the error message "[�ndby IF]", where � is the Unicode character "Replacement Character" (U+FFFD), indicates that there is indeed some strange Unicode stuff going on in the module, since you are opening the file with the correct encoding².

    * Update before posting: I see ablanke has investigated the code.

    ² Update: Actually, ikegami is right that the XML parser should be doing the decoding.

Re: Problem upgrading XML::Fast from 0.11 to 0.17
by ikegami (Patriarch) on Sep 22, 2017 at 17:18 UTC

    This is wrong:

        my $xml = do {
            local $/ = undef;
            open (my $fh, "<:encoding(ISO-8859-1)", $file) or die "Failed to open $file - $!";
            <$fh>;
        };

    It should be:

        my $xml = do {
            local $/ = undef;
            open (my $fh, "<:raw", $file) or die "Failed to open $file - $!";
            <$fh>;
        };

    XML files are binary files (parsing the document is required to determine the encoding), not text files (files where the encoding is external to the document). It is the parser's job to handle decoding.
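To illustrate what "the parser handles decoding" means, here is a rough sketch that takes raw octets (as a `<:raw` handle would return them), reads the encoding name from the XML declaration, and decodes with the core Encode module. The helper name decode_xml_octets is hypothetical, and the regex is a simplification: a real parser also deals with byte order marks, UTF-16 and other cases this ignores.

```perl
use strict;
use warnings;
use Encode qw(find_encoding FB_CROAK);

# Hypothetical helper: decode raw XML octets using the encoding named in
# the XML declaration, falling back to UTF-8 when none is declared.
sub decode_xml_octets {
    my ($octets) = @_;    # a copy, so in-place changes made by decode() are harmless

    my $name = 'UTF-8';
    if ($octets =~ /\A<\?xml\b[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1/) {
        $name = $2;
    }

    my $enc = find_encoding($name)
        or die "Unknown encoding '$name'";
    return $enc->decode($octets, FB_CROAK);    # die on malformed input
}

# ISO-8859-1 octets with a matching declaration decode cleanly.
my $latin1_doc = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<Value>Br\xF8ndby IF</Value>};
my $text = decode_xml_octets($latin1_doc);
printf "decoded %d characters\n", length $text;
```

With the declaration stripped, the same octets would be treated as UTF-8 and the call would die, because a lone F8 octet is not valid UTF-8.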

      Hi,
      It is the parser's job to handle decoding.

      To do so, the parser needs to know the encoding of the XML.

      The XML declaration (<?xml version="1.0" encoding="ISO-8859-1"?>) does provide that information for the parser.

      $xml =~ s/^(?:.*\n)//;    # remove first line - the encoding line

      By removing the XML declaration, the parser uses the default encoding (originally written as: "seems to guess the (wrong) source encoding").*

      With XML declaration your code seems to work correctly. Please notice that XML::Fast upgrades the data to utf8. But now in the correct manner.

      *updated

        By removing the XML declaration the Parser seems to guess the (wrong) source encoding.

        There's no guessing involved. If there's no encoding specified, then it must be UTF-8 to be valid XML.
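A small sketch of why that default matters for this file, using the core Encode module rather than XML::Fast itself: the ISO-8859-1 octet F8 is not valid UTF-8, so a lenient decoder substitutes U+FFFD (the "�" seen in the error message) and a strict one refuses the input. How XML::Fast reacts internally is not shown here.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

my $latin1_octets = "Br\xF8ndby IF";    # ø as the single ISO-8859-1 octet F8

# Lenient decoding (no CHECK argument) substitutes U+FFFD for malformed input.
my $lenient = decode('UTF-8', $latin1_octets);

# Strict decoding dies instead. decode() may modify its source when a
# CHECK value is given, so it works on a copy.
my $copy = $latin1_octets;
my $strict_ok = eval { decode('UTF-8', $copy, FB_CROAK); 1 };

printf "lenient result contains U+FFFD: %s\n", ($lenient =~ /\x{FFFD}/ ? 'yes' : 'no');
printf "strict decode succeeded: %s\n", ($strict_ok ? 'yes' : 'no');
```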

        See my reply to ikegami. I was opening the file with ISO-8859-1 and removing the XML encoding line because of previous bugs in XML::Fast.

      Thanks ikegami. The reason I was doing it the way I was is that in 0.11 of XML::Fast I could only get correctly decoded content by a) opening the file with ISO-8859-1 and b) removing the encoding line. I presume this was a bug in XML::Fast which is fixed now. Characters were being double encoded, e.g., "Stade Gaston G\x{e9}rard" ended up being "Stade Gaston G\x{c3}\x{a9}rard".

      When I accidentally upgraded to 0.17 I forgot this was a workaround for problems in 0.11.
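The double encoding described above can be reproduced with the core Encode module alone. This is only a sketch of the general pattern (UTF-8 octets reinterpreted as ISO-8859-1 characters), not a claim about what XML::Fast 0.11 did internally:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "Stade Gaston G\x{e9}rard";        # é is one character, U+00E9

my $utf8_octets = encode('UTF-8', $chars);     # é becomes the two octets C3 A9

# The bug pattern: UTF-8 octets are read as ISO-8859-1, so each octet
# turns into a character of its own.
my $double_encoded = decode('ISO-8859-1', $utf8_octets);

# Decoding with the encoding that was actually used restores the text.
my $round_trip = decode('UTF-8', $utf8_octets);

printf "original:       %d characters\n", length $chars;
printf "double encoded: %d characters\n", length $double_encoded;
printf "round trip ok:  %s\n", ($round_trip eq $chars ? 'yes' : 'no');
```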
