http://www.perlmonks.org?node_id=1199923


in reply to Problem upgrading XML::Fast from 0.11 to 0.17

This is wrong:

my $xml = do { local $/ = undef; open (my $fh, "<:encoding(ISO-8859-1)", $file) or die "Failed to open $file - $!"; <$fh>; };

It should be:

my $xml = do { local $/ = undef; open (my $fh, "<:raw", $file) or die "Failed to open $file - $!"; <$fh>; };

XML files are binary files (parsing the document is required to determine the encoding), not text files (files where the encoding is external to the document). It is the parser's job to handle decoding.
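To make the point concrete, here is a minimal sketch using only core modules. The helpers slurp_raw and declared_encoding are hypothetical names invented for this illustration, not part of XML::Fast or any other parser, and the regex only covers ASCII-compatible encodings (a real parser also checks for byte order marks). It shows that the encoding lives inside the document, so the bytes have to reach the parser undecoded.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: slurp a file as raw bytes, with no decoding layer.
sub slurp_raw {
    my ($file) = @_;
    open(my $fh, '<:raw', $file) or die "Failed to open $file - $!";
    local $/ = undef;
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# Hypothetical helper: peek at the encoding named in the XML declaration.
# Simplified on purpose: ASCII-compatible encodings only, no BOM handling.
sub declared_encoding {
    my ($bytes) = @_;
    return $1
        if $bytes =~ /\A<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/;
    return 'UTF-8';    # default when no encoding is declared
}

# Write a small ISO-8859-1 document to a temporary file for the demo.
my ($out, $file) = tempfile(UNLINK => 1);
binmode $out;
print {$out} qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<name>G\xE9rard</name>\n};
close $out;

my $xml      = slurp_raw($file);          # still bytes, 0xE9 untouched
my $encoding = declared_encoding($xml);   # what the document says about itself
```

In real code you would hand $xml straight to the parser and let it do this work. The sketch only shows why decoding the file yourself on open makes the declaration disagree with the data.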

Re^2: Problem upgrading XML::Fast from 0.11 to 0.17
by ablanke (Monsignor) on Sep 22, 2017 at 19:26 UTC
    Hi,
    It is the parser's job to handle decoding.

    To do so, the parser needs to know the encoding of the XML.

    The XML declaration (<?xml version="1.0" encoding="ISO-8859-1"?>) does provide that information for the parser.

    $xml =~ s/^(?:.*\n)//;    # remove first line - the encoding line

    By removing the XML declaration, the parser uses the default encoding.* (This originally read "seems to guess the (wrong) source encoding".)

    With the XML declaration in place, your code seems to work correctly. Note that XML::Fast still upgrades the data to UTF-8, but now it does so correctly.

    *updated

      By removing the XML declaration the Parser seems to guess the (wrong) source encoding.

      There's no guessing involved. If no encoding is specified, the document must be UTF-8 (or UTF-16 with a byte order mark) to be valid XML.
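A small sketch of why that default matters here, using the core Encode module. It makes no claim about how XML::Fast reports the problem internally. It only shows that ISO-8859-1 bytes are not valid UTF-8, so a document that has lost its declaration can no longer be decoded correctly.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "G\xE9rard";    # "Gérard" as ISO-8859-1 bytes

# Strict UTF-8 decoding fails: 0xE9 starts a multi-byte sequence,
# but the next byte ("r") is not a continuation byte.
my $copy    = $bytes;       # FB_CROAK may modify its source argument
my $as_utf8 = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
my $utf8_ok = defined $as_utf8 ? 1 : 0;

# Decoding with the encoding the declaration would have named works.
my $as_latin1 = decode('ISO-8859-1', $bytes);
```

Here $utf8_ok ends up false, while $as_latin1 holds the six expected characters.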

      See my reply to ikegami. I was opening the file with ISO-8859-1 and removing the XML encoding line because of previous bugs in XML::Fast.

Re^2: Problem upgrading XML::Fast from 0.11 to 0.17
by mje (Curate) on Sep 25, 2017 at 11:53 UTC

    Thanks ikegami. I was doing it that way because, in XML::Fast 0.11, I could only get correctly decoded content by (a) opening the file with ISO-8859-1 and (b) removing the encoding line. I presume this was a bug in XML::Fast which is fixed now. Characters were being double encoded, e.g., "Stade Gaston G\x{e9}rard" ended up being "Stade Gaston G\x{c3}\x{a9}rard".

    When I accidentally upgraded to 0.17 I forgot this was a workaround for problems in 0.11.
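For anyone wondering what that double encoding looks like, here is a minimal sketch with the core Encode module. It reproduces the general mechanism (UTF-8 bytes treated as ISO-8859-1 characters) and is not a claim about what XML::Fast 0.11 did internally.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "Stade Gaston G\x{e9}rard";

# Encode once: the single character U+00E9 becomes the two bytes C3 A9.
my $utf8_bytes = encode('UTF-8', $chars);

# Misread those bytes as ISO-8859-1: each byte becomes its own character.
my $double = decode('ISO-8859-1', $utf8_bytes);

# Undo the damage: turn the characters back into bytes, then decode as UTF-8.
my $fixed = decode('UTF-8', encode('ISO-8859-1', $double));
```

$double is the mangled "G\x{c3}\x{a9}rard" form, and $fixed is the original string again.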