PerlMonks  

[SOLVED] XML::Twig - Parsing XML file with incorrect encoding in declaration

by ateague (Monk)
on Sep 01, 2017 at 22:31 UTC ( [id://1198553]=perlquestion )

ateague has asked for the wisdom of the Perl Monks concerning the following question:

UPDATE:
I went ahead and used the brute-force nuclear option and manually edited the declaration before passing it to XML::Twig.

open my $fh, '+<:utf8', 'file.in.xml' or die $!;
my $line = <$fh>;
$line=~s/<\?xml.+encoding="\Kutf-16"/utf-8" / or die "didn't match line: '$line'";
seek $fh,0,0 or die $!;
print $fh $line;
close $fh;

Thank you haukex for your help!




Good afternoon!

I am trying to use XML::Twig to strip out comments in a file provided by an upstream process. Unfortunately, the upstream process is incorrectly marking the encoding as "utf-16" when it is not. This causes XML::Twig (and XML::Parser) to fail with the error "encoding specified in XML declaration is incorrect at line 1, column 30, byte 30".

Is there an option in XML::Twig that can be set to "relax" the parsing to ignore the incorrect encoding specified in the declaration?

Thank you for your time.

Perl info:

perl -v
This is perl 5, version 24, subversion 0 (v5.24.0) built for MSWin32-x64-multi-thread
XML::Twig info: 3.52

Sample code:
#!/usr/bin/perl
use 5.024;
use strict;
use warnings;
use XML::Twig;

open (my $OFILE, '>:utf8', 'file.out.xml') or die "$!\n$^E";

my $t = XML::Twig->new(
    twig_handlers => {
        '/keys/key' => sub { $_[0]->flush($OFILE); },
    },
    output_encoding => 'utf-8',
    pretty_print    => 'indented',
    comments        => 'drop', # remove any comments
);

$t->safe_parse(\*DATA);
if ( $@ ) {
    die "Error occurred in XML data\n\n$@";
}
close $OFILE;

__DATA__
<?xml version="1.0" encoding="utf-16"?>
<keys>
<!-- One hen -->
<key>45646fa8-32e5-494c-93ff-0f00281fc2d6</key>
<!-- Two ducks -->
<key>b6bdc46f-3275-4312-bbbd-3e375208d05f</key>
<!-- Three squawking geese -->
<key>e5a37cf0-1f69-41a8-899c-23454600894a</key>
<!-- Four limerick oysters -->
<key>b6287f3d-f70c-498d-8360-5a2d8e863ab3</key>
<!-- Five corpulent porpoises -->
<key>118be380-5e69-47d4-81c6-756c34334936</key>
<!-- Six pair of Don Alverzo's tweezers -->
<key>46f9dd5b-d0e9-4f8f-a559-f698bea561fa</key>
<!-- Seven thousand Macedonians in full battle array -->
<key>9627058f-29f0-4263-8978-fc77ac2fe0a3</key>
<!-- Eight brass monkeys from the ancient sacred crypts of Egypt -->
<key>6038d393-ba81-423e-8429-01406779ff9e</key>
<!-- Nine apathetic, sympathetic, diabetic old men on roller skates, with a marked propensity towards procrastination and sloth -->
<key>5a67c3f0-ea6f-427c-bc3a-86fdb31fd117</key>
<!-- Ten lyrical, spherical, diabolical denizens of the deep who stalk about the corners of the cove all at the same time. -->
<key>7ac8b1d8-ff60-4b55-8fe0-ea809d9f5b02</key>
</keys>

Replies are listed 'Best First'.
Re: XML::Twig - Parsing XML file with incorrect encoding in declaration
by holli (Abbot) on Sep 02, 2017 at 06:44 UTC
    XML::Twig only looks at the XML declaration and complains about a wrong encoding if the declaration is there. If it isn't, it parses the XML just fine and doesn't care about the encoding. So if you read the first line from the handle before passing it to the parse method, you are fine. In your case:
    # ... yadda
    my $encodingLine = <DATA>;
    print "Omitting <$encodingLine>\n";
    $t->safe_parse(\*DATA);
    # bumm ...
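
A slightly more defensive sketch of the same idea, for files where the declaration might not always be present. The helper name skip_xml_declaration is hypothetical (it is not part of XML::Twig), and the sketch assumes the declaration sits alone on the first line with no byte-order mark in front of it. It only consumes the first line if that line really is an XML declaration, and rewinds otherwise. With no declaration and no BOM, an XML parser falls back to UTF-8, so this only helps if the data really is UTF-8 (or plain ASCII).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Twig): consume the first line of
# $fh only if it looks like an XML declaration; otherwise rewind so that
# no real content is lost. Returns the skipped line, or undef.
sub skip_xml_declaration {
    my ($fh) = @_;
    my $start = tell $fh;
    my $line  = <$fh>;
    return $line if defined $line && $line =~ /\A\s*<\?xml\b[^>]*\?>\s*\z/;
    seek $fh, $start, 0 or die "cannot rewind: $!";
    return;
}

# Demonstration on an in-memory handle standing in for the real file.
my $xml = qq{<?xml version="1.0" encoding="utf-16"?>\n<keys>\n<key>1</key>\n</keys>\n};
open my $fh, '<', \$xml or die $!;

my $skipped = skip_xml_declaration($fh);

# In the real script the handle would now go to $t->safe_parse($fh);
# here the remainder is just read back to show what the parser would see.
my $rest = do { local $/; <$fh> };
print "skipped: $skipped";
print "remaining:\n$rest";
```
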


    holli

    You can lead your users to water, but alas, you cannot drown them.
Re: XML::Twig - Parsing XML file with incorrect encoding in declaration
by haukex (Archbishop) on Sep 02, 2017 at 08:38 UTC

    Although I normally don't like editing XML files with anything other than a proper module, in this case it might be appropriate. Also, luckily, the <?xml...?> declaration happens right at the top of the file. The following will work if the only change you are making is "utf-16" to "utf-8": the replacement is padded with a space, so the line keeps its length and can be overwritten in place. Obviously it won't work if the name of the target encoding is longer...

    open my $fh, '+<:utf8', 'file.in.xml' or die $!;
    my $line = <$fh>;
    $line=~s/<\?xml.+encoding="\Kutf-16"/utf-8" / or die "didn't match line: '$line'";
    seek $fh,0,0 or die $!;
    print $fh $line;
    close $fh;
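
The length constraint can be made explicit. The following is only a sketch, and patch_declared_encoding is a hypothetical helper name: it swaps the declared encoding for any name that is no longer than the old one, pads the difference with spaces (whitespace before ?> is allowed in an XML declaration), and refuses to proceed if the new name is longer. It works on the line as a string; writing the result back with seek and print would be done as in the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the declared encoding name in an XML
# declaration line, padding with spaces so the line keeps its length and
# can be written back over the original bytes in place.
sub patch_declared_encoding {
    my ($line, $new) = @_;
    $line =~ /\A(<\?xml\b[^>]*?\bencoding=)(["'])([A-Za-z][A-Za-z0-9._-]*)\2/
        or die "no encoding declaration found in: '$line'";
    my ($head, $quote, $old) = ($1, $2, $3);
    my $tail = substr $line, $+[0];          # everything after the closing quote

    my $pad = length($old) - length($new);
    die "'$new' is longer than '$old'; cannot patch in place" if $pad < 0;

    return $head . $quote . $new . $quote . (' ' x $pad) . $tail;
}

my $before = qq{<?xml version="1.0" encoding="utf-16"?>\n};
my $after  = patch_declared_encoding($before, 'utf-8');
print $after;
```
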

      Thanks for the tip

      I used a variation of this for the final processing to get the job out the door while I discuss the XML declaration issues with the Department of XML Generation Department.

Re: XML::Twig - Parsing XML file with incorrect encoding in declaration
by NetWallah (Canon) on Sep 01, 2017 at 23:30 UTC
      Can you try slurping in the XML content and using Perl to zap the encoding before feeding it to XML::Twig?

      That's what I figured I would need to do (reading in 128 MiB chunks in this case; 73 GiB is a bit much to slurp in all at once ;) ). I just wanted to see if there was an "official" method before I broke out the full brute-force nuclear option.

      (Of course it probably goes without saying that the ideal solution is to fix this fluster-cluck of an XML travesty...)

      EDIT:

      I just saw the SO link in your update after I posted my reply. That particular example will not work in this case because the data is not UTF-16 even though the encoding says otherwise.

        If it is too big to slurp, you could "pipe-filter" it through.

        I haven't tried this, but you should be able to set up a FIFO and pass a handle to LibXML to read.

        Then read your XML file, a record at a time, filter out the encoding garbage, and feed it to the FIFO.

        Probably need 2 threads to make this work in a single program.
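
The filtering half of that idea might look roughly like the sketch below. It is untested against a real FIFO or a second thread, and the function name copy_with_fixed_declaration is hypothetical. It assumes the whole XML declaration fits inside the first chunk and that both handles carry raw bytes. Here in-memory handles stand in for the input file and for the write end of the pipe.

```perl
use strict;
use warnings;

# Hypothetical filter: copy $in to $out in fixed-size chunks, rewriting
# the declared encoding only in the first chunk. Assumes the complete
# XML declaration is contained in that first chunk.
sub copy_with_fixed_declaration {
    my ($in, $out, $chunk_size) = @_;
    my $first = 1;
    while (1) {
        my $got = read $in, my $buf, $chunk_size;
        die "read failed: $!" unless defined $got;
        last if $got == 0;                       # end of input
        if ($first) {
            $buf =~ s/\A(<\?xml\b[^>]*?\bencoding=["'])utf-16(["'])/${1}utf-8$2/i;
            $first = 0;
        }
        print {$out} $buf or die "write failed: $!";
    }
    return;
}

# Demonstration with in-memory handles and a deliberately tiny chunk size.
my $source = qq{<?xml version="1.0" encoding="utf-16"?>\n<keys><key>1</key></keys>\n};
my $sink   = '';
open my $in,  '<', \$source or die $!;
open my $out, '>', \$sink   or die $!;
copy_with_fixed_declaration($in, $out, 64);
close $out or die $!;
print $sink;
```

Because the output is a stream and not an in-place overwrite, the new encoding name does not have to match the length of the old one here.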

                        All power corrupts, but we need electricity.
