PerlMonks  

Best XML library to validate XML from untrusted source

by vsespb (Chaplain)
on Oct 19, 2014 at 10:52 UTC ( [id://1104296]=perlquestion )

vsespb has asked for the wisdom of the Perl Monks concerning the following question:

It seems XML::Simple cannot be used for XML from an untrusted source (see here and here).
I'm looking for a way to pre-process/validate XML to make sure that XXE is not used in it before feeding it to XML::Simple, and it's OK for me to use another XML module for that purpose. (The other possibility is to drop XML::Simple and just use that other module, but I don't want to rewrite the whole code to use a new API and new data structures.)

Replies are listed 'Best First'.
Re: Best XML library to validate XML from untrusted source
by ikegami (Patriarch) on Oct 19, 2014 at 17:54 UTC

    I don't see how XXE relates to doing XML validation that couldn't be addressed by limiting memory and CPU usage (which you would need to do either way), but you can nullify it by using XML::LibXML's load_ext_dtd and expand_entities options (as mentioned in the document you linked, under the heading "libxml2").

    Rather than loading the entire document into memory, you'd want to use XML::LibXML's pull interface, XML::LibXML::Reader.

    $ cat a.xml
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE foo [
      <!ELEMENT foo ANY >
      <!ENTITY xxe SYSTEM "file:///etc/passwd" >]><foo>&xxe;</foo>

    $ perl -MXML::LibXML::Reader -e'
       my $reader = XML::LibXML::Reader->new(
          location        => $ARGV[0],
          load_ext_dtd    => 0,
          expand_entities => 0,
       );
       while ($reader->read) {
          printf("%d %d %s\n",
             $reader->depth,
             $reader->nodeType,
             $reader->name,
          );
       }
    ' a.xml
    0 10 foo
    0 1 foo
    1 5 xxe
    0 15 foo

    $ cat bad.xml
    <foo><bar></foo>

    $ perl -MXML::LibXML::Reader -e'
       exit 1 if !eval {
          my $reader = XML::LibXML::Reader->new(
             location        => $ARGV[0],
             load_ext_dtd    => 0,
             expand_entities => 0,
          );
          1 while $reader->read;
          1
       };
    ' bad.xml

    $ if [ $? -eq 0 ]; then echo "well-formed" ; else echo "error" ; fi
    error

      XML::LibXML::Reader is way too low-level, and while the pull style tends to lead to (very slightly) more readable code than ordinary node-level push, it's still nothing I would dare to recommend ... to anyone.

      XML::Rules and XML::Twig give you the file in bite-sized chunks, which IMNSHO works much better than forcing a decomposition into individual atoms.

      Speaking of XML::Rules ... it's based on XML::Parser::Expat and allows setting its handlers, so I think pointing Expat's ExternEnt handler at your own code should give vsespb the protection he's after.
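
A minimal sketch of the ExternEnt idea, for illustration only. It uses XML::Parser directly rather than XML::Rules (an assumption made to keep the example short; how XML::Rules passes handlers through is not shown here). According to the XML::Parser documentation, the ExternEnt handler is called with (Expat, Base, Sysid, Pubid); this version simply refuses every external entity by dying. The helper name parse_without_external_entities is hypothetical.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: parse $xml while refusing every external entity.
# Returns 1 on success; dies on an external entity or on malformed input.
sub parse_without_external_entities {
    my ($xml) = @_;

    my $parser = XML::Parser->new(
        Handlers => {
            # Expat calls this whenever an external entity is referenced.
            ExternEnt => sub {
                my ($expat, $base, $sysid, $pubid) = @_;
                die "external entity refused: $sysid\n";
            },
        },
    );

    $parser->parse($xml);    # dies if the document is not well-formed
    return 1;
}
```

Dying is the strictest choice: the whole request is rejected. Returning an empty string from the handler instead would let the document parse with the entity replaced by nothing.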

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

        I have no idea why you wouldn't recommend

        use XML::LibXML::Reader qw( );

        my $reader = XML::LibXML::Reader->new(
           location        => $file_or_url,
           load_ext_dtd    => 0,
           expand_entities => 0,
        );

        1 while $reader->read;

        Wrapping this up just so you get something you can call higher-level is simply pure waste.

      Thank you for all your replies! Very useful. One note:
      I don't see how XXE relates to doing XML validation that couldn't be addressed by limiting memory and CPU usage (which you would need to do either way),
      XXE is not just about DoS. For example, I have an API which accepts requests as XML.
      There is one API function to create an object (with a user-supplied name) and another to list all objects with their names.
      So an attacker can create an object whose name is the content of /etc/passwd and then list it, thereby receiving the content of /etc/passwd.
      IMHO, a pretty common case...
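
One way to close the hole described above is to reject any request that carries a DOCTYPE at all, since entities such as xxe can only be declared there. The following is a sketch under that assumption (an API that never needs DTDs), built on the XML::LibXML::Reader options shown earlier in the thread. The helper name assert_no_doctype is hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;    # exports the XML_READER_TYPE_* constants

# Hypothetical helper: returns 1 if $xml is well-formed and has no DOCTYPE,
# dies otherwise.
sub assert_no_doctype {
    my ($xml) = @_;

    my $reader = XML::LibXML::Reader->new(
        string          => $xml,
        load_ext_dtd    => 0,
        expand_entities => 0,
    ) or die "cannot create reader\n";

    # read() returns a true value per node and 0 at the end of the document.
    while ( my $status = $reader->read ) {
        die "parse error\n" if $status < 0;
        die "DOCTYPE declarations are not accepted\n"
            if $reader->nodeType == XML_READER_TYPE_DOCUMENT_TYPE;
    }
    return 1;
}
```

Only input that passes this check would then be handed on to XML::Simple, for example by calling assert_no_doctype($xml) right before XMLin($xml), so the existing data structures stay unchanged.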
Re: Best XML library to validate XML from untrusted source
by kennethk (Abbot) on Oct 19, 2014 at 15:11 UTC
    From XML::Simple:

    The use of this module in new code is discouraged. Other modules are available which provide more straightforward and consistent interfaces. In particular, XML::LibXML is highly recommended.

    The major problems with this module are the large number of options and the arbitrary ways in which these options interact - often with unexpected results.

    Patches with bug fixes and documentation fixes are welcome, but new features are unlikely to be added.

    Essentially, the author has declared it broken by design. My understanding is that the general advice these days is to use XML::Twig or XML::LibXML. I am not sure how vulnerable they are to untrusted sources.

    I'm aware that this is a bit off point, and you specifically didn't want to recraft the code to use a new API, but....


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      OK, I tested XML::LibXML - it is vulnerable.

        I think you're supposed to disable external requests using various constructor parameters (as in XML::LibXML::Parser.pod).

        I presume that ext_ent_handler with your own callback to handle external entities would be enough, but I would still use or no_network to be on the safe(r) side.
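
The constructor parameters mentioned above could be combined roughly as follows. This is a sketch, not a guarantee: whether these options are sufficient depends on the installed XML::LibXML and libxml2 versions, so it is worth testing against your own setup. The helper name parse_untrusted is hypothetical; load_ext_dtd, expand_entities and no_network are parser options documented in XML::LibXML::Parser.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse untrusted XML with external access disabled.
# Returns an XML::LibXML::Document; dies on malformed input.
sub parse_untrusted {
    my ($xml) = @_;

    my $parser = XML::LibXML->new(
        load_ext_dtd    => 0,    # do not fetch an external DTD subset
        expand_entities => 0,    # leave entity references unexpanded
        no_network      => 1,    # forbid network access while parsing
    );

    return $parser->parse_string($xml);
}
```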

      Tested XML::Twig - seems vulnerable as well.
Re: Best XML library to validate XML from untrusted source
by ikegami (Patriarch) on Oct 19, 2014 at 17:26 UTC
    Note: XML::Simple doesn't do any parsing. It uses one of many other parsers.
      Yes, I tested with the default parser (one of the SAX ones). The problem is that there is no way to control its behaviour with untrusted data.
