http://www.perlmonks.org?node_id=1104326


in reply to Best XML library to validate XML from untrusted source

I don't see how XXE relates to doing XML validation that couldn't be addressed by limiting memory and CPU usage (which you would need to do either way), but you can nullify it by setting XML::LibXML's load_ext_dtd and expand_entities options to 0 (as mentioned in the document you linked, under the heading "libxml2").

Rather than loading the entire document into memory, you'd want to use XML::LibXML's pull interface, XML::LibXML::Reader.

$ cat a.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE foo [
<!ELEMENT foo ANY >
<!ENTITY xxe SYSTEM "file:///etc/passwd" >]><foo>&xxe;</foo>

$ perl -MXML::LibXML::Reader -e'
   my $reader = XML::LibXML::Reader->new(
      location => $ARGV[0],
      load_ext_dtd => 0,
      expand_entities => 0,
   );
   while ($reader->read) {
      printf("%d %d %s\n",
         $reader->depth,
         $reader->nodeType,
         $reader->name,
      );
   }
' a.xml
0 10 foo
0 1 foo
1 5 xxe
0 15 foo

$ cat bad.xml
<foo><bar></foo>

$ perl -MXML::LibXML::Reader -e'
   exit 1 if !eval {
      my $reader = XML::LibXML::Reader->new(
         location => $ARGV[0],
         load_ext_dtd => 0,
         expand_entities => 0,
      );
      1 while $reader->read;
      1
   };
' bad.xml

$ if [ $? -eq 0 ]; then echo "well-formed" ; else echo "error" ; fi
error
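
In the first run above, the DOCTYPE is reported as node type 10 and the unexpanded &xxe; reference as node type 5. If the application has no legitimate use for either, one option is to reject any document that contains them. The following is a minimal sketch of that idea, not a vetted security control; the helper name is_acceptable is hypothetical, and it assumes the :types export tag of XML::LibXML::Reader for the node-type constants.

```perl
use strict;
use warnings;
use XML::LibXML::Reader qw(:types);

# Hypothetical helper: returns 1 if $xml is well-formed and contains
# neither a DOCTYPE nor an entity reference, 0 otherwise.
sub is_acceptable {
    my ($xml) = @_;
    my $ok = eval {
        my $reader = XML::LibXML::Reader->new(
            string          => $xml,
            load_ext_dtd    => 0,
            expand_entities => 0,
        );
        my $rc;
        while (($rc = $reader->read) == 1) {
            my $type = $reader->nodeType;
            die "DOCTYPE not allowed\n"
                if $type == XML_READER_TYPE_DOCUMENT_TYPE;
            die "entity reference not allowed\n"
                if $type == XML_READER_TYPE_ENTITY_REFERENCE;
        }
        # read() returns 0 at end of input; treat anything else as failure.
        die "read error\n" if $rc != 0;
        1;
    };
    return $ok ? 1 : 0;
}
```

Called as is_acceptable($request_body), it only says whether the document passes; limiting memory and CPU usage still has to be handled separately.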

Re^2: Best XML library to validate XML from untrusted source
by Jenda (Abbot) on Oct 20, 2014 at 15:07 UTC

    XML::LibXML::Reader is way too low-level, and while the pull style tends to lead to (very slightly) more readable code than ordinary node-level push, it's still nothing I would dare to recommend ... to anyone.

    XML::Rules and XML::Twig give you the file in bite-sized chunks, which IMNSHO works much better than forcing a decomposition into individual atoms.

    Speaking of XML::Rules ... it's based on XML::Parser::Expat and allows setting its handlers, so I think setting Expat's ExternEnt handler to your own should provide vsespb with the protection he's after.
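
A minimal sketch of the ExternEnt idea, using XML::Parser directly rather than XML::Rules (how XML::Rules passes handlers through is not shown here and should be checked against its documentation). It relies on the documented XML::Parser behaviour that an ExternEnt handler returning undef makes the entity reference a parse error. The helper name parse_refusing_external is hypothetical.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns 1 if $xml parses without needing any
# external entity, 0 if it is malformed or references one.
sub parse_refusing_external {
    my ($xml) = @_;
    my $parser = XML::Parser->new(
        Handlers => {
            # Returning undef tells Expat the entity could not be
            # loaded, so the reference becomes a parse error instead
            # of a file or network read.
            ExternEnt => sub { return undef },
        },
    );
    my $ok = eval { $parser->parse($xml); 1 };
    return $ok ? 1 : 0;
}
```

Returning an empty string from the handler instead would silently drop the entity's content rather than reject the document; which behaviour is preferable depends on the application.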

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      I have no idea why you wouldn't recommend

      use XML::LibXML::Reader qw( );

      my $reader = XML::LibXML::Reader->new(
         location => $file_or_url,
         load_ext_dtd => 0,
         expand_entities => 0,
      );
      1 while $reader->read;

      Wrapping this up just so you get something you can call higher-level is simply pure waste.

        Say, because it doesn't do anything? I mean, yes, it does some kind of basic format validation, but once you actually need to extract some data out of the file, things start getting complicated very quickly.

        Jenda

Re^2: Best XML library to validate XML from untrusted source
by vsespb (Chaplain) on Oct 23, 2014 at 12:14 UTC
    Thank you for all your replies! Very useful. One note:
    I don't see how XXE relates to doing XML validation that couldn't be addressed by limiting memory and CPU usage (which you would need to do either way),
    XXE is not just about DoS. For example, I have an API that accepts requests as XML.
    There is one API function that creates an object (with a user-supplied name), and another that lists all objects with their names.
    So an attacker can create an object whose name is the content of /etc/passwd (via an external entity) and then list the objects, thereby receiving the content of /etc/passwd.
    IMHO, a pretty common case...