http://www.perlmonks.org?node_id=11103629


in reply to parse XML huge file using cpan modules

There are generally two ways to handle XML in Perl: XML::LibXML, which uses an industry-standard binary library (libxml2) to turn the XML into an in-memory data structure, and XML::Twig, which walks through the data invoking your subroutines along the way, without reading it all into memory. Both are mature and known to work correctly with any well-formed XML. If you're processing a terabyte XML file with only gigabytes of memory, which sometimes happens, use Twig. It can do it.

Re^2: parse XML huge file using cpan modules
by choroba (Cardinal) on Jul 30, 2019 at 20:40 UTC
    XML::LibXML can do it as well, as was shown here in this very thread. XML::LibXML::Reader works similarly to XML::Twig - it doesn't keep the whole data in memory, but if you tell it to, it can "inflate" a part of the data into a full-featured XML::LibXML object you can process using all the available methods.
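
A minimal sketch of that "inflate" step, assuming a made-up `<library>`/`<book>` document (the element names and the sample data are illustrative, not taken from the original question). It walks the input with `nextElement` and turns only the current `<book>` subtree into a DOM node with `copyCurrentNode(1)`:

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample data; for a huge file you would pass
# location => 'huge.xml' instead of string => $xml.
my $xml = <<'XML';
<library>
  <book id="1"><title>Dune</title><year>1965</year></book>
  <book id="2"><title>Neuromancer</title><year>1984</year></book>
</library>
XML

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "Cannot create reader\n";

my @titles;
# nextElement returns 1 each time it reaches another <book> start tag.
while ($reader->nextElement('book') == 1) {
    # Inflate only this <book> subtree into a regular DOM node.
    my $book = $reader->copyCurrentNode(1);
    # From here on the usual XML::LibXML methods are available.
    push @titles, $book->findvalue('title');
}

print "$_\n" for @titles;
```

Only one `<book>` is held as a DOM tree at any moment, so memory use should depend on the size of a single record rather than on the size of the file.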

Re^2: parse XML huge file using cpan modules
by Jenda (Abbot) on Jul 31, 2019 at 09:31 UTC

    Well, yes but no.

    There are more than two ways, and your hailed "industry-standard binary libraries" often support several of them.

    You can use one of several libraries to load the whole file into memory as a huge maze of objects and then search and navigate the maze using methods and sublanguages like XPath.
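
For illustration, one way this "maze of objects plus XPath" approach might look with XML::LibXML. The document and element names are made up for the example, and the whole input is parsed into memory, so this only suits files that fit in RAM:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data; load_xml(location => 'file.xml') reads a file.
my $xml = <<'XML';
<library>
  <book id="1"><title>Dune</title><year>1965</year></book>
  <book id="2"><title>Neuromancer</title><year>1984</year></book>
</library>
XML

# Parse everything into a DOM tree.
my $doc = XML::LibXML->load_xml(string => $xml);

# Navigate the tree with XPath: titles of books published after 1970.
my @titles = map { $_->textContent }
             $doc->findnodes('/library/book[year > 1970]/title');

print "$_\n" for @titles;
```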

    You can use one of several libraries to load the whole file into memory as a huge memory structure (possibly with a bit of tie() magic) and navigate it using normal Perl tools. You should NOT use XML::Simple for that 'cause it produces inconsistent data structures! If the data structure is your goal, have a look at XML::Rules; it lets you produce a consistent structure and trim it along the way.

    You can use one of several libraries to have them call your handlers whenever they find another bit of whatever in the XML and take care of knowing where the heck you are in the structure yourself. Good luck with that! Industry standard or no industry standard. It's a mess.
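
To show what "knowing where you are yourself" means in practice, here is a small sketch using XML::Parser's Start/Char/End handlers. The sample document and the `@path` bookkeeping are assumptions made for the example; the point is that the handlers only see isolated events, so the position in the tree has to be tracked by hand:

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample data.
my $xml = <<'XML';
<library>
  <book id="1"><title>Dune</title><year>1965</year></book>
  <book id="2"><title>Neuromancer</title><year>1984</year></book>
</library>
XML

my @path;          # stack of currently open elements, maintained by hand
my @titles;        # collected results
my $text = '';     # text accumulated for the current <title>

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $tag) = @_;
            push @path, $tag;
            $text = '' if join('/', @path) eq 'library/book/title';
        },
        Char => sub {
            my ($expat, $chunk) = @_;
            # Text may arrive in several pieces, so append rather than assign.
            $text .= $chunk if join('/', @path) eq 'library/book/title';
        },
        End => sub {
            my ($expat, $tag) = @_;
            push @titles, $text if join('/', @path) eq 'library/book/title';
            pop @path;
        },
    },
);
$parser->parse($xml);

print "$_\n" for @titles;
```

Even for this trivial extraction, three handlers and two pieces of shared state are needed, which is the mess the paragraph above complains about.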

    You can use one of several libraries to give you the next bit whenever you ask for it and take care of knowing where the heck you are in the structure yourself. Good luck with that! Industry standard or no industry standard. It's a mess.

    You can use XML::Twig to call your handler whenever it finishes parsing a reasonably large, easy to digest chunk of the XML (a twig) and have it provide you with the data from the twig either as a maze of objects or a data structure.
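
A short sketch of the twig approach, assuming a made-up `<library>` of `<book>` records (names and data are illustrative only). The handler fires once per complete `<book>`, reads it through the object interface, and then purges what has been parsed so far:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data; for a huge file use $twig->parsefile('huge.xml').
my $xml = <<'XML';
<library>
  <book id="1"><title>Dune</title><year>1965</year></book>
  <book id="2"><title>Neuromancer</title><year>1984</year></book>
</library>
XML

my @seen;
my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <book> element has been parsed.
        book => sub {
            my ($t, $book) = @_;
            push @seen, $book->att('id') . ': ' . $book->first_child_text('title');
            # Release everything parsed so far to keep memory use flat.
            $t->purge;
        },
    },
);
$twig->parse($xml);

print "$_\n" for @seen;
```

If a plain data structure is preferred over objects, XML::Twig elements also have a `simplify` method; its exact output shape depends on the options passed, so check the documentation before relying on it.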

    You can use XML::Rules to call your handler whenever it finishes parsing a reasonably large, easy to digest chunk of the XML and have it provide you with the data from the chunk as a data structure built according to the rules you provided. You can handle or massage the data in any way you need and have the result made available to the handler of an enclosing chunk. That way you either process the file as you go or build a modified, trimmed down data structure.
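
A rough sketch of the rules idea, again on a made-up `<library>`/`<book>` document. It assumes the built-in `'content'` rule and the `_default` key as described in the XML::Rules documentation; the rule set itself is only an illustration. The `<title>` and `<year>` results are handed to the enclosing `<book>` handler, which processes the record and returns nothing so that no data accumulates:

```perl
use strict;
use warnings;
use XML::Rules;

# Hypothetical sample data; parsefile('huge.xml') would read from disk.
my $xml = <<'XML';
<library>
  <book id="1"><title>Dune</title><year>1965</year></book>
  <book id="2"><title>Neuromancer</title><year>1984</year></book>
</library>
XML

my @books;
my $parser = XML::Rules->new(
    rules => [
        # By default keep only the text content: <title>X</title>
        # becomes title => 'X' in the hash of the enclosing tag.
        _default => 'content',
        # The handler gets the tag name and a hash holding the attributes
        # plus whatever the child rules returned.
        book => sub {
            my ($tag, $data) = @_;
            push @books, "$data->{id}: $data->{title} ($data->{year})";
            return;    # pass nothing up, so nothing is kept in memory
        },
    ],
);
$parser->parse($xml);

print "$_\n" for @books;
```

Returning a trimmed hash from the `book` handler instead of an empty list would build a reduced data structure rather than processing records as they arrive.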

    Jenda

Re^2: parse XML huge file using cpan modules
by nicopelle (Acolyte) on Jul 31, 2019 at 08:47 UTC
    Thanks for your tips.