http://www.perlmonks.org?node_id=1100261


in reply to Easy XML-parser that can handle large file?

Hello DreamT

I was in the same situation two years ago, facing for the first time the shaggy thing that XML is.
I tried many modules, starting with XML::Simple, whose name was intriguing. Here in the monastery there are many camps when it comes to XML parsing: XML::Parser, XML::LibXML, XML::Rules, XML::XSH2 (a wrapper around XML::LibXML) and XML::Twig.

I finally chose XML::Twig and now I'm very happy with the choice.

The central problem is the ability to parse XML in chunks instead of reading the whole file into memory. This feature (shared by the best modules) lets you parse huge files without memory problems.
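To make the chunk-by-chunk idea concrete, here is a minimal sketch using XML::Twig. The sample data, the element names (products, product, product_id, name) and the @seen array are assumptions made for illustration; for a really large file you would probably call parsefile() with a path instead of parse() on a string.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data, small enough to inline here.
my $xml = <<'XML';
<products>
  <product><product_id>ABC123</product_id><name>First</name></product>
  <product><product_id>XYZ789</product_id><name>Second</name></product>
</products>
XML

my @seen;    # collected "id - name" strings

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <product> element has been parsed.
        product => sub {
            my ( $t, $product ) = @_;
            push @seen, join ' - ',
              $product->first_child_text('product_id'),
              $product->first_child_text('name');
            $t->purge;    # free the memory used by everything parsed so far
        },
    },
);

$twig->parse($xml);
print "$_\n" for @seen;
```

The call to purge inside the handler is what should keep memory use flat: once a product has been handled, it is discarded before the next one is read.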

XML::Twig has many resources and many, many methods for parsing XML. You can find information on CPAN or on the Twig home site, where you will also find good tutorials.

So: 1) forget XML::Simple, 2) choose one of the modules suggested above, or jump directly to XML::Twig. Here are some scattered links about Perl and XML:

http://www.effectiveperlprogramming.com/2011/07/rewrite-xml-with-xmltwig/
http://www.effectiveperlprogramming.com/2010/03/process-xml-data-with-xmltwig/
http://it-is-etc.blogspot.it/2012/07/perl-how-to-manipulate-xml-files-using.html
http://perlmeme.org/tutorials/parsing_xml.html
speed comparison http://www.robinclarke.net/archives/xml-parsing-with-perl
http://www.xml.com/pub/a/2001/03/21/xmltwig.html
ambrus's "Do not reinvent the wheel: real-world example using XML::Twig", and also http://perl-xml.sourceforge.net/faq/ and choroba about XML


HtH
L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Re^2: Easy XML-parser that can handle large file?
by Discipulus (Canon) on Sep 11, 2014 at 08:20 UTC
    ..I'm so slow in responding...
    This is my best attempt with your data (it can surely be improved). UPDATE: the code was broken; it is now fixed.
    my $t = XML::Twig->new(
        pretty_print  => 'indented',
        twig_handlers => {
            'product' => sub {
                my @pname = $_[1]->get_xpath('name');
                my @pids  = $_[1]->get_xpath('product_id');
                print $pids[0]->text, " - ", $pname[0]->text, "\n";
                my %h;
                my @ids   = $_[1]->get_xpath('attributes/attribute/group/id');
                my @names = $_[1]->get_xpath('attributes/attribute/group/name');
                @h{ map { $_->text } @ids } = map { $_->text } @names;
                my @vids   = $_[1]->get_xpath('attributes/attribute/value/id');
                my @values = $_[1]->get_xpath('attributes/attribute/value/value');
                @h{ map { $_->text } @vids } = map { $_->text } @values;
                print map { "\t$_ - $h{$_}\n" } keys %h;
                print "\n\n";
            }
        }
    );
    $t->parse($xml);

    ####OUTPUT
    ABC123 - My product - 12.1998
        1561 - Lġngd (i mm)
        1507 - Engines
        1498 - Year model
        12033 - Vehicle equipment
        12019 - Maybe
        1518 - Year model (to)
        301 - Generator

    XYZ789 - My product - 12.1992
        1507 - Engines
        1498 - Year model
        1518 - Year model (to)
        301 - Generator
    HtH
    L*


      Nice.

      In case you, or the reader, don't know: in handlers $_ is aliased to $_[1], so you can write $_->get_xpath(...) instead of $_[1]->get_xpath(...). Beyond saving 3 characters each time, I am used to $_ meaning "the current element" within a handler, and I find it easier to read.
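A small sketch of the two equivalent spellings mentioned above, on made-up input. The element names (doc, title, para) and the @tags array are assumptions for illustration only.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Explicit form: the element is the second argument of the handler.
        title => sub { $_[1]->set_tag('h1') },

        # Shorter form: inside a handler $_ refers to the same element.
        para => sub { $_->set_tag('p') },
    },
);

# Hypothetical input, kept tiny on purpose.
$twig->parse('<doc><title>Hi</title><para>one</para><para>two</para></doc>');

# List the tags of the root's children after both handlers have run.
my @tags = map { $_->tag } $twig->root->children;
print "@tags\n";
```

Both handlers rename their element in the same way; the only difference is how the element is addressed.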

        A 'nice' from the module author... I'm honored... ;=)

        I never noticed that $_ is set to $_[1] (or rather, I used it unconsciously).

        It may be worth adding a few lines to the Synopsis:
        para => sub { $_[1]->set_tag( 'p') }, # change para to p (handlers receive $twig and $element as arguments)
        para => sub { $_->set_tag( 'p') },    # change para to p ($_ is aliased to $_[1] for convenience)

        ### and in the body of the docs:
        $_ is also set to the element (ie: $_[1]), so it is easy to write inline handlers like


        L*