http://www.perlmonks.org?node_id=1100255

DreamT has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm mainly used to working with XML::Simple, so my XML parsing skills are what you could call "novice" ;-)
However, I now have a problem: I need to process a fairly large file (12.2 MB), and XML::Simple croaked with a "killed" message. I've also tried XML::Bare, which worked great on my local computer, but on the server it croaked too, with "Segmentation fault".
I suspect that the file is too large for these modules to process.
So, here are my questions:
1. Do you know how I can "tweak" the modules above to optimize the performance?
2. If not, what other module can do the job? I tried XML::Parser, but frankly I didn't find a good way to browse the data - I simply didn't "get" how to use it well :) (I'm used to accessing the data the way XML::Simple/XML::Bare serve it.)

Example data below. I want to browse each -product- to fetch -product_id- and loop over -attributes- to get the values of each -attribute- tag.
<?xml version="1.0" encoding="ISO-8859-1"?>
<feed>
  <timestamp>Thu, 11 Sep 2014 08:58:59 +0200</timestamp>
  <language_product_id>sv</language_product_id>
  <products>
    <product>
      <product_id>ABC123</product_id>
      <name>My product</name>
      <attributes>
        <attribute>
          <group> <id>1507</id> <name>Engines</name> </group>
          <value> <id>301</id> <value>Generator</value> </value>
        </attribute>
        <attribute>
          <group> <id>1561</id> <name>Längd (i mm)</name> </group>
          <value> <id></id> <value>2625</value> </value>
        </attribute>
        <attribute>
          <group> <id>1498</id> <name>Year model</name> </group>
          <value> <id></id> <value>01.1994</value> </value>
        </attribute>
        <attribute>
          <group> <id>1518</id> <name>Year model (to)</name> </group>
          <value> <id></id> <value>12.1998</value> </value>
        </attribute>
        <attribute>
          <group> <id>12033</id> <name>Vehicle equipment</name> </group>
          <value> <id>12019</id> <value>Maybe</value> </value>
        </attribute>
      </attributes>
      <references />
    </product>
    <product>
      <product_id>XYZ789</product_id>
      <name>My product</name>
      <attributes>
        <attribute>
          <group> <id>1507</id> <name>Engines</name> </group>
          <value> <id>301</id> <value>Generator</value> </value>
        </attribute>
        <attribute>
          <group> <id>1498</id> <name>Year model</name> </group>
          <value> <id></id> <value>01.1985</value> </value>
        </attribute>
        <attribute>
          <group> <id>1518</id> <name>Year model (to)</name> </group>
          <value> <id></id> <value>12.1992</value> </value>
        </attribute>
      </attributes>
      <references />
    </product>
  </products>
</feed>

Re: Easy XML-parser that can handle large file?
by Corion (Patriarch) on Sep 11, 2014 at 07:22 UTC

    As an alternative to XML::Rules, there is also XML::Twig, which is basically the same idea with a slightly different interface. It requires XML::Parser as a prerequisite.

    If you want to see all the XPath expressions that occur in an XML file, here's a program I use to discover the structure of an XML file in the absence of an XSD:

    #!perl
    use strict;
    use XML::Twig;

    my %path;

    sub handle_tag {
        my( $twig )= @_;
        my $tag= $_;
        my $path= $tag->path( );
        print $path, "\n" unless $path{ $path }++;
        $tag->purge;
    };

    my $twig=XML::Twig->new(
        twig_handlers => {
            _all_ => \&handle_tag,
        },
    );
    $twig->parsefile( $ARGV[0] );

    print "\n-------\n\n";
    for my $k (sort keys %path) {
        print "$k\t$path{ $k }\n";
    };
Re: Easy XML-parser that can handle large file?
by Discipulus (Canon) on Sep 11, 2014 at 07:32 UTC
      .. i'm so slow in responding...
      This is my best with your data (surely can be improved): UPDATE: the code was broken, updated...
      my $t= XML::Twig->new(
          pretty_print  => 'indented',
          twig_handlers => {
              'product' => sub {
                  my @pname = $_[1]->get_xpath('name');
                  my @pids  = $_[1]->get_xpath('product_id');
                  print $pids[0]->text, " - ", $pname[0]->text, "\n";
                  my %h;
                  my @ids   = $_[1]->get_xpath('attributes/attribute/group/id');
                  my @names = $_[1]->get_xpath('attributes/attribute/group/name');
                  @h{ map {$_->text} @ids } = map {$_->text} @names;
                  my @vids   = $_[1]->get_xpath('attributes/attribute/value/id');
                  my @values = $_[1]->get_xpath('attributes/attribute/value/value');
                  @h{ map {$_->text} @vids } = map {$_->text} @values;
                  print map {"\t$_ - $h{$_}\n"} keys %h;
                  print "\n\n";
              }
          }
      );
      $t->parse($xml);

      ####OUTPUT
      ABC123 - My product
               - 12.1998
              1561 - Lõngd (i mm)
              1507 - Engines
              1498 - Year model
              12033 - Vehicle equipment
              12019 - Maybe
              1518 - Year model (to)
              301 - Generator


      XYZ789 - My product
               - 12.1992
              1507 - Engines
              1498 - Year model
              1518 - Year model (to)
              301 - Generator
      HtH
      L*

      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

        Nice.

        In case you, or the reader, don't know: in handlers $_ is aliased to $_[1], so you can write $_->get_xpath(...) instead of $_[1]->get_xpath(...). Beyond saving 3 characters each time, I am used to $_ meaning "the current element" within a handler, and I find it easier to read.
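
A minimal sketch of that style, applied to a cut-down version of the feed from the question. It is only an illustration, not code from any of the posters: the inline `$xml` sample and the `@rows` array are assumptions made up for the example. The handler reads the current `<product>` through `$_`, walks its `<attribute>` children, and then purges the twig so that already-handled products do not pile up in memory.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical inline sample; for a real feed, call $twig->parsefile($path) instead.
my $xml = <<'XML';
<feed><products>
  <product>
    <product_id>ABC123</product_id>
    <attributes>
      <attribute>
        <group><id>1507</id><name>Engines</name></group>
        <value><id>301</id><value>Generator</value></value>
      </attribute>
      <attribute>
        <group><id>1498</id><name>Year model</name></group>
        <value><id></id><value>01.1994</value></value>
      </attribute>
    </attributes>
  </product>
</products></feed>
XML

my @rows;    # one "product_id<TAB>group name<TAB>value" string per attribute

my $twig = XML::Twig->new(
    twig_handlers => {
        product => sub {
            # $_ is the current <product> element (same object as $_[1])
            my $id = $_->first_child_text('product_id');

            # a named loop variable keeps $_ pointing at the product
            for my $attr ( $_->get_xpath('attributes/attribute') ) {
                my $name  = $attr->first_child('group')->first_child_text('name');
                my $value = $attr->first_child('value')->first_child_text('value');
                push @rows, join "\t", $id, $name, $value;
            }

            # $_[0] is the twig; purge discards what has been fully parsed so far
            $_[0]->purge;
        },
    },
);
$twig->parse($xml);

print "$_\n" for @rows;
```

Pairing each group with its value inside a single `<attribute>` also sidesteps building one hash keyed on ids, which would let the empty `<id></id>` entries overwrite each other.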

Re: Easy XML-parser that can handle large file? ( XML::Rules )
by Anonymous Monk on Sep 11, 2014 at 07:13 UTC
      To use XML::LibXML on large files, use the pull parser XML::LibXML::Reader. For example, the following script

      produces the following output:

      ABC123
      group [ 1507 : Engines ]
      value [ 301 : Generator ]
      group [ 1561 : Längd (i mm) ]
      value [ : 2625 ]
      group [ 1498 : Year model ]
      value [ : 01.1994 ]
      group [ 1518 : Year model (to) ]
      value [ : 12.1998 ]
      group [ 12033 : Vehicle equipment ]
      value [ 12019 : Maybe ]
      XYZ789
      group [ 1507 : Engines ]
      value [ 301 : Generator ]
      group [ 1498 : Year model ]
      value [ : 01.1985 ]
      group [ 1518 : Year model (to) ]
      value [ : 12.1992 ]
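
A minimal sketch of the pull-parser approach with XML::LibXML::Reader. This is not the poster's script, just one way it might be done: the inline `$xml` sample, the `@lines` array, and the exact line format are assumptions for illustration. The reader steps from one `<product>` to the next and copies only that subtree into memory, so the whole file never has to be loaded as a single tree.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical inline sample; for a real file use
# XML::LibXML::Reader->new( location => $path ) instead.
my $xml = <<'XML';
<feed><products>
  <product>
    <product_id>ABC123</product_id>
    <attributes>
      <attribute>
        <group><id>1507</id><name>Engines</name></group>
        <value><id>301</id><value>Generator</value></value>
      </attribute>
      <attribute>
        <group><id>1498</id><name>Year model</name></group>
        <value><id></id><value>01.1994</value></value>
      </attribute>
    </attributes>
  </product>
</products></feed>
XML

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot create reader\n";

my @lines;    # output lines, collected so they can be inspected afterwards

# nextElement('product') returns 1 each time it lands on a <product> element
while ( $reader->nextElement('product') == 1 ) {

    # deep copy of this one <product> subtree as an ordinary DOM node
    my $product = $reader->copyCurrentNode(1);
    push @lines, $product->findvalue('product_id');

    for my $attr ( $product->findnodes('attributes/attribute') ) {
        for my $part (qw(group value)) {
            my ($node) = $attr->findnodes($part);
            # <group> holds id + name, <value> holds id + value
            my $label = $part eq 'group' ? 'name' : 'value';
            push @lines, sprintf '  %s [ %s : %s ]',
                $part, $node->findvalue('id'), $node->findvalue($label);
        }
    }
}

print "$_\n" for @lines;
```

Because each copied `<product>` is a regular XML::LibXML node, the usual XPath calls (`findnodes`, `findvalue`) work on it, which keeps the code close to tree-style access.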
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Easy XML-parser that can handle large file?
by jellisii2 (Hermit) on Sep 11, 2014 at 11:51 UTC
    Long live mirod! May $DEITY bless his name!
    Long live twig! Helping XML be sane!

    I may be a tiny bit biased...