Easy XML-parser that can handle large file?

DreamT has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm mainly used to working with XML::Simple, so my XML parsing skills are what you could call "novice" ;-)
However, I now have a problem: I need to process a fairly large file (12.2 MB), and XML::Simple croaked with a "killed" message. I've also tried XML::Bare, which worked great on my local computer, but on the server it croaked too, with "Segmentation fault".
I suspect that the file is too large for these modules to process.
So, here are my questions:
1. Do you know how I can "tweak" the modules above to optimize their performance?
2. If not, what other module can do the job? I tried XML::Parser, but frankly I didn't find a good way to browse the data; I simply didn't "get" how to use it well :) (I'm used to accessing the data the way XML::Simple/XML::Bare serve it.)

Example data below. I want to browse each -product- to fetch -product_id- and loop over -attributes- to get the values of each -attribute- tag.

<?xml version="1.0" encoding="ISO-8859-1"?>
<feed>
  <timestamp>Thu, 11 Sep 2014 08:58:59 +0200</timestamp>
  <language_product_id>sv</language_product_id>
  <products>
    <product>
      <product_id>ABC123</product_id>
      <name>My product</name>
      <attributes>
        <attribute>
          <group>
            <id>1507</id>
            <name>Engines</name>
          </group>
          <value>
            <id>301</id>
            <value>Generator</value>
          </value>
        </attribute>
        <attribute>
          <group>
            <id>1561</id>
            <name>Längd (i mm)</name>
          </group>
          <value>
            <id></id>
            <value>2625</value>
          </value>
        </attribute>
        <attribute>
          <group>
            <id>1498</id>
            <name>Year model</name>
          </group>
          <value>
            <id></id>
            <value>01.1994</value>
          </value>
        </attribute>
        <attribute>
          <group>
            <id>1518</id>
            <name>Year model (to)</name>
          </group>
          <value>
            <id></id>
            <value>12.1998</value>
          </value>
        </attribute>
        <attribute>
          <group>
            <id>12033</id>
            <name>Vehicle equipment</name>
          </group>
          <value>
            <id>12019</id>
            <value>Maybe</value>
          </value>
        </attribute>
      </attributes>
      <references />
    </product>
    <product>
      <product_id>XYZ789</product_id>
      <name>My product</name>
      <attributes>
        <attribute>
          <group>
            <id>1507</id>
            <name>Engines</name>
          </group>
          <value>
            <id>301</id>
            <value>Generator</value>
          </value>
        </attribute>
        <attribute>
          <group>
            <id>1498</id>
            <name>Year model</name>
          </group>
          <value>
            <id></id>
            <value>01.1985</value>
          </value>
        </attribute>
        <attribute>
          <group>
            <id>1518</id>
            <name>Year model (to)</name>
          </group>
          <value>
            <id></id>
            <value>12.1992</value>
          </value>
        </attribute>
      </attributes>
      <references />
    </product>
  </products>
</feed>


Replies are listed 'Best First'.
Re: Easy XML-parser that can handle large file? by Corion (Patriarch) on Sep 11, 2014 at 07:22 UTC
As an alternative to XML::Rules, there is also XML::Twig, which is basically the same but a little different. It requires XML::Parser as a prerequisite.

If you want to see all the XPath expressions that occur in an XML file, here's a program I use to find the structure of an XML file in the absence of an XSD:

```perl
#!perl
use strict;
use XML::Twig;

my %path;

sub handle_tag {
    my( $twig )= @_;
    my $tag= $_;
    my $path= $tag->path( );
    print $path, "\n" unless $path{ $path }++;
    $tag->purge;
};

my $twig=XML::Twig->new(
    twig_handlers => {
        _all_ => \&handle_tag,
    },
);
$twig->parsefile( $ARGV[0] );

print "\n-------\n\n";
for my $k (sort keys %path) {
    print "$k\t$path{ $k }\n";
};
```
Re: Easy XML-parser that can handle large file? by Discipulus (Canon) on Sep 11, 2014 at 07:32 UTC
Hello DreamT,

I was in the same situation two years ago, facing for the first time the shaggy thing that XML is. I tried many modules, starting with XML::Simple, whose name was intriguing. Here in the monastery there are many camps when it comes to XML parsing: XML::Parser, XML::LibXML, XML::Rules, XML::XSH2 (a wrapper around XML::LibXML) and XML::Twig. I finally chose XML::Twig and now I'm very happy with the choice.

The central problem is the ability to parse XML in chunks instead of reading the whole file. This feature (shared by the best modules) lets you parse huge files without memory problems. XML::Twig has many resources and many, many methods for parsing XML. You can find information on CPAN or on the Twig home site, where you will also find good tutorials.

So: 1) forget XML::Simple, 2) choose one of the modules suggested, or jump directly to XML::Twig.

Here are some scattered links about Perl and XML:
http://www.effectiveperlprogramming.com/2011/07/rewrite-xml-with-xmltwig/
http://www.effectiveperlprogramming.com/2010/03/process-xml-data-with-xmltwig/
http://it-is-etc.blogspot.it/2012/07/perl-how-to-manipulate-xml-files-using.html
http://perlmeme.org/tutorials/parsing_xml.html
speed comparison http://www.robinclarke.net/archives/xml-parsing-with-perl
http://www.xml.com/pub/a/2001/03/21/xmltwig.html
ambrus's Do not reinvent the wheel: real-world example using XML::Twig
and also http://perl-xml.sourceforge.net/faq/ and choroba about XML

HtH
L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
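To make the "parse in chunks" idea concrete, here is a minimal sketch with XML::Twig. It is not code from any of the replies: the inline sample is a cut-down, hypothetical version of the feed in the question, and the `@seen` array exists only so the result can be inspected afterwards. For a real file you would presumably call `$twig->parsefile($path)` instead of `parse`.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical, shortened sample based on the structure shown in the question.
my $xml = <<'XML';
<feed>
  <products>
    <product>
      <product_id>ABC123</product_id>
      <attributes>
        <attribute>
          <group><id>1507</id><name>Engines</name></group>
          <value><id>301</id><value>Generator</value></value>
        </attribute>
        <attribute>
          <group><id>1498</id><name>Year model</name></group>
          <value><id></id><value>01.1994</value></value>
        </attribute>
      </attributes>
    </product>
    <product>
      <product_id>XYZ789</product_id>
      <attributes>
        <attribute>
          <group><id>1507</id><name>Engines</name></group>
          <value><id>301</id><value>Generator</value></value>
        </attribute>
      </attributes>
    </product>
  </products>
</feed>
XML

my @seen;    # illustration only: product ids in the order they were handled

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <product> element has been parsed.
        product => sub {
            my ($t, $product) = @_;
            my $id = $product->first_child_text('product_id');
            print "$id\n";
            for my $attr ($product->get_xpath('attributes/attribute')) {
                my $name  = $attr->first_child('group')->first_child_text('name');
                my $value = $attr->first_child('value')->first_child_text('value');
                print "\t$name = $value\n";
            }
            push @seen, $id;
            # Discard what has been parsed so far, so memory use stays bounded.
            $t->purge;
        },
    },
);
$twig->parse($xml);
```

The point of the design is the `purge` call at the end of the handler: only one product at a time needs to be held in memory, which is what should let this approach cope with files that are too big for XML::Simple.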
Re^2: Easy XML-parser that can handle large file? by Discipulus (Canon) on Sep 11, 2014 at 08:20 UTC
.. I'm so slow in responding... This is my best attempt with your data (it can surely be improved):

UPDATE: the code was broken, updated...

```perl
use XML::Twig;

# $xml is assumed to hold the XML document from the question as a string.
my $t= XML::Twig->new(
    pretty_print => 'indented',
    twig_handlers => {
        'product'=>sub{
            my @pname = $_[1]->get_xpath('name');
            my @pids = $_[1]->get_xpath('product_id');
            print $pids[0]->text," - ",$pname[0]->text,"\n";
            my %h;
            # map group ids to group names
            my @ids = $_[1]->get_xpath('attributes/attribute/group/id');
            my @names = $_[1]->get_xpath('attributes/attribute/group/name');
            @h{map {$_->text} @ids } = map {$_->text} @names ;
            # map value ids to values; note that all empty <id> elements
            # share the key '', so only the last such value survives
            my @vids = $_[1]->get_xpath('attributes/attribute/value/id');
            my @values = $_[1]->get_xpath('attributes/attribute/value/value');
            @h{map {$_->text} @vids } = map {$_->text} @values ;
            print map {"\t$_ - $h{$_}\n"} keys %h;
            print "\n\n";
        }
    }
);
$t->parse($xml);
```

####OUTPUT

```
ABC123 - My product
	 - 12.1998
	1561 - Längd (i mm)
	1507 - Engines
	1498 - Year model
	12033 - Vehicle equipment
	12019 - Maybe
	1518 - Year model (to)
	301 - Generator


XYZ789 - My product
	 - 12.1992
	1507 - Engines
	1498 - Year model
	1518 - Year model (to)
	301 - Generator
```
Re^3: Easy XML-parser that can handle large file? by mirod (Canon) on Sep 12, 2014 at 05:37 UTC
Nice. In case you, or the reader, don't know: in handlers `$_` is aliased to `$_[1]`, so you can write `$_->get_xpath(...)` instead of `$_[1]->get_xpath(...)`. Beyond saving 3 characters each time, I am used to `$_` meaning "the current element" within a handler, and I find it easier to read.
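A small sketch of that aliasing, assuming a one-product document made up for the illustration (the `@names` array is likewise only there to hold the result):

```perl
use strict;
use warnings;
use XML::Twig;

my @names;    # illustration only: collects the product names found

my $twig = XML::Twig->new(
    twig_handlers => {
        product => sub {
            # Inside a handler, $_ is the element that triggered it,
            # i.e. the same object that is passed as $_[1].
            push @names, $_->first_child_text('name');
        },
    },
);

# Hypothetical one-product document, just enough to trigger the handler.
$twig->parse('<products><product><name>My product</name></product></products>');

print "$_\n" for @names;
```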
Re^4: Easy XML-parser that can handle large file? by Discipulus (Canon) on Sep 12, 2014 at 07:42 UTC
Re^5: Easy XML-parser that can handle large file? by mirod (Canon) on Sep 12, 2014 at 08:30 UTC
Re: Easy XML-parser that can handle large file? ( XML::Rules ) by Anonymous Monk on Sep 11, 2014 at 07:13 UTC
Instead of XML::Simple, use XML::Rules; see more about xml rules. Then there are XML::Twig Quick Reference and XML::LibXML. HTML::TreeBuilder::XPath or XML::LibXML can be combined with tools like xpather.pl/htmltreexpather.pl, which can give you paths to start with. See also all the links in Re: Retrieve select information from HTML; they're examples (for tree-xpath and others), walkthroughs and tutorials ...
Re^2: Easy XML-parser that can handle large file? by choroba (Cardinal) on Sep 11, 2014 at 07:52 UTC
To use XML::LibXML on large files, use the pull parser XML::LibXML::Reader. For example, the following script (collapsed in the original post: "Read more... (1325 Bytes)") produces the following output:

```
ABC123
group [ 1507 : Engines ]
value [ 301 : Generator ]
group [ 1561 : Längd (i mm) ]
value [ : 2625 ]
group [ 1498 : Year model ]
value [ : 01.1994 ]
group [ 1518 : Year model (to) ]
value [ : 12.1998 ]
group [ 12033 : Vehicle equipment ]
value [ 12019 : Maybe ]
XYZ789
group [ 1507 : Engines ]
value [ 301 : Generator ]
group [ 1498 : Year model ]
value [ : 01.1985 ]
group [ 1518 : Year model (to) ]
value [ : 12.1992 ]
```

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
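Since the script itself is collapsed in the post, here is a rough sketch of how a pull parser loop with XML::LibXML::Reader might look. This is not choroba's script, only an illustration under assumptions: the inline sample is a hypothetical, shortened version of the feed from the question, the print format merely imitates the output shown, and `@ids` exists only to make the result inspectable. For a real file, `location => $path` would presumably replace `string => $xml`.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical, shortened sample based on the structure shown in the question.
my $xml = <<'XML';
<feed>
  <products>
    <product>
      <product_id>ABC123</product_id>
      <attributes>
        <attribute>
          <group><id>1507</id><name>Engines</name></group>
          <value><id>301</id><value>Generator</value></value>
        </attribute>
      </attributes>
    </product>
    <product>
      <product_id>XYZ789</product_id>
      <attributes>
        <attribute>
          <group><id>1498</id><name>Year model</name></group>
          <value><id></id><value>01.1985</value></value>
        </attribute>
      </attributes>
    </product>
  </products>
</feed>
XML

my @ids;    # illustration only: product ids in document order

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "Cannot create reader\n";

# nextElement('product') moves the reader forward to the next <product>;
# it returns 1 while such an element is found.
while ($reader->nextElement('product') == 1) {
    # Build a DOM subtree for this single product only.
    my $product = $reader->copyCurrentNode(1);

    my $id = $product->findvalue('product_id');
    print "$id\n";
    for my $attr ($product->findnodes('attributes/attribute')) {
        printf "group [ %s : %s ]\n",
            $attr->findvalue('group/id'), $attr->findvalue('group/name');
        printf "value [ %s : %s ]\n",
            $attr->findvalue('value/id'), $attr->findvalue('value/value');
    }
    push @ids, $id;
}
```

The idea is that only one `<product>` subtree is turned into DOM nodes at a time, so the usual XPath-style calls stay available without loading the whole document.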
Re: Easy XML-parser that can handle large file? by jellisii2 (Hermit) on Sep 11, 2014 at 11:51 UTC
Long live mirod!
May $DEITY bless his name!
Long live twig!
Helping XML be sane!

I may be a tiny bit biased...
