PerlMonks  

Easy XML-parser that can handle large file?

by DreamT (Pilgrim)
on Sep 11, 2014 at 07:08 UTC

DreamT has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm mainly used to working with XML::Simple, so my XML parsing skills are what you could call "novice" ;-)
However, I now have a problem: I need to process a fairly large file (12.2 MB), and XML::Simple croaked with a "killed" message. I also tried XML::Bare, which worked great on my local computer, but on the server it croaked with "Segmentation fault".
I suspect that the file is too large for these modules to process.
So, here are my questions:
1. Do you know how I can tweak the modules above to optimize their performance?
2. If not, what other module can do the job? I tried XML::Parser, but frankly I couldn't find a good way to browse the data; I simply didn't "get" how to use it well :) (I'm used to accessing the data the way XML::Simple/XML::Bare serve it.)

Example data below. I want to browse each -product- to fetch -product_id- and loop over -attributes- to get the values of each -attribute- tag.
<?xml version="1.0" encoding="ISO-8859-1"?>
<feed>
  <timestamp>Thu, 11 Sep 2014 08:58:59 +0200</timestamp>
  <language_product_id>sv</language_product_id>
  <products>
    <product>
      <product_id>ABC123</product_id>
      <name>My product</name>
      <attributes>
        <attribute>
          <group>
            <id>1507</id>
            <name>Engines</name>
          </group>
          <value>
            <id>301</id>
            <value>Generator</value>
          </value>
        </attribute>
        <attribute>
          <group>
            <id>1561</id>
            <name>Längd (i mm)</name>
          </group>
          <value>
            <id></id>
            <value>2625</value>
          </value>
        </attribute>
        <attribute>
          <group>
            <id>1498</id>
            <name>Year model</name>
          </group>
          <value>
            <id></id>
            <value>01.1994</value>
          </value>
        </attribute>
        <attribute>
          <group>
            <id>1518</id>
            <name>Year model (to)</name>
          </group>
          <value>
            <id></id>
            <value>12.1998</value>
          </value>
        </attribute>
        <attribute>
          <group>
            <id>12033</id>
            <name>Vehicle equipment</name>
          </group>
          <value>
            <id>12019</id>
            <value>Maybe</value>
          </value>
        </attribute>
      </attributes>
      <references />
    </product>
    <product>
      <product_id>XYZ789</product_id>
      <name>My product</name>
      <attributes>
        <attribute>
          <group>
            <id>1507</id>
            <name>Engines</name>
          </group>
          <value>
            <id>301</id>
            <value>Generator</value>
          </value>
        </attribute>
        <attribute>
          <group>
            <id>1498</id>
            <name>Year model</name>
          </group>
          <value>
            <id></id>
            <value>01.1985</value>
          </value>
        </attribute>
        <attribute>
          <group>
            <id>1518</id>
            <name>Year model (to)</name>
          </group>
          <value>
            <id></id>
            <value>12.1992</value>
          </value>
        </attribute>
      </attributes>
      <references />
    </product>
  </products>
</feed>

Replies are listed 'Best First'.
Re: Easy XML-parser that can handle large file?
by Corion (Pope) on Sep 11, 2014 at 07:22 UTC

    As an alternative to XML::Rules, there is also XML::Twig, which is basically the same idea with a slightly different interface. It requires XML::Parser as a prerequisite.

    If you want to see all the XPath expressions that occur in an XML file, here's a program I use to find the structure of an XML file in the absence of an XSD:

    #!perl
    use strict;
    use XML::Twig;

    my %path;

    sub handle_tag {
        my( $twig )= @_;
        my $tag= $_;
        my $path= $tag->path( );
        print $path, "\n" unless $path{ $path }++;
        $tag->purge;
    };

    my $twig=XML::Twig->new(
        twig_handlers => {
            _all_ => \&handle_tag,
        },
    );
    $twig->parsefile( $ARGV[0] );
    print "\n-------\n\n";
    for my $k (sort keys %path) {
        print "$k\t$path{ $k }\n";
    };
Re: Easy XML-parser that can handle large file?
by Discipulus (Abbot) on Sep 11, 2014 at 07:32 UTC
      ...I'm so slow in responding...
      This is my best attempt with your data (it can surely be improved). UPDATE: the code was broken; it has been updated.
      my $t= XML::Twig->new(
          pretty_print  => 'indented',
          twig_handlers => {
              'product' => sub {
                  my @pname = $_[1]->get_xpath('name');
                  my @pids  = $_[1]->get_xpath('product_id');
                  print $pids[0]->text," - ",$pname[0]->text,"\n";
                  my %h;
                  my @ids   = $_[1]->get_xpath('attributes/attribute/group/id');
                  my @names = $_[1]->get_xpath('attributes/attribute/group/name');
                  @h{map {$_->text} @ids } = map {$_->text} @names ;
                  my @vids   = $_[1]->get_xpath('attributes/attribute/value/id');
                  my @values = $_[1]->get_xpath('attributes/attribute/value/value');
                  @h{map {$_->text} @vids } = map {$_->text} @values ;
                  print map {"\t$_ - $h{$_}\n"} keys %h;
                  print "\n\n";
              }
          }
      );
      $t->parse($xml);

      ####OUTPUT
      ABC123 - My product
               - 12.1998
              1561 - Lõngd (i mm)
              1507 - Engines
              1498 - Year model
              12033 - Vehicle equipment
              12019 - Maybe
              1518 - Year model (to)
              301 - Generator


      XYZ789 - My product
               - 12.1992
              1507 - Engines
              1498 - Year model
              1518 - Year model (to)
              301 - Generator
      HtH
      L*

      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

        Nice.

        In case you, or the reader, don't know: in handlers $_ is aliased to $_[1], so you can write $_->get_xpath(...) instead of $_[1]->get_xpath(...). Beyond saving 3 characters each time, I am used to $_ meaning "the current element" within a handler, and I find it easier to read.
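
The $_ tip can be illustrated with a minimal sketch. It assumes a trimmed, ASCII-only copy of the sample feed and a hypothetical helper named each_product (not part of XML::Twig) that hands each product to a callback as a plain hash. The handler uses $_ for the current element and purges the twig after every product, which is what should keep memory use flat on a large file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Trimmed, ASCII-only version of the sample feed from the question.
my $SAMPLE_XML = <<'XML';
<feed>
  <products>
    <product>
      <product_id>ABC123</product_id>
      <name>My product</name>
      <attributes>
        <attribute>
          <group><id>1507</id><name>Engines</name></group>
          <value><id>301</id><value>Generator</value></value>
        </attribute>
        <attribute>
          <group><id>1498</id><name>Year model</name></group>
          <value><id></id><value>01.1994</value></value>
        </attribute>
      </attributes>
    </product>
    <product>
      <product_id>XYZ789</product_id>
      <name>My product</name>
      <attributes>
        <attribute>
          <group><id>1507</id><name>Engines</name></group>
          <value><id>301</id><value>Generator</value></value>
        </attribute>
      </attributes>
    </product>
  </products>
</feed>
XML

# Hypothetical helper (an assumption, not an XML::Twig function): calls
# $callback once per <product>, then purges what has been parsed so far.
sub each_product {
    my ($xml, $callback) = @_;
    my $twig = XML::Twig->new(
        twig_handlers => {
            product => sub {
                my ($t) = @_;    # $_[0] is the twig itself
                # $_ is the current <product> element, same as $_[1]
                my %record = (
                    id         => $_->first_child_text('product_id'),
                    name       => $_->first_child_text('name'),
                    attributes => [],
                );
                for my $attr ( $_->get_xpath('attributes/attribute') ) {
                    my $group = $attr->first_child('group');
                    my $value = $attr->first_child('value');
                    push @{ $record{attributes} }, {
                        group_id   => $group->first_child_text('id'),
                        group_name => $group->first_child_text('name'),
                        value_id   => $value->first_child_text('id'),
                        value      => $value->first_child_text('value'),
                    };
                }
                $callback->( \%record );
                $t->purge;       # release the elements parsed so far
            },
        },
    );
    $twig->parse($xml);          # for a file on disk, parsefile($path) instead
    return;
}

# Demo: print each product followed by its attribute names and values.
each_product( $SAMPLE_XML, sub {
    my ($p) = @_;
    print "$p->{id} - $p->{name}\n";
    print "\t$_->{group_name}: $_->{value}\n" for @{ $p->{attributes} };
} );
```

Passing a callback rather than returning one big list is a deliberate choice here: collecting every product in memory would give back much of what the purge saves.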

Re: Easy XML-parser that can handle large file? ( XML::Rules )
by Anonymous Monk on Sep 11, 2014 at 07:13 UTC
      To use XML::LibXML on large files, use the pull parser XML::LibXML::Reader. For example, the following script

      produces the following output:

      ABC123
          group [ 1507 : Engines ]
          value [ 301 : Generator ]
          group [ 1561 : Längd (i mm) ]
          value [  : 2625 ]
          group [ 1498 : Year model ]
          value [  : 01.1994 ]
          group [ 1518 : Year model (to) ]
          value [  : 12.1998 ]
          group [ 12033 : Vehicle equipment ]
          value [ 12019 : Maybe ]
      XYZ789
          group [ 1507 : Engines ]
          value [ 301 : Generator ]
          group [ 1498 : Year model ]
          value [  : 01.1985 ]
          group [ 1518 : Year model (to) ]
          value [  : 12.1992 ]
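
The pull-parser approach named in this reply can be sketched as follows. This is not the script that produced the output above, only a minimal illustration under stated assumptions: a trimmed, ASCII-only copy of the sample feed, and a hypothetical helper named for_each_product (not part of XML::LibXML). The reader advances to each product element, and only that one subtree is copied into a small DOM for XPath queries.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Trimmed, ASCII-only version of the sample feed from the question.
my $SAMPLE_XML = <<'XML';
<feed>
  <products>
    <product>
      <product_id>ABC123</product_id>
      <attributes>
        <attribute>
          <group><id>1507</id><name>Engines</name></group>
          <value><id>301</id><value>Generator</value></value>
        </attribute>
        <attribute>
          <group><id>1498</id><name>Year model</name></group>
          <value><id></id><value>01.1994</value></value>
        </attribute>
      </attributes>
    </product>
    <product>
      <product_id>XYZ789</product_id>
      <attributes>
        <attribute>
          <group><id>1507</id><name>Engines</name></group>
          <value><id>301</id><value>Generator</value></value>
        </attribute>
      </attributes>
    </product>
  </products>
</feed>
XML

# Hypothetical helper (an assumption, not an XML::LibXML function).
# $source is an array ref of reader constructor arguments, e.g.
# [ string => $xml ] or [ location => 'feed.xml' ].
sub for_each_product {
    my ($source, $callback) = @_;
    my $reader = XML::LibXML::Reader->new( @$source )
        or die "Cannot create XML reader\n";

    # nextElement('product') skips ahead to the next <product> start tag.
    while ( $reader->nextElement('product') == 1 ) {
        # Deep-copy only this product into a small DOM subtree.
        my $product = $reader->copyCurrentNode(1);
        my %record  = (
            id         => $product->findvalue('product_id'),
            attributes => [],
        );
        for my $attr ( $product->findnodes('attributes/attribute') ) {
            push @{ $record{attributes} }, {
                group_id   => $attr->findvalue('group/id'),
                group_name => $attr->findvalue('group/name'),
                value_id   => $attr->findvalue('value/id'),
                value      => $attr->findvalue('value/value'),
            };
        }
        $callback->( \%record );
    }
    return;
}

# Demo: print each product id followed by its group/value pairs.
for_each_product( [ string => $SAMPLE_XML ], sub {
    my ($p) = @_;
    print "$p->{id}\n";
    for my $a ( @{ $p->{attributes} } ) {
        print "    group [ $a->{group_id} : $a->{group_name} ]\n";
        print "    value [ $a->{value_id} : $a->{value} ]\n";
    }
} );
```

For the real 12.2 MB feed, passing [ location => $path ] instead of a string should let the reader stream from disk rather than hold the whole document in memory.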
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Easy XML-parser that can handle large file?
by jellisii2 (Hermit) on Sep 11, 2014 at 11:51 UTC
    Long live mirod! May $DEITY bless his name!
    Long live twig! Helping XML be sane!

    I may be a tiny bit biased...
