Re: Easily XML filtering?

by Jenda (Abbot)
on May 06, 2008 at 22:36 UTC (#685084)

in reply to Easily XML filtering?

use strict;
use XML::Rules;

my $parser = XML::Rules->new(
    rules => {
        # drop every tag that has no rule of its own
        _default => '',
        # comma-separated list of the tags to keep, content and all
        'tags,to,keep' => 'raw',
        # return the company tag with whatever the rules above left in it
        company => sub {$_[0] => $_[1]},
    },
    style => 'filter',
);

$parser->filter(\*DATA);

__END__
<root>
  <other><some>blah</some>foo</other>
  <company>
    <tags>xxx</tags>
    <tags>yyy</tags>
    <skip x="1">aaa</skip>
  </company>
  <company name="PerlSoft">
    <tags>xxx</tags>
    <tags>yyy</tags>
    <skip x="1">aaa</skip>
  </company>
</root>

Change the list of tags to keep and the name of the repeated company tag to whatever you need and you should be done ;-)

As you can see, the filtering is done only under the <company> tag. If you need to do it inside several tags, just specify their names separated by commas, just like the list of tags to keep. Keep in mind, though, that while the file is being processed the data inside each of the specified tags is accumulated in memory in case the rule (the anonymous subroutine) needs to make changes to it, so the contents of each such tag should fit easily in memory. That's why I did not specify the subroutine rule for the root tag, but rather for the individual company.
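A minimal sketch of the several-tags case, assuming a second repeated container called <partner> (a made-up name for illustration, as is the sample XML). The output goes to an in-memory filehandle here only so that the result can be inspected afterwards:

```perl
use strict;
use warnings;
use XML::Rules;

# Sample input; <partner> is a hypothetical second container tag.
my $xml = <<'XML';
<root>
  <company name="PerlSoft"><tags>xxx</tags><skip x="1">aaa</skip></company>
  <partner name="Acme"><tags>yyy</tags><skip x="2">bbb</skip></partner>
</root>
XML

my $parser = XML::Rules->new(
    rules => {
        _default => '',       # drop tags without a rule of their own
        tags     => 'raw',    # keep <tags> with its content
        # one subroutine rule shared by both container tags
        'company,partner' => sub { $_[0] => $_[1] },
    },
    style => 'filter',
);

# Collect the filtered XML in $filtered instead of printing it directly.
open my $out, '>', \my $filtered or die "Cannot open in-memory handle: $!";
$parser->filter($xml, $out);
close $out;

print $filtered;
```

Only one <company> or <partner> subtree has to be held in memory at a time, which is the reason for putting the subroutine rule on those tags and not on <root>.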

Update: Of course you can do the insert(s)/update(s) at the same time as the filtering. You'd just have to specify what data you want from which tag and how to include it in the data structure being built. Then either copy the data to the database in the rule for the company tag, or insert each tag that maps to a table as soon as it is fully parsed and replace that tag's data with just the ID, to be used when the parent tag is copied.
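A rough sketch of the first variant (collecting the data in the rule for the company tag while the filtered XML is still printed). The @rows array is only a stand-in for the database, and the field names are made up; in real code you would presumably call something like a prepared DBI statement's execute() at that point. It also assumes that 'raw' children show up in the parent's _content as [tagname, {attributes and _content}] pairs:

```perl
use strict;
use warnings;
use XML::Rules;

my $xml = <<'XML';
<root>
  <company>
    <tags>xxx</tags>
    <tags>yyy</tags>
    <skip x="1">aaa</skip>
  </company>
  <company name="PerlSoft">
    <tags>xxx</tags>
    <tags>yyy</tags>
    <skip x="1">aaa</skip>
  </company>
</root>
XML

# Hypothetical stand-in for a database table: one hash per company.
my @rows;

my $parser = XML::Rules->new(
    rules => {
        _default => '',
        tags     => 'raw',
        company  => sub {
            my ($tag, $attrs) = @_;
            my $content = $attrs->{_content};

            # Pick the text of the <tags> children out of the raw content,
            # skipping the whitespace strings between them.
            my @tags = map  { $_->[1]{_content} }
                       grep { ref($_) eq 'ARRAY' && $_->[0] eq 'tags' }
                       (ref($content) eq 'ARRAY' ? @$content : ());

            # This is where the insert/update would go.
            push @rows, { name => $attrs->{name}, tags => \@tags };

            # Still return the tag so the filter prints it.
            return $tag => $attrs;
        },
    },
    style => 'filter',
);

$parser->filter($xml);    # filtered XML goes to STDOUT
```

After the run, @rows holds one entry per <company>, in document order.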

Replies are listed 'Best First'.
Re^2: Easily XML filtering?
by mattr (Curate) on May 07, 2008 at 18:10 UTC
    Wow, that is really cool. Thanks! (And thanks to everyone else too!) Your update about doing db updates at the same time as filtering is also pertinent.

    Yes, I will have to study the module to see how to specify rules for subtrees nested inside the Company tag. My description wasn't so great. My data is basically a Feed of many Company twigs, each wrapped in an Entity.

    <Feed>
      <Entity>
        <Company>
          <Identity>
            <Address>...
          </Identity>
          <Executives>
            <Section>
              <Executive>
              <Executive>
            </Section>
            ...
        </Company>
      </Entity>
      <Entity>
        <Company>
        ...
    and so on.
