PerlMonks  

Easily XML filtering?

by mattr (Curate)
on May 06, 2008 at 17:52 UTC ( [id://685016]=perlquestion )

mattr has asked for the wisdom of the Perl Monks concerning the following question:

Cherished Monks,

I have an XML file describing hundreds of companies that I'll pull down by ftp, and the provider may add new tags. Provider says to be sure to filter out new tags so your app doesn't crash. Okay, so I need to pull each company subtree off the feed, and I suppose for each company then I need to filter it in some way before putting the data in a database.

I don't have tons of memory so it seems I'd use XML::Twig or a SAX module to pull a single company off the file at a time. Then it would be nice if I could just filter a company with a single command that I hand a brief template, and then validate to a DTD perhaps before putting the data in the database.

To filter it should I use SAX or maybe XSLT / XPathScript? Maybe I could validate to a DTD and just skip errors? Would Pyx be easiest? I can't tell if these tools can do what I want and documentation doesn't really cover my task.

I'd like to just simply list the tags I am expecting and have everything else filtered out. Conceptually I thought I ought to be able to specify a template xml file and use that as a filter. I'd like to not have to write tons of filter code as it seems like a common enough task, and undoubtedly libXML can do it but I can't find enough docs about that either.

Then I need to dump the company twig into a DBIx::Class based database that has tables and foreign keys set up as expected from the XML. Again, I wish there was a quick way to do it.

Thanks for your meditation on my task!

Matt R

Replies are listed 'Best First'.
Re: Easily XML filtering?
by Jenda (Abbot) on May 06, 2008 at 22:36 UTC
    use strict;
    use XML::Rules;

    my $parser = XML::Rules->new(
        rules => {
            _default => '',
            'tags,to,keep' => 'raw',
            company => sub {$_[0] => $_[1]},
        },
        style => 'filter',
    );
    $parser->filter(\*DATA);

    __END__
    <root>
      <other><some>blah</some>foo</other>
      <company>
        <tags>xxx</tags>
        <tags>yyy</tags>
        <skip x="1">aaa</skip>
      </company>
      <company name="PerlSoft">
        <tags>xxx</tags>
        <tags>yyy</tags>
        <skip x="1">aaa</skip>
      </company>
    </root>

    Change the list of tags to keep and the name of the repeated company tag to whatever you need and you should be done ;-)

    As you can see, the filtering is done only under the <company> tag. If you need to do it inside several tags, just specify their names separated by commas, just like the list of tags to accept. Keep in mind, though, that while the file is processed the data inside each of the specified tags is accumulated in case the rule (the anonymous subroutine) needs to change it, so the contents of those tags should fit easily in memory. That's why I specified the subroutine rule for the individual company rather than for the root tag.

    Update: Of course you can do the insert(s)/update(s) at the same time as the filtering. You'd just have to specify which data you want from which tag and how to include it in the data structure being built. Then either copy the data to the database in the rule for the company tag, or insert it as soon as each tag that maps to a table is fully parsed and replace that tag's data with just the ID to be used when copying the parent tag.
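
A minimal sketch of that idea, under some assumptions: the tag names `Company`, `Name` and `City` are hypothetical stand-ins for whatever the feed really contains, and the `push` marks the spot where a database insert would go. It uses XML::Rules in its normal (non-filter) mode, so unlisted tags are simply dropped and each company is handed to the rule as a small hash.

```perl
use strict;
use warnings;
use XML::Rules;

# Collect one hash per <Company>, keeping only whitelisted child tags.
# Tag names here are hypothetical; adjust them to the real feed.
sub collect_companies {
    my ($xml) = @_;
    my @rows;
    my $parser = XML::Rules->new(
        rules => {
            _default    => '',          # ignore every tag not listed below
            'Name,City' => 'content',   # pass the text up to the parent as tag => text
            Company     => sub {
                my ($tag, $attrs) = @_;
                # A database insert could replace this push.
                push @rows, { name => $attrs->{Name}, city => $attrs->{City} };
                return;                 # pass nothing further up the tree
            },
        },
    );
    $parser->parse($xml);
    return \@rows;
}
```

Because the company rule returns nothing, only one company's data is held at a time, which matches the memory concern in the question.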

      Wow that is really cool. Thanks! (And thank everyone else too!) Your update about doing db updates at same time as filtering is also pertinent.

      Yes I will have to study the module to see how to specify rules for subtrees nested inside the Company tag. My description wasn't so great. My data is basically a Feed of many Company twigs each wrapped in an Entity.

      <Feed>
        <Entity>
          <Company>
            <Identity>
              <Address>...
            </Identity>
            <Executives>
              <Section>
                <Executive>
                <Executive>
              </Section>
              ...
          </Company>
        </Entity>
        <Entity>
          <Company>
          ...
      and so on.
Re: Easily XML filtering?
by mr_mischief (Monsignor) on May 06, 2008 at 18:10 UTC
    If you're using any API that asks specifically for a particular tag by name to get its associated data, then "filtering" is really a passive thing. Just don't ask for the content of tags your app doesn't use.

    XML::Simple for example should allow you to grab your file, get the data you need, and simply ignore everything you don't need. Other modules that allow you to extract specific parts of the XML tree should work, too.
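
As a rough sketch of this "just don't ask" approach, assuming hypothetical tag names (`Company`, `Name`, `City`) and a feed small enough to load whole, since XML::Simple reads the entire document into memory:

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Hypothetical list of the fields the application knows about.
my @WANTED = qw(Name City);

# Return one hash per <Company>, containing only the wanted fields.
sub wanted_fields {
    my ($xml) = @_;
    # ForceArray keeps Company a list even if there is only one;
    # KeyAttr => [] turns off folding of lists into hashes.
    my $data = XMLin($xml, ForceArray => ['Company'], KeyAttr => []);
    my @rows;
    for my $company (@{ $data->{Company} || [] }) {
        my %row;
        for my $field (@WANTED) {
            $row{$field} = $company->{$field} if exists $company->{$field};
        }
        push @rows, \%row;
    }
    return \@rows;
}
```

Any tag the provider adds later still lands in the parsed structure but is never copied into a row, so the rest of the application does not see it.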

Re: Easily XML filtering?
by Herkum (Parson) on May 06, 2008 at 18:13 UTC

    XML::Twig will allow you to flush() the elements that you have finished processing, thereby allowing you to free up memory. Twig has a lot of documentation, but that is because it is really flexible. Take a serious look at it and it should be what you want.
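
A small sketch of how that might look, assuming hypothetical tag names (`Company`, `Name`, `City`). It uses purge() rather than flush(): both release the parts of the tree already processed, but flush() also prints them, which is not wanted when the data is headed for a database.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical whitelist of child tags to keep from each company.
my @KEEP = qw(Name City);

# Parse the feed one <Company> at a time and return the kept fields.
sub extract_companies {
    my ($xml) = @_;
    my @records;
    my $twig = XML::Twig->new(
        twig_handlers => {
            Company => sub {
                my ($t, $company) = @_;
                my %record;
                for my $tag (@KEEP) {
                    my $child = $company->first_child($tag);
                    $record{$tag} = $child->text if defined $child;
                }
                push @records, \%record;
                $t->purge;    # discard everything parsed so far
            },
        },
    );
    $twig->parse($xml);
    return \@records;
}
```

For a file on disk, `$twig->parsefile($path)` would take the place of `parse`, and the `push` could become the database insert so that the records never accumulate either.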

      You're right, I will.

      Regards from Tokyo,

      Matt

Re: Easily XML filtering?
by gloryhack (Deacon) on May 06, 2008 at 19:46 UTC
    You might want to have a look at XML::Descent. It's a stream parser so it's (relatively) light on memory usage, and it makes it easy to decide what to keep and what to ignore. It's not quite as simple as listing the elements you want, but it can get pretty darn close to that depending upon how you employ the module.
      Thank you very much, will check it out.

      Matt

Re: Easily XML filtering?
by GrandFather (Saint) on May 06, 2008 at 22:48 UTC

    Set up twig handlers for them. Consider:

    use strict;
    use warnings;
    use XML::Twig;

    my $xml = <<XML;
    <companies>
      <Acme>Acme</Acme>
      <Missing>Missing</Missing>
      <Wobbler>Wobbler</Wobbler>
      <Trump>Trump</Trump>
      <Tibble>Tibble</Tibble>
    </companies>
    XML

    my @companyList = qw(Acme Wobbler Tibble);
    my %handlers = map {$_ => \&HandleCompany} @companyList;
    my $twig = XML::Twig->new (twig_handlers => \%handlers);

    $twig->parse ($xml);

    sub HandleCompany {
        print $_->text (), "\n";
    }

    Prints:

    Acme
    Wobbler
    Tibble

    Perl is environmentally friendly - it saves trees
      Thank you very much, I'll study Twig too for sure.

      Matt
