PerlMonks  

The best way to handle different type of XML files

by mahira (Acolyte)
on Nov 21, 2009 at 12:27 UTC (#808576=perlquestion)
mahira has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!,

Currently I am working on a project that will handle several XML files from different sources with different formats (using XML::Simple).

I am trying to handle all of them with a single piece of code, but it is hard because the nodes are different, the depth is different, etc.

I am looking for advice. Is XML::Simple the right tool?

Thanks,

Mahir

PS: for those who are interested, a little more detail: the project is about gathering XML-based product data from different vendors and integrating it into our own MySQL db. We have been successful so far, but since it is becoming harder to manage different vendors with different scripts, I am trying to reach a more generic solution.

Re: The best way to handle different type of XML files
by toolic (Chancellor) on Nov 21, 2009 at 13:30 UTC

      XML::Simple can't even handle elements that can repeat a variable number of times without making a mess. You need to hold its hand or you won't be able to predict the structure you'll get. It's much less trouble to get a consistent tree every time. That's why I use (the much faster) XML::LibXML.

      As for the variations between the document formats, it becomes a question of using the right XPath for the document in question. You can use a lookup table for that.

      my $xpath = $xpath_by_doctype{$doctype};

      Other solutions based on XPath (like XML::Twig) should do fine too.
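
      The lookup-table idea above could be fleshed out roughly as follows. This is only a sketch: the two vendor formats, the doctype keys (taken here from the root element name), and the element names are made up for illustration, and it assumes XML::LibXML is installed.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical vendor formats: the doctype keys, element names and XPath
# expressions below are invented for this example.
my %xpath_by_doctype = (
    catalog => {
        product => '/catalog/product',
        name    => './name',
        price   => './price',
    },
    feed => {
        product => '/feed/items/item',
        name    => './@title',
        price   => './cost/amount',
    },
);

# Returns a list of { name => ..., price => ... } hashes, whatever the format.
sub extract_products {
    my ($xml) = @_;
    my $doc     = XML::LibXML->load_xml(string => $xml);
    my $doctype = $doc->documentElement->nodeName;    # root element picks the table
    my $xpath   = $xpath_by_doctype{$doctype}
        or die "No XPath table for document type '$doctype'\n";

    my @products;
    for my $node ($doc->findnodes($xpath->{product})) {
        # The name/price expressions are evaluated relative to each product node.
        push @products, {
            name  => $node->findvalue($xpath->{name}),
            price => $node->findvalue($xpath->{price}),
        };
    }
    return @products;
}

my @from_catalog = extract_products(
    '<catalog><product><name>Widget</name><price>9.99</price></product></catalog>'
);
my @from_feed = extract_products(
    '<feed><items><item title="Gadget"><cost><amount>19.50</amount></cost></item></items></feed>'
);

printf "%s costs %s\n", $_->{name}, $_->{price} for @from_catalog, @from_feed;
```

      Supporting a new vendor then means adding one entry to the table rather than writing another script.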

        I've found XML::Simple is really useful for prototyping a solution, frequently with either a dumbed-down schema or a subset of a real document. It lets me stub out the XML bit of the program so I can get the rest of the logic flowing.

        After that, I usually end up rewriting that code to use XML::XPath or its ilk.

        I prefer to use XML::Simple when possible, because it takes a lot less care and feeding in the simplest cases. But like you pointed out, it blows up pretty quickly once the document is more than trivial.

        Of course, it sounds like the OP has gotten past the prototype stage already. Frankly, it sounds like the project got a lot farther on XML::Simple than I would have expected.

Re: The best way to handle different type of XML files
by Tanktalus (Canon) on Nov 21, 2009 at 15:20 UTC

    XML::Twig + XPath (either the XPath built into XML::Twig, or, if you need a bit more, XML::XPath will work with twig objects). Suddenly, you won't have to care about depth, just names. You may have to care about multiple names for a given value, such as //middle//name, or just //name, or include attributes or whatever, but it's all hierarchies of names, regardless of depth.

    Of course, it's always nice to have a standard format for all your vendors to use, but, unless you're Walmart, good luck with that. :-)

Re: The best way to handle different type of XML files
by 7stud (Deacon) on Nov 21, 2009 at 16:38 UTC

    Currently I am working on a project that will handle several XML files from different sources with different formats. I am trying to handle all of them with a single piece of code but it is hard because nodes are different, depth is different etc.

    I don't see how that is possible. If you have one XML file that has a tag called <super_duper_product> nested inside one other tag, and another XML file that has a tag called <item89001> nested inside three tags, I don't think there is any way you can use the same script to extract both tags. There has got to be some pattern you can exploit, either the tags have similar names, or they have similar locations in the document tree, or they have identical siblings or child elements, or similar text. Something.

    As the first responders noted, XPath makes it easy to find a specific tag name anywhere in the document. XPath lets you treat an XML file as if it were a directory on your file system. You locate elements using path notation: /bookstore/book/title. XPath also lets you skip the leading 'directories' by starting the path with //, like this:

    findnodes("//book");

    which searches for all <book> elements anywhere in the document. Note that the XPath uses the bare element name, without angle brackets. The XML::LibXML module provides that findnodes method, which takes an XPath expression for the search.
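
    As a small illustration of that depth-independent search, here is a sketch using XML::LibXML. The two documents and their element names are invented for the example; the point is only that the same //book expression finds the element whether it sits one level down or three.

```perl
use strict;
use warnings;
use XML::LibXML;

# Two made-up documents that nest <book> at different depths.
my @documents = (
    '<bookstore><book><title>Learning Perl</title></book></bookstore>',
    '<shop><section><shelf><book><title>Programming Perl</title></book></shelf></section></shop>',
);

my @titles;
for my $xml (@documents) {
    my $doc = XML::LibXML->load_xml(string => $xml);

    # '//book' matches <book> elements anywhere below the root, at any depth.
    for my $book ($doc->findnodes('//book')) {
        # './title' is evaluated relative to the current <book> element.
        push @titles, $book->findvalue('./title');
    }
}

print "$_\n" for @titles;
```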

Re: The best way to handle different type of XML files
by grantm (Parson) on Nov 23, 2009 at 00:12 UTC
Re: The best way to handle different type of XML files
by pajout (Curate) on Nov 23, 2009 at 13:11 UTC
    It is hard to choose a tool, because you never know in advance all the formats you will have to process. That is my experience from a very similar situation.

    The crucial question is "How do I implement my logic on various, mostly unpredictable data structures?". I think that XML::Simple is good for the simplest examples. It takes some experience with the tools, but consider XML::Twig, XML::Rules, tools performing XSLT transformations or, more generally, using the XPath language, iterating over a DOM structure, iterating over the object structure of XML::Trivial (my kid :) or, for instance, the more esoteric STX language, http://stx.sourceforge.net/ .

    In principle, you can save some work by using a "scripting" language oriented toward XML processing (XSLT, STX, XPath), but you may lose some power. The opposite holds when you use Perl to iterate through a Perl data/object representation of the document: more power, but more work.

    Another problem could be the processing of huge documents. In that case XML::Twig or STX can work for you, but consider raw processing of XML::Parser output too.
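
    For the huge-document case, a minimal sketch of XML::Twig's handler-based processing might look like this. The <product> element, its sku attribute, and the <name> child are hypothetical; a real script would more likely call parsefile on the vendor's file instead of parsing an inline string.

```perl
use strict;
use warnings;
use XML::Twig;

my @rows;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <product> element has been parsed.
        product => sub {
            my ($t, $product) = @_;
            push @rows, {
                sku  => $product->att('sku'),
                name => $product->first_child_text('name'),
            };
            # Discard what has been parsed so far, so memory use stays flat
            # however large the document is.
            $t->purge;
        },
    },
);

# Inline sample data; the structure is invented for this example.
$twig->parse(<<'XML');
<catalog>
  <product sku="A1"><name>Widget</name></product>
  <product sku="B2"><name>Gadget</name></product>
</catalog>
XML

print "$_->{sku}: $_->{name}\n" for @rows;
```

    In a real loader the handler would write each row to the database rather than collect everything in @rows.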

Re: The best way to handle different type of XML files
by tfrayner (Curate) on Nov 24, 2009 at 08:59 UTC
    I'm also a fan of XML::LibXML, but I'd also put in a suggestion that you may want to consider combining it with XML::LibXSLT. I've recently been through a similar situation (although I didn't have the luxury of only dealing with XML as input), and I found that life became much simpler when I developed a single unified XML schema that closely reflected my MySQL database table structure, for which I could then easily write a database loader module. At that point it became fairly straightforward to write XSLT stylesheet documents to convert the myriad input XML formats into the database-compliant schema format. It's possible this approach is a bit over-engineered compared to the XML::Simple approach, but then again it is much easier to maintain in the long run.
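
    To make the XSLT approach concrete, here is a rough sketch with XML::LibXML and XML::LibXSLT. The vendor format (<feed>/<items>/<item>) and the unified target format (<products>/<product>) are both invented for illustration and are not the schema described above; each vendor would get its own stylesheet producing the same target format.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Hypothetical stylesheet: maps one vendor's layout onto a unified
# <products>/<product> layout that a single database loader could read.
my $xsl = <<'XSL';
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <products>
      <xsl:for-each select="feed/items/item">
        <product>
          <name><xsl:value-of select="@title"/></name>
          <price><xsl:value-of select="cost/amount"/></price>
        </product>
      </xsl:for-each>
    </products>
  </xsl:template>
</xsl:stylesheet>
XSL

# Made-up vendor document.
my $vendor_xml =
    '<feed><items><item title="Gadget"><cost><amount>19.50</amount></cost></item></items></feed>';

my $xslt       = XML::LibXSLT->new;
my $stylesheet = $xslt->parse_stylesheet(XML::LibXML->load_xml(string => $xsl));
my $source     = XML::LibXML->load_xml(string => $vendor_xml);

# transform() returns an XML::LibXML document in the unified format.
my $result = $stylesheet->transform($source);

print $stylesheet->output_as_chars($result);
```

    The loader then only ever sees the unified format, so adding a vendor means writing a stylesheet, not touching the loading code.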

    Best of luck,

    Tim

Node Type: perlquestion [id://808576]
Approved by biohisham
Front-paged by Tanktalus