Search and Extract from XML when path is unknown

by madbee (Acolyte)
on Jul 10, 2013 at 05:18 UTC (#1043413=perlquestion)
madbee has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Perl Monks!

I have been tasked with developing a search-and-extract process using Perl and XML. It is not my forte, but thanks to you and the perldocs I am able to manage. For standard XML files like this, my process worked like a charm. (I tried both approaches, XML::LibXML and Anonymous Monk's XPath approach, and got the expected results.) The issue is with heterogeneous XML files, where the structure of each file is different and the path to the content I have to search for and extract is unknown.

<root>
  <part>
    <sect>
      <toc>1.1 Design Purpose...</toc>
    </sect>
    <sect>
      <sect>
        <header2>Purpose</header2>
        <tag>3.3 Design Purpose</tag>
        <tag> Design purpose and description </tag>
        <tag>This is a design XZY document for Project </tag>
        <tag>design specification details </tag>
        <tag>3.4 application purpose</tag>
        <tag> app details </tag>
        <tag> more app details</tag>
      </sect>
    </sect>
  </part>
</root>

Given XML like the above, say I have to search for and extract the section "Design Purpose".

1. I only know that every document definitely has root and part tags. The structure of every document is different from the others; perhaps 2 out of 10 docs have a similar structure. Unless I manually review each one, I wouldn't know which are similar and which are not. There are hundreds of docs that need to be processed.

2. Content in each document is nested under multiple nodes. The complete or partial path, and even the node that holds the content I am looking for, is unknown. It can be as shown in the example or take any other form.

3. Content is replicated in multiple sections, including the TOC and bookmarks. Ignoring these may be easy, but if the content is also replicated in sections other than the TOC and bookmarks, I need to identify and extract the exact section.

4. I need to extract only the nodes belonging to the content I am looking for, i.e. only the nodes under "Design Purpose" and nothing from the "application purpose" node onward.

Without knowing where in the document that section is, or under what nodes and tags it may be held (i.e. without knowing the partial or full path to the content), is it possible to develop a generic search-and-extract process using Perl and XML?

Will something like this work?

1. Search for the string and get its node. So, if I'm searching for Design Purpose: find <tag>.

2. Get all the parent nodes until it hits the root.

3. Build the path dynamically. Using the constructed path (say, //root//part//sect//tag from the above example), extract the child elements using XPath or XML::LibXML.

Am I on the right track with this? Any pointers to how this can be done?

Can this task be done easily using Perl/RegEx parsing on text files rather than XML files?

I have to add that some of these XML files are not even trees: they are flat, flanked by tags. All of them were created using PDF "Save As XML".

Appreciate any thoughts in this regard. Apologies in advance if the post is not very clear.

Thanks in advance, madbee
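The three-step idea in the question (find the node by its text, walk up its parents, build the path) could be sketched roughly as below with XML::LibXML. This is only an illustration: the sample is a well-formed adaptation of the one in the post, the restriction to leaf elements is an assumption, and contains() is case-sensitive, so "Design purpose" would not match.

```perl
use strict;
use warnings;
use XML::LibXML;

# Well-formed adaptation of the sample from the question (assumption).
my $xml = <<'XML';
<root>
  <part>
    <sect><toc>1.1 Design Purpose...</toc></sect>
    <sect>
      <sect>
        <header2>Purpose</header2>
        <tag>3.3 Design Purpose</tag>
        <tag> Design purpose and description </tag>
        <tag>3.4 application purpose</tag>
        <tag> app details </tag>
      </sect>
    </sect>
  </part>
</root>
XML

my $doc    = XML::LibXML->load_xml(string => $xml);
my $needle = 'Design Purpose';    # matched case-sensitively

# Step 1: any leaf element (no child elements), at any depth and with any
# tag name, whose text contains the search string.
my @hits = $doc->findnodes(
    qq{//*[not(*)][contains(normalize-space(.), "$needle")]}
);

for my $hit (@hits) {
    # Step 2: walk up the parents until we leave the element tree.
    my @names;
    for (my $n = $hit; $n && $n->isa('XML::LibXML::Element'); $n = $n->parentNode) {
        unshift @names, $n->nodeName;
    }

    # Step 3: nodePath() returns an XPath for the node in a single call,
    # so the manual walk is shown only for comparison.
    printf "%s\n  manual:   /%s\n  nodePath: %s\n",
        $hit->textContent, join('/', @names), $hit->nodePath;
}
```

Both the TOC entry and the real heading are reported here, which is the duplication problem described in point 3; telling them apart needs an extra rule.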

Replies are listed 'Best First'.
Re: Search and Extract from XML when path is unknown
by McA (Priest) on Jul 10, 2013 at 06:52 UTC


    I would also recommend having a look at XML::Parser to solve the issue in a more event-driven way. You scan the tags, and when the content looks like the starting point, you process the following tags and their content as long as you don't leave the "interesting" subtree. That way you don't need to know how deeply nested the structure you're looking for is.

    Best regards
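A minimal sketch of the event-driven approach described above, using XML::Parser handlers. The state names, the rule that a line starting with a section number such as "3.4 " ends the section, and the skipping of toc elements are all assumptions made for illustration; real documents will likely need different rules.

```perl
use strict;
use warnings;
use XML::Parser;

# Well-formed adaptation of the sample from the question (assumption).
my $xml = <<'XML';
<root>
  <part>
    <sect><toc>1.1 Design Purpose...</toc></sect>
    <sect>
      <sect>
        <header2>Purpose</header2>
        <tag>3.3 Design Purpose</tag>
        <tag> Design purpose and description </tag>
        <tag>design specification details </tag>
        <tag>3.4 application purpose</tag>
        <tag> app details </tag>
      </sect>
    </sect>
  </part>
</root>
XML

my $needle = 'Design Purpose';
my $state  = 'seeking';           # seeking -> collecting -> done
my ($depth, $start_depth) = (0, 0);
my $buf = '';                     # text seen since the last start/end tag
my @section;

sub on_start { $depth++; $buf = '' }

# Char may fire several times for one text node, so accumulate.
sub on_char { my ($expat, $text) = @_; $buf .= $text }

sub on_end {
    my ($expat, $element) = @_;
    (my $text = $buf) =~ s/^\s+|\s+$//g;
    $buf = '';

    if ($state eq 'seeking') {
        # Assumption: TOC entries live in <toc> and are not the real section.
        if ($element ne 'toc' && $text =~ /\Q$needle\E/) {
            ($state, $start_depth) = ('collecting', $depth);
        }
    }
    elsif ($state eq 'collecting') {
        # Stop when we leave the enclosing subtree or hit the next numbered heading.
        if ($depth < $start_depth || $text =~ /^\d+(?:\.\d+)+\s/) {
            $state = 'done';
        }
        elsif (length $text) {
            push @section, $text;
        }
    }
    $depth--;
}

XML::Parser->new(
    Handlers => { Start => \&on_start, Char => \&on_char, End => \&on_end },
)->parse($xml);

print "$_\n" for @section;
```

Because only depth and a small state flag are tracked, the same handlers work whether the section sits two or ten levels deep, and they also cope with flat files where everything is a sibling.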

Re: Search and Extract from XML when path is unknown
by Anonymous Monk on Jul 10, 2013 at 07:19 UTC

    Am I on the right track with this?

    Kinda, but not really. There is no need to look up the parents to construct/generate an XPath for a subsequent search. XPath/XML is about paths, and .. means parent, just like in file paths.

    Any pointers to how this can be done?

    :) I gave you plenty of examples, how you don't have to name the tags you're looking for, how to find a node with text you want, and all siblings until some other tag ... maybe you need something stronger than hints :)?

    Can this task be done easily using Perl/RegEx parsing on text files rather than XML files?

    No. Consider your questions up to this one (walk tree); now imagine you also have to build that tree (libxml job) -- that's a lot more work

    Also, forget about XML::Parser; it's too low level. If you're tempted by that approach, use XML::Twig :) It comes with many examples/tutorials.
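The hint in this reply (find the node by its text without naming its tag, then take all siblings until some other marker) might look something like the following with XML::LibXML. It is a sketch under assumptions: TOC entries are taken to sit inside a toc element, and the next section is assumed to start with a number such as "3.4 ".

```perl
use strict;
use warnings;
use XML::LibXML;

# Well-formed adaptation of the sample from the question (assumption).
my $xml = <<'XML';
<root>
  <part>
    <sect><toc>1.1 Design Purpose...</toc></sect>
    <sect>
      <sect>
        <header2>Purpose</header2>
        <tag>3.3 Design Purpose</tag>
        <tag> Design purpose and description </tag>
        <tag>This is a design XZY document for Project </tag>
        <tag>design specification details </tag>
        <tag>3.4 application purpose</tag>
        <tag> app details </tag>
        <tag> more app details</tag>
      </sect>
    </sect>
  </part>
</root>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# No tag names and no path: any leaf element containing the text,
# as long as it is not (inside) a <toc> element.
my ($start) = $doc->findnodes(
    q{//*[not(*)][contains(normalize-space(.), "Design Purpose")]}
  . q{[not(ancestor-or-self::toc)]}
);
die "section heading not found\n" unless $start;

# following-sibling::* yields the later sibling elements in document order;
# collect them until the next numbered heading.
my @section;
for my $node ($start->findnodes('following-sibling::*')) {
    my $text = $node->textContent;
    $text =~ s/^\s+|\s+$//g;
    last if $text =~ /^\d+(?:\.\d+)+\s/;    # e.g. "3.4 application purpose"
    push @section, $text;
}

print "$_\n" for @section;
```

The sibling walk never needs the path from the root, which is why building one from the parent nodes is unnecessary.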

Re: Search and Extract from XML when path is unknown
by sundialsvc4 (Abbot) on Jul 10, 2013 at 12:13 UTC

    First of all, I would definitely say, use XPath expressions to do everything.   Use a libxml2-based Perl package.

    Second, you might need to take a “branch and bound” type of approach: use one set of expressions to carve the total data structure into big chunks that you can iterate through, and then, within each subtree that you find, use other XPaths to find nodes of interest. Having found one, your Perl logic may need to do some tree-walking to see if “these are the droids you’re looking for.” But let XPath do as much of the work for you as possible.
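One possible reading of this two-stage approach, sketched with XML::LibXML: the outer XPath relies only on the root and part tags that every document is said to have, and the inner XPath is relative to each chunk. The final check (a real heading has sibling elements after it, a TOC entry may not) is a hypothetical rule chosen for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Well-formed adaptation of the sample from the question (assumption).
my $xml = <<'XML';
<root>
  <part>
    <sect><toc>1.1 Design Purpose...</toc></sect>
    <sect>
      <sect>
        <header2>Purpose</header2>
        <tag>3.3 Design Purpose</tag>
        <tag> Design purpose and description </tag>
        <tag>3.4 application purpose</tag>
      </sect>
    </sect>
  </part>
</root>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Stage 1: carve the document into big chunks using only the known tags.
my @chunks = $doc->findnodes('/root/part/*');

# Stage 2: a relative XPath (leading ".") searches inside one chunk only.
my $inner = q{.//*[not(*)][contains(normalize-space(.), "Design Purpose")]};

my @found;
for my $chunk (@chunks) {
    for my $hit ($chunk->findnodes($inner)) {
        # Tree-walking check in Perl: keep the hit only if something follows it.
        my @body = $hit->findnodes('following-sibling::*');
        next unless @body;
        push @found, $hit;
    }
}

printf "%s (%s)\n", $_->textContent, $_->nodePath for @found;
```

Keeping the chunking and the per-chunk search as separate expressions makes it easier to swap in different rules for documents with a different layout.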

Node Type: perlquestion [id://1043413]
Approved by hdb