
Having trouble with siblings

by madbee (Acolyte)
on Jun 30, 2013 at 08:22 UTC (#1041556, perlquestion)
madbee has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I'm still trying to parse the same XML as before, but now I'm having trouble with siblings. I have 2 versions of the XML file and need to come up with a generic script that extracts the values in either case. I'm pasting the XML first and then my attempt.

When I run my script, it doesn't return any siblings, just 1.

XML File: Version 1

<root>
  <part>
    <sect>
      <header>
        1. Purpose and rationale
        <P>purpose 1</p>
        <p>Purpose 2</p>
        <p>purpose 3</p>
        <l>
          <li>purpose list 1</li>
          <li>list2</li>
        </l>
      </header>
    </sect>
  </part>
</root>

Objective: In this scenario, I need to search for "Purpose and rationale". If found, I need to extract all siblings between <header> and </header>.

The same section, "Purpose and rationale", can exist in another XML file but in a different format. The challenge is to have 1 generic script that handles both scenarios.

XML Scenario 2:

<root>
  <part>
    <sect>
      <header>
        1. Purpose and rationale
        <P>purpose 1</p>
        <p>Purpose 2</p>
        <p>purpose 3</p>
        <l>
          <li>purpose list 1</li>
          <li>list2</li>
        </l>
        <p>2. Some other heading</p>
        <p>content 1</p>
        <p>content 2</p>
      </header>
    </sect>
  </part>
</root>

For scenario 2, I need to extract all the siblings under Purpose and rationale only until "Some other heading". This heading title can change. The only identifier is that the node begins with a number.

In this XML, content from 2 headers is mixed up, so I only need to extract the content from the siblings of the "Purpose and rationale" section.

My attempted code is below:

my $dom = XML::LibXML->new->parse_file($file);
my $study_str = 'Purpose and rationale|Study purpose|Study rationale'
for my $search ('/root/part/sect/header') {
    my $nodeset = $dom->find($search);
    foreach my $node($nodeset -> get_nodelist) {
        $node -> string_value;
        if ($node =~ m/$study_str/i) {
            my $protocol = $node;
            print $protocol,"\n";
            #go to the next sibling
            while ($node -> { Node }) {
                if ($node -> { Node } -> getNextSibling ) {
                    $node -> { Node } = $node -> getNextSibling;
                    return $node -> { Node };
                }
            }
        }
    }
}

This only returns the value within the header tags and none of the children. Obviously, I'm doing something wrong. I'm hoping for some help here to extract the content I need.

Thanks again for your help, and apologies if the question is not clear.

Regards, Madbee

Replies are listed 'Best First'.
Re: Having trouble with siblings
by poj (Parson) on Jun 30, 2013 at 14:27 UTC

    This extracts all the child nodes into an array. I've printed them out with numbers and delimiters so you can see them individually. I suggest you loop through them, extracting all or any that you want.

    #!perl
    use strict;
    use XML::LibXML;

    my $dom = XML::LibXML->load_xml( IO => *DATA );
    my $study_str = 'Purpose and rationale|Study purpose|Study rationale';

    for my $search ('//header') {
        my $nodeset = $dom->find($search);
        foreach my $node ($nodeset->get_nodelist){
            if ($node =~ m/$study_str/i){
                my @childnodes = $node->nonBlankChildNodes();
                my $n=1;
                for (@childnodes){
                    print '#'.$n++.'# '.$_->toString."##\n\n";
                }
            }
        }
    }
    __DATA__
    <root>
    <part><sect>
    <header>
    1. Purpose and rationale doc 1
    <p>purpose 1</p>
    <p>Purpose 2</p>
    <p>purpose 3</p>
    <ul>
    <li>purpose list 1</li>
    <li>list2</li>
    </ul>
    <p>2. Some other heading</p>
    <p>content 1</p>
    <p>content 2</p>
    </header>
    </sect></part>
    </root>
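One way the suggested loop over the child nodes might look for scenario 2 is sketched below. It is only a sketch under stated assumptions: a "numbered heading" is taken to be any child whose text starts with digits followed by a period, the `<P>` tags from the question are lower-cased so the sample is well-formed, and the names `@kept` and `$in_section` are made up for illustration.

```perl
#!perl
use strict;
use warnings;
use XML::LibXML;

# Sample input: scenario 2 from the question, with <P> lower-cased
# so that the document parses.
my $xml = <<'XML';
<root><part><sect><header>
1. Purpose and rationale
<p>purpose 1</p>
<p>Purpose 2</p>
<p>purpose 3</p>
<l><li>purpose list 1</li><li>list2</li></l>
<p>2. Some other heading</p>
<p>content 1</p>
<p>content 2</p>
</header></sect></part></root>
XML

my $dom       = XML::LibXML->load_xml( string => $xml );
my $study_str = qr/Purpose and rationale|Study purpose|Study rationale/i;

my @kept;    # hypothetical name: nodes belonging to the wanted section
for my $header ( $dom->findnodes('//header') ) {
    my $in_section = 0;
    for my $child ( $header->nonBlankChildNodes ) {
        my $text = $child->textContent;

        # Assumption: a heading is any node whose text starts with "N."
        if ( $text =~ /^\s*\d+\./ ) {
            last if $in_section;    # next heading reached: stop collecting
            $in_section = $text =~ $study_str ? 1 : 0;
        }
        push @kept, $child if $in_section;
    }
}

print $_->toString, "\n" for @kept;
```

Because scenario 1 has no second numbered heading, the same loop simply runs to the end of `<header>` there, so one script should cover both layouts as long as the heading assumption holds.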
Re: Having trouble with siblings
by Anonymous Monk on Jun 30, 2013 at 14:34 UTC

      Aha, the node test node() will select all types of nodes, even text() nodes

      • so find a <header> which contains 'rationale'
      • and select every child node of that header
      • and filter this nodeset for nodes which have a following-sibling with tagname <l> or which themselves have tagname <l>

      xmllint.exe --xpath " //header[ contains(.,'rationale') ]/node()[ following-sibling::l or self::l] " fudge

      1. Purpose and rationale
      <p>purpose 1</p>
      <p>Purpose 2</p>
      <p>purpose 3</p>
      <l> <li>purpose list 1</li> <li>list2</li> </l>
      2. Purpose and rationale
      <p>2 purpose 1</p>
      <p>2 Purpose 2</p>
      <p>2 purpose 3</p>
      <l> <li>2 purpose list 1</li> <li>2 list2</li> </l>

      Naturally, if there is more than one <l> you can limit to the first <l> with position()

      xmllint.exe --xpath " //header[ contains(.,'rationale') ]/node()[ following-sibling::l[ position()=1 ] or self::l[ position()=1 ]  ]   " fudge
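The first XPath expression above can also be run from Perl rather than from xmllint. The following is a minimal sketch, assuming XML::LibXML is installed and using a shortened, made-up sample document in place of the `fudge` file:

```perl
#!perl
use strict;
use warnings;
use XML::LibXML;

# Shortened sample; the real input file is not shown in the thread.
my $xml = <<'XML';
<root><part><sect><header>1. Purpose and rationale
<p>purpose 1</p>
<l><li>purpose list 1</li></l>
<p>2. Some other heading</p>
</header></sect></part></root>
XML

my $dom = XML::LibXML->load_xml( string => $xml );

# Same expression as the xmllint call: every child of the matching <header>
# that is an <l> or has an <l> somewhere after it among its siblings.
my $xpath = q{//header[contains(.,'rationale')]/node()[following-sibling::l or self::l]};
my @nodes = $dom->findnodes($xpath);

print $_->toString for @nodes;
```

Note that `node()` also matches the whitespace-only text nodes between elements, so the result list may contain more entries than the visible elements.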

Node Type: perlquestion [id://1041556]
Approved by frozenwithjoy