Having trouble with siblings

madbee has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I'm still trying to parse the same XML as before, but now I'm having trouble with siblings. I have two versions of the XML file and need to come up with a generic script that extracts the values in either case. I'm pasting the XML first and then my attempt.

When I run my script, it doesn't return any siblings, just one node.

XML File: Version 1
<root>
        <part>
            <sect>
               <header>
                   1. Purpose and rationale
                   <p>purpose 1</p>
                   <p>Purpose 2</p>
                   <p>purpose 3</p>
                   <l>
                      <li>purpose list 1</li>
                      <li>list2</li>
                    </l>                      
                </header>
            </sect>
         </part>
     </root>

Objective: In this scenario, I need to search for "Purpose and rationale". If it is found, I need to extract all siblings between <header> and </header>.

The same "Purpose and rationale" section can exist in another XML file but in a different format. The challenge is to have one generic script that handles both scenarios.

XML Scenario 2:

 
 <root>
        <part>
            <sect>
               <header>
                   1. Purpose and rationale
                   <p>purpose 1</p>
                   <p>Purpose 2</p>
                   <p>purpose 3</p>
                   <l>
                      <li>purpose list 1</li>
                      <li>list2</li>
                    </l>                      
                  <p>2. Some other heading</p>
                  <p>content 1</p>
                  <p>content 2</p>
                </header>
            </sect>
         </part>
     </root>

For scenario 2, I need to extract all the siblings under Purpose and rationale only until "Some other heading". This heading title can change. The only identifier is that the node begins with a number.

In this XML, content from two headers is mixed together, so I only need to extract the content from the siblings of the "Purpose and rationale" section.

My attempted code is below:

    my $dom = XML::LibXML->new->parse_file($file);
    my $study_str = 'Purpose and rationale|Study purpose|Study rationale';
    for my $search ('/root/part/sect/header') {
        my $nodeset = $dom->find($search);
        foreach my $node ($nodeset->get_nodelist) {
            $node->string_value;

            if ($node =~ m/$study_str/i) {
                my $protocol = $node;
                print $protocol, "\n";
                # go to the next sibling
                while ($node->{Node}) {
                    if ($node->{Node}->getNextSibling) {
                        $node->{Node} = $node->getNextSibling;
                        return $node->{Node};
                    }
                }
            }
        }
    }

This only returns the value within the header tags and none of the children. Obviously, I'm doing something wrong. I'm hoping for some help here to extract the content I need.

Thanks again for your help, and apologies if the question is not clear.

Regards, Madbee


Re: Having trouble with siblings by poj (Abbot) on Jun 30, 2013 at 14:27 UTC
This extracts all the child nodes into an array. I've printed them out with numbers and delimiters so you can see them individually. I suggest you loop through them, extracting all or any that you want.

    #!perl
    use strict;
    use XML::LibXML;

    my $dom = XML::LibXML->load_xml( IO => *DATA );
    my $study_str = 'Purpose and rationale|Study purpose|Study rationale';
    for my $search ('//header') {
        my $nodeset = $dom->find($search);
        foreach my $node ($nodeset->get_nodelist) {
            if ($node =~ m/$study_str/i) {
                my @childnodes = $node->nonBlankChildNodes();
                my $n = 1;
                for (@childnodes) {
                    print '#'.$n++.'# '.$_->toString."##\n\n";
                }
            }
        }
    }
    __DATA__
    <root>
    <part><sect>
    <header>
    1. Purpose and rationale doc 1
    <p>purpose 1</p>
    <p>Purpose 2</p>
    <p>purpose 3</p>
    <ul>
    <li>purpose list 1</li>
    <li>list2</li>
    </ul>
    <p>2. Some other heading</p>
    <p>content 1</p>
    <p>content 2</p>
    </header>
    </sect></part>
    </root>

poj
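The loop suggested above could look something like the following. This is only a sketch under stated assumptions: `section_nodes` is a hypothetical helper name, and a "heading" is assumed to be any child node whose text starts with a number followed by a dot (for example "1." or "2."), which matches both sample files but may need adjusting for real data.

```perl
#!perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns the child nodes of one <header> that
# belong to the wanted section.
sub section_nodes {
    my ($header, $wanted_re) = @_;
    my @found;
    my $in_section = 0;
    for my $child ($header->nonBlankChildNodes) {
        (my $text = $child->textContent) =~ s/^\s+|\s+$//g;
        # Assumption: a heading is any child whose text starts with "<number>."
        if ($text =~ /^\d+\.\s/) {
            $in_section = $text =~ $wanted_re ? 1 : 0;
            next;    # skip the heading node itself
        }
        push @found, $child if $in_section;
    }
    return @found;
}

my $xml = <<'XML';
<root><part><sect><header>
  1. Purpose and rationale
  <p>purpose 1</p>
  <p>Purpose 2</p>
  <p>purpose 3</p>
  <l><li>purpose list 1</li><li>list2</li></l>
  <p>2. Some other heading</p>
  <p>content 1</p>
  <p>content 2</p>
</header></sect></part></root>
XML

my $dom    = XML::LibXML->load_xml(string => $xml);
my $wanted = qr/Purpose and rationale|Study purpose|Study rationale/i;

for my $header ($dom->findnodes('//header')) {
    print $_->toString, "\n" for section_nodes($header, $wanted);
}
```

Because collection simply runs to the end of the header when no second numbered heading appears, the same loop should cover the version 1 file as well.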
Re: Having trouble with siblings by Anonymous Monk on Jun 30, 2013 at 14:34 UTC
Same tricks from Re: Get Node Value from irregular XML (xpather.pl)

    "//header[ contains(.,'rationale') ]/* "

    xmllint.exe --xpath " //header[ contains(.,'rationale') ]/l/preceding-sibling::* " fudge
    xmllint.exe --xpath " //header[ contains(.,'rationale') ]/child::text() " fudge

where fudge is your data
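The same XPath expressions can also be run from Perl instead of xmllint. A minimal sketch, assuming XML::LibXML's `findnodes` and a cut-down inline copy of the sample data (the `fudge` file itself is not shown in the post):

```perl
use strict;
use warnings;
use XML::LibXML;

my $dom = XML::LibXML->load_xml(string => <<'XML');
<root><part><sect><header>
  1. Purpose and rationale
  <p>purpose 1</p>
  <p>Purpose 2</p>
  <l><li>purpose list 1</li><li>list2</li></l>
</header></sect></part></root>
XML

my $header = q{//header[ contains(., 'rationale') ]};

# Element children of the header that come before <l>
my @before_l = $dom->findnodes("$header/l/preceding-sibling::*");

# Text-node children of the header; the heading text lives here.
# Whitespace-only text nodes between elements are filtered out.
my @texts = grep { /\S/ } map { $_->data }
            $dom->findnodes("$header/child::text()");

print "element: ", $_->toString, "\n" for @before_l;
print "text: $_\n" for @texts;
```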
Re^2: Having trouble with siblings by Anonymous Monk on Jun 30, 2013 at 15:34 UTC
Aha, the node test node() will select all types of nodes, even text() nodes.

So find a <header> which contains 'rationale', select every child node of that header, and filter this nodeset for nodes which have a following-sibling with tagname `<l>` or which have tagname `<l>`:

    xmllint.exe --xpath " //header[ contains(.,'rationale') ]/node()[ following-sibling::l or self::l] " fudge

    1. Purpose and rationale
    <p>purpose 1</p>
    <p>Purpose 2</p>
    <p>purpose 3</p>
    <l>
    <li>purpose list 1</li>
    <li>list2</li>
    </l>
    2. Purpose and rationale
    <p>2 purpose 1</p>
    <p>2 Purpose 2</p>
    <p>2 purpose 3</p>
    <l>
    <li>2 purpose list 1</li>
    <li>2 list2</li>
    </l>

Naturally, if there is more than one `<l>` you can limit to the first `<l>` with position():

    xmllint.exe --xpath " //header[ contains(.,'rationale') ]/node()[ following-sibling::l[ position()=1 ] or self::l[ position()=1 ] ] " fudge
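One caveat, as far as I understand XPath 1.0: inside a predicate, `following-sibling::l[ position()=1 ]` is only tested for being non-empty, so it is true for every node that has any later `<l>`, and `self::l[ position()=1 ]` is true for every `<l>`. A sketch of a different way to stop at the first `<l>` is to keep only nodes that have no `<l>` before them. The sample data with two `<l>` elements is made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

my $dom = XML::LibXML->load_xml(string => <<'XML');
<root><header>
  1. Purpose and rationale
  <p>purpose 1</p>
  <l><li>list 1</li></l>
  <p>after first list</p>
  <l><li>list 2</li></l>
</header></root>
XML

# Keep every child node that has no <l> among its preceding siblings,
# i.e. everything up to and including the first <l>.
my $xpath = q{//header[ contains(., 'rationale') ]}
          . q{/node()[ not(preceding-sibling::l) ]};

# Drop whitespace-only text nodes that sit between the elements.
my @nodes = grep { $_->toString =~ /\S/ } $dom->findnodes($xpath);

print $_->toString, "\n" for @nodes;
```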
