Extracting HTML content between the h tags

vagabonding electron has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I am parsing a number of HTML pages (> 400) which have the structure shown in the <DATA> part of the script below.
The relevant part of the page begins with <div id="bodyContent">, so I put only this part in the script.
What I need is the text between particular <h2> tags.
I used HTML::TreeBuilder::XPath but could not find out how to formulate an intersection there (e.g. following <h2>[1] and preceding <h2>[2] at the same time).
As a workaround I take the preceding siblings of each <h2>[i] tag in sequence, stringify the output, and use substr to cut off the preceding chunks of text.
This works (after some clean-up), but the code is no fun to look at.
Please give me a hint on how I could make it better.
Thank you!
VE

#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple;
use HTML::TreeBuilder::XPath;

my $page;
$page .= $_ while <DATA>;
my $p = HTML::TreeBuilder::XPath->new_from_content( $page );

my @page_content =$p->findnodes( '//div[@id="bodyContent"]' );

for my $content ( @page_content )
{
        my @preface = $content->findvalues( './h2[1]/preceding-sibling::*' );
        my $preface_text;
        my ( $keyword, $actualised );
        for my $pref ( @preface )
        {
            # $pref =~ s/^\s*(\S+)/$1/;
            $preface_text .= $pref;
            
            # print $preface_text, "--\n";
            
            ( undef, $keyword ) = split /:\s*?/, $pref, 2 if $pref =~ /^\s*?Key words/;
            ( undef, $actualised ) = split /:\s*?/, $pref, 2 if $pref =~ /^Actualised/;
        }
        print $keyword, "\n";
        print $actualised, "\n";
        
        my @problems = $content->findvalues( './h2[2]/preceding-sibling::*' );
        my $probl;
        $probl .= $_ for @problems;
        $probl = substr( $probl, length( $preface_text) );
        
        print $probl, "\n";
        
        my @solution_1 = $content->findvalues( './h2[3]/preceding-sibling::*' );
        my $sol;
        $sol .= $_ for @solution_1;
        $sol = substr( $sol, length( $preface_text ) + length( $probl ) );
        print $sol, "\n";
        
        my @solution_2 = $content->findvalues( './h2[4]/preceding-sibling::*' );
        my $sol_2;
        $sol_2 .= $_ for @solution_2;
        $sol_2 = substr( $sol_2, length( $preface_text ) + length( $probl ) + length( $sol ) );
        print $sol_2 , "\n";
        
}

__DATA__
<head>
</head>
<body>
<div id="bodyContent">
        <!-- start content -->
<p>Key words: Some words. 
</p><p>Date:  2012-01-16
</p><p>Actualised: 2008-01-08 
</p><p>Commented: 05.06.2007
</p><p>Encoded: Some code.  
</p>
<h2> <span class="mw-headline" id="Problem"> Problem </span></h2>
<p>Problem description.
</p><p>Another description.
</p>
<h2> <span class="mw-headline" id="Solution1"> Solution 1 </span></h2>
<p>Solution description.
</p>
<h2> <span class="mw-headline" id="Solution2"> Solution 2 </span></h2>
<p>Solution description.
</p>
<h2> <span class="mw-headline" id="Comment"> Comment. </span></h2>
<p>Text of the comment.
</p><p><br />
</p>
</div>
<hr />
</body>

Re: Extracting HTML content between the h tags by Gangabass (Vicar) on Aug 05, 2012 at 12:55 UTC
Something like this (just for the first `p`s): `my @nodes = $p->findnodes('//h2[2]/preceding-sibling::p[preceding-sibling::h2[1]]');`
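The expression above covers the first section. One common way to generalise it to the N-th section is to count the headings that precede each paragraph. What follows is only a minimal sketch on a cut-down copy of the page from the question. It assumes that the XPath engine behind HTML::TreeBuilder::XPath supports the standard XPath 1.0 `count()` and `not()` functions. The helper name `section_text` is hypothetical and does not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Cut-down sample page (an assumption, modelled on the question's DATA section).
my $html = <<'HTML';
<html><head></head><body>
<div id="bodyContent">
<p>Key words: Some words.</p>
<h2><span id="Problem">Problem</span></h2>
<p>Problem description.</p><p>Another description.</p>
<h2><span id="Solution1">Solution 1</span></h2>
<p>Solution description.</p>
</div>
<hr />
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Hypothetical helper: text of the <p> siblings that follow the $n-th <h2>
# and have exactly $n <h2> siblings before them, i.e. that come before the
# next <h2> (or before the end of the <div> if there is no next heading).
sub section_text {
    my ( $t, $n ) = @_;
    my $xpath = sprintf
        q{//div[@id='bodyContent']/h2[%d]/following-sibling::p[count(preceding-sibling::h2) = %d]},
        $n, $n;
    return map { $_->as_trimmed_text } $t->findnodes($xpath);
}

# Paragraphs with no <h2> before them form the preface.
my @preface = map { $_->as_trimmed_text }
    $tree->findnodes(q{//div[@id='bodyContent']/p[not(preceding-sibling::h2)]});
my @problem  = section_text( $tree, 1 );
my @solution = section_text( $tree, 2 );

print "Preface:  @preface\n";
print "Problem:  @problem\n";
print "Solution: @solution\n";
```

Because the predicate only counts preceding headings, the last section is selected the same way whether or not another `h2` follows it.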
Re^2: Extracting HTML content between the h tags by vagabonding electron (Curate) on Aug 05, 2012 at 14:22 UTC
Thank you a lot! I did not know this syntax. One more question, if I dare :-) In about 10 pages the last h2-tag is missing, so I used the following workaround: `my @solution_2 = $content->findvalues( './h2[4]/preceding-sibling::*' ); unless ( @solution_2 ) { @solution_2 = $content->findvalues( '//hr/preceding-sibling::*' ); }` I tried the same with your syntax: `@solution_2 = $content->findvalues( '//hr/preceding-sibling::p[preceding-sibling::h2[3]]' );` but I only get an uninitialized value. I understood the syntax as: "search the siblings but stop if the tag in brackets appears". Is this correct? If so, what am I doing wrong in the above attempt? Thank you!
Re^3: Extracting HTML content between the h tags by Gangabass (Vicar) on Aug 06, 2012 at 01:01 UTC
According to your HTML, the `preceding-sibling` of `hr` is the `div` tag, not a `p` tag. So this code will find all `p`s after the last `h2`: `$p->findnodes('//h2[4]/following-sibling::p');` Or (more flexible): `$p->findnodes('//h2[last()]/following-sibling::p');`
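The point about `hr` can be checked directly. The sketch below is only an illustration on a cut-down page in which the final heading is missing; the sample HTML and the variable names are assumptions, not part of the thread. It lists the element siblings that precede `hr` and then selects the paragraphs after the last `h2` that is present.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Cut-down sample page (an assumption) in which the final heading is missing.
my $html = <<'HTML';
<html><head></head><body>
<div id="bodyContent">
<p>Key words: Some words.</p>
<h2><span id="Problem">Problem</span></h2>
<p>Problem description.</p>
<h2><span id="Solution1">Solution 1</span></h2>
<p>First solution.</p>
<h2><span id="Solution2">Solution 2</span></h2>
<p>Second solution.</p>
</div>
<hr />
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# <hr> is a child of <body>, so its preceding siblings are the children of
# <body> that come before it, not the <p> elements inside the <div>.
my @hr_siblings   = map { $_->tag } $tree->findnodes(q{//hr/preceding-sibling::*});
my @hr_paragraphs = $tree->findnodes(q{//hr/preceding-sibling::p});

# last() picks the final <h2> among its siblings, however many there are.
my @last_section = map { $_->as_trimmed_text }
    $tree->findnodes(q{//h2[last()]/following-sibling::p});

print "element siblings before <hr>: @hr_siblings\n";
printf "<p> siblings before <hr>: %d\n", scalar @hr_paragraphs;
print "text after the last <h2>: @last_section\n";
```

A predicate in square brackets filters the nodes already selected by the axis; it does not tell the search where to stop.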
Re^4: Extracting HTML content between the h tags by vagabonding electron (Curate) on Aug 06, 2012 at 15:31 UTC
Re: Extracting HTML content between the h tags by flexvault (Monsignor) on Aug 05, 2012 at 13:39 UTC
vagabonding electron,

Just another way to look at the problem:

#!/usr/bin/perl
use strict;
use warnings;

my $start   = 0;     # becomes 1 once the bodyContent div has been seen
my $h2      = 0;     # 1 while inside an <h2> that spans several lines
my $keyword = "";
my $end     = 0;

while ( my $Content = <DATA> ) {
    chomp( $Content );
    my $content = lc( $Content );
    if ( $start == 0 ) {
        # $content is lower-cased, so the search string must be lower case too.
        $start = 1 if index( $content, '<div id="bodycontent">' ) >= 0;
    }
    else {
        if ( $h2 == 0 ) {
            my $h = index( $content, '<h2>' );
            if ( $h >= 0 ) {
                $h2++;
                $keyword = substr( $Content, $h + 4 );
                my $tmp = lc( $keyword );
                $end = index( $tmp, '</h2>' );
                if ( $end >= 0 ) {
                    # Opening and closing tag are on the same line.
                    $keyword = substr( $keyword, 0, $end );
                    print "$keyword\n\n";
                    $h2 = 0; $keyword = "";
                }
            }
        }
        else {
            # Inside a multi-line <h2>: collect until the closing tag appears.
            $end = index( $content, '</h2>' );
            if ( $end >= 0 ) {
                $keyword .= substr( $Content, 0, $end );
                print "$keyword\n\n";
                $h2 = 0; $keyword = "";
            }
            else { $keyword .= $Content; }
        }
    }
}

__DATA__
<head>
</head>
<body>
<div id="bodyContent">
        <!-- start content -->
<p>Key words: Some words.
</p><p>Date:  2012-01-16
</p><p>Actualised: 2008-01-08
</p><p>Commented: 05.06.2007
</p><p>Encoded: Some code.
</p>
<h2> <span class="mw-headline" id="Problem"> Problem </span></h2>
<p>Problem description.
</p><p>Another description.
</p>
<h2> <span class="mw-headline" id="Solution1"> Solution 1 </span></h2>
<p>Solution description.
</p>
<h2> <span class="mw-headline" id="Solution2"> Solution 2 </span></h2>
<p>Solution description.
</p>
<h2> <span class="mw-headline" id="Comment"> Comment. </span>
</h2>
<p>Text of the comment.
</p><p><br />
</p>
</div>
<hr />
</body>

When working with HTML, you can't be sure that everything is lined up correctly. Notice that I put an end-of-line before the last </h2> just to make sure the code could handle multiple lines.

Another approach would be to process the HTML and extract the <h2>...</h2> parts into an array, and then process the array to eliminate span, font tags, etc. once you have complete information in each element of the array.

Good Luck!

"Well done is better than well said." - Benjamin Franklin
Re^2: Extracting HTML content between the h tags by vagabonding electron (Curate) on Aug 05, 2012 at 14:34 UTC
flexvault, thank you very much! Since I have read many times that one should not parse HTML without a module, I did not try this before :-) I will certainly check this approach out. I think it could be difficult in the case where the last hr-tag is missing (described in Re^3: Extracting HTML content between the h tags). Thanks again!
Re^3: Extracting HTML content between the h tags by flexvault (Monsignor) on Aug 05, 2012 at 15:14 UTC
vagabonding electron,

For the missing hr-tag, just test for $keyword after the 'while' loop:

if ( $keyword )    ## Nearly the same as if ( $keyword ne "" )
{
    print "$keyword\n";
}

Whenever I start a new project/gig, I try to think whether this is similar to something I've done before, and if it is, then I use that code or technique as the starting point. If it is totally new (very rare), I still have a bag of tricks (subroutines) that I copy (use ...) into the new work.

Look at everything you do today as something you may be able to use for the rest of your programming life. You're lucky to have Perl, since a lot of the code I wrote before Perl is worthless today, but knowledge and techniques can be applied to Perl!

Good Luck!

"Well done is better than well said." - Benjamin Franklin
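The "test after the loop" idea can also be used to collect the text between headings rather than the headings themselves. The following is a rough sketch in core Perl only, not flexvault's code. It assumes that every `<h2>...</h2>` sits on a single line and that `</div>` marks the end of the relevant block. The names `$flush`, `%section` and `@order` are made up for the illustration. Stripping tags with a regex is crude and is only meant for markup as regular as the sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Cut-down sample input (an assumption, modelled on the question's DATA section).
my @lines = split /\n/, <<'HTML';
<div id="bodyContent">
<p>Key words: Some words.</p>
<h2> <span class="mw-headline" id="Problem"> Problem </span></h2>
<p>Problem description.
</p><p>Another description.
</p>
<h2> <span class="mw-headline" id="Solution1"> Solution 1 </span></h2>
<p>Solution description.
</p>
</div>
<hr />
HTML

my ( %section, @order );
my $current;        # heading being collected; undef before the first <h2>
my $buffer = '';

# Store the buffered lines under the current heading, with tags crudely removed.
my $flush = sub {
    return unless defined $current;
    ( my $text = $buffer ) =~ s/<[^>]+>/ /g;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    $section{$current} = $text;
    push @order, $current;
};

for my $line (@lines) {
    last if $line =~ m{</div>}i;    # assumed end of the relevant block
    if ( $line =~ m{<h2\b[^>]*>(.*?)</h2>}i ) {
        my $heading = $1;
        $flush->();                 # close the previous section, if any
        $heading =~ s/<[^>]+>//g;
        $heading =~ s/^\s+|\s+$//g;
        $current = $heading;
        $buffer  = '';
    }
    elsif ( defined $current ) {
        $buffer .= "$line\n";
    }
}
$flush->();    # nothing closes the last section, so flush it after the loop

print "$_: $section{$_}\n" for @order;
```

The final call to `$flush` plays the same role as the `if ( $keyword )` test above: whatever is still buffered when the input ends is kept instead of being silently dropped.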
Re: Extracting HTML content between the h tags by Anonymous Monk on Aug 05, 2012 at 12:37 UTC
perl htmltreexpather.pl fudge.html

------------------------------------------------------------------
HTML::Element=HASH(0xb338e4) 0.1.0.0
Key words: Some words.
/html/body/div/p
//div[@id='bodyContent']/p
//div[@id='bodyContent']/p
------------------------------------------------------------------
HTML::Element=HASH(0xb33944) 0.1.0.1
Date: 2012-01-16
/html/body/div/p[2]
//div[@id='bodyContent']/p[2]
//div[@id='bodyContent']/p[2]
------------------------------------------------------------------
HTML::Element=HASH(0xb339b4) 0.1.0.2
Actualised: 2008-01-08
/html/body/div/p[3]
//div[@id='bodyContent']/p[3]
//div[@id='bodyContent']/p[3]
------------------------------------------------------------------
HTML::Element=HASH(0xb33a24) 0.1.0.3
Commented: 05.06.2007
/html/body/div/p[4]
//div[@id='bodyContent']/p[4]
//div[@id='bodyContent']/p[4]
------------------------------------------------------------------
HTML::Element=HASH(0xb33a94) 0.1.0.4
Encoded: Some code.
/html/body/div/p[5]
//div[@id='bodyContent']/p[5]
//div[@id='bodyContent']/p[5]
------------------------------------------------------------------
HTML::Element=HASH(0xb33b14) 0.1.0.5
Problem
/html/body/div/h2
//div[@id='bodyContent']/h2
//div[@id='bodyContent']/h2
------------------------------------------------------------------
HTML::Element=HASH(0xb33bd4) 0.1.0.5.0
Problem
/html/body/div/h2/span
//span[@id='Problem']
//span[@id='Problem']
------------------------------------------------------------------
HTML::Element=HASH(0xb33bc4) 0.1.0.6
Problem description.
/html/body/div/p[6]
//div[@id='bodyContent']/p[6]
//div[@id='bodyContent']/p[6]
------------------------------------------------------------------
HTML::Element=HASH(0xb33c64) 0.1.0.7
Another description.
/html/body/div/p[7]
//div[@id='bodyContent']/p[7]
//div[@id='bodyContent']/p[7]
------------------------------------------------------------------
HTML::Element=HASH(0xb33ce4) 0.1.0.8
Solution 1
/html/body/div/h2[2]
//div[@id='bodyContent']/h2[2]
//div[@id='bodyContent']/h2[2]
------------------------------------------------------------------
HTML::Element=HASH(0xb33da4) 0.1.0.8.0
Solution 1
/html/body/div/h2[2]/span
//span[@id='Solution1']
//span[@id='Solution1']
------------------------------------------------------------------
HTML::Element=HASH(0xb33d94) 0.1.0.9
Solution description.
/html/body/div/p[8]
//div[@id='bodyContent']/p[8]
//div[@id='bodyContent']/p[8]
------------------------------------------------------------------
HTML::Element=HASH(0xb33e44) 0.1.0.10
Solution 2
/html/body/div/h2[3]
//div[@id='bodyContent']/h2[3]
//div[@id='bodyContent']/h2[3]
------------------------------------------------------------------
HTML::Element=HASH(0xb33f04) 0.1.0.10.0
Solution 2
/html/body/div/h2[3]/span
//span[@id='Solution2']
//span[@id='Solution2']
------------------------------------------------------------------
HTML::Element=HASH(0xb33f44) 0.1.0.11
Solution description.
/html/body/div/p[9]
//div[@id='bodyContent']/p[9]
//div[@id='bodyContent']/p[9]
------------------------------------------------------------------
HTML::Element=HASH(0xb33fb4) 0.1.0.12
Comment.
/html/body/div/h2[4]
//div[@id='bodyContent']/h2[4]
//div[@id='bodyContent']/h2[4]
------------------------------------------------------------------
HTML::Element=HASH(0xb34074) 0.1.0.12.0
Comment.
/html/body/div/h2[4]/span
//span[@id='Comment']
//span[@id='Comment']
------------------------------------------------------------------
HTML::Element=HASH(0xb34064) 0.1.0.13
Text of the comment.
/html/body/div/p[10]
//div[@id='bodyContent']/p[10]
//div[@id='bodyContent']/p[10]
------------------------------------------------------------------
##################################################################

Hmm, so I would use the stack approach, i.e. find `q{ //div[@id='bodyContent']/* }`. Everything before the first h2 tag is key/value pairs; after that, each h2 tag is the key, and the non-h2 tags that follow are the value.

#!/usr/bin/perl --
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $page = q{<html>
<head></head>
<body>
<div id="bodyContent">
<!-- start content -->
<p>Key words: Some words. </p>
<p>Date: 2012-01-16 </p>
<p>Actualised: 2008-01-08 </p>
<p>Commented: 05.06.2007 </p>
<p>Encoded: Some code. </p>
<h2> <span class="mw-headline" id="Problem"> Problem </span></h2>
<p>Problem description. </p>
<p>Another description. </p>
<h2> <span class="mw-headline" id="Solution1"> Solution 1 </span></h2>
<p>Solution description. </p>
<h2> <span class="mw-headline" id="Solution2"> Solution 2 </span></h2>
<p>Solution description. </p>
<h2> <span class="mw-headline" id="Comment"> Comment. </span></h2>
<p>Text of the comment. </p>
<p> <br/> </p>
</div>
<hr/>
</body>
</html>};

my $p = HTML::TreeBuilder::XPath->new_from_content( $page );

{
    my @nodes = $p->findnodes( q{//div[@id='bodyContent']/*});

    # Everything before the first h2 is a "key: value" paragraph.
    use List::AllUtils qw( before );
    my @before_h2 = before { $_->tag eq 'h2' } @nodes;
    splice @nodes, 0, scalar( @before_h2 );
    my %body = map { split ':', $_->as_trimmed_text, 2 } @before_h2;

    # After that, each h2 is a key and the following non-h2 nodes are its value.
    while( @nodes ){
        my $key = shift(@nodes)->as_trimmed_text;
        while( @nodes and $nodes[0]->tag ne 'h2' ){
            my $val = shift(@nodes)->as_trimmed_text;
            $body{ $key } .= $val;
        }
    }
    use Data::Dump;
    dd\%body;
}
__END__
{
  "Actualised" => " 2008-01-08",
  "Comment."   => "Text of the comment.",
  "Commented"  => " 05.06.2007",
  "Date"       => " 2012-01-16",
  "Encoded"    => " Some code.",
  "Key words"  => " Some words.",
  "Problem"    => "Problem description.Another description.",
  "Solution 1" => "Solution description.",
  "Solution 2" => "Solution description.",
}
Re^2: Extracting HTML content between the h tags by Anonymous Monk on Aug 05, 2012 at 12:53 UTC
The `"Comment."` key stuck out, so a better idea might be to use the @id attribute as the key: `my $key = shift(@nodes)->findvalue('*[@id]/@id');`
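Put together, keying the sections by id could look roughly like the sketch below. It is a simplified variant of the stack approach that walks the child nodes once instead of shifting them off an array. The sample HTML, the fallback to the heading text when no id is present, and the choice to join paragraphs with a space are assumptions made for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Cut-down sample page (an assumption); the last heading has no id on purpose.
my $html = <<'HTML';
<html><head></head><body>
<div id="bodyContent">
<p>Key words: Some words.</p>
<h2><span class="mw-headline" id="Problem"> Problem </span></h2>
<p>Problem description.</p><p>Another description.</p>
<h2><span class="mw-headline" id="Solution1"> Solution 1 </span></h2>
<p>Solution description.</p>
<h2>Comment.</h2>
<p>Text of the comment.</p>
</div>
<hr />
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my ( %body, $key );
for my $node ( $tree->findnodes(q{//div[@id='bodyContent']/*}) ) {
    if ( $node->tag eq 'h2' ) {
        # Prefer the id of a child element; fall back to the heading text.
        $key = $node->findvalue('*[@id]/@id') || $node->as_trimmed_text;
        $body{$key} = [];
    }
    elsif ( defined $key ) {
        # Nodes before the first <h2> are skipped here; the preface would be
        # handled separately, as in the reply above.
        push @{ $body{$key} }, $node->as_trimmed_text;
    }
}

for my $id ( sort keys %body ) {
    print "$id => ", join( ' ', @{ $body{$id} } ), "\n";
}
```

Since a section simply ends where the next `h2` or the end of the `div` is reached, a missing final heading needs no special case.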
Re^3: Extracting HTML content between the h tags by vagabonding electron (Curate) on Aug 05, 2012 at 14:09 UTC
Thank you very much! I just tried both approaches; it works even if the last h2-tag is missing (this happens in about 10 of the > 400 pages, for which I used the following workaround: `my @solution_2 = $content->findvalues( './h2[4]/preceding-sibling::*' ); unless ( @solution_2 ) { @solution_2 = $content->findvalues( '//hr/preceding-sibling::*' ); }` ... with substr as before ...). Fortunately they have only one hr-tag in the page :-) With your approach it is not necessary anymore. BTW the content after the `<h2>[4]` is not important. Thanks again!

