Dear Monks,
I am parsing a number of HTML pages (more than 400) that all share the structure shown in the __DATA__ section of the script below.
The relevant part of each page begins with <div id="bodyContent">, so I have included only that part in the script.
What I need is the text between consecutive <h2> tags.
I used HTML::TreeBuilder::XPath, but I could not find a way to express an intersection in XPath (e.g. the nodes that follow <h2>[1] and at the same time precede <h2>[2]).
As a workaround I select the preceding siblings of each <h2>[i] tag in turn, stringify the result, and use substr to cut off the text that belongs to the earlier sections.
This works (after some clean-up), but the code is not pretty.
Could you give me a hint on how to improve it?
Thank you!
VE
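One possible way to express the "after <h2>[i] and before <h2>[i+1]" intersection in XPath 1.0 is to count the <h2> siblings that precede each paragraph: a <p> with exactly i preceding <h2> siblings belongs to section i. The sketch below is only an illustration on a reduced sample; the `trim` helper, the variable names, and the assumption that all section text sits in <p> children of the div are mine, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Reduced sample with the same shape as the real pages (assumed markup).
my $html = <<'HTML';
<div id="bodyContent">
<p>Key words: Some words.</p>
<h2><span class="mw-headline" id="Problem"> Problem </span></h2>
<p>Problem description.</p>
<p>Another description.</p>
<h2><span class="mw-headline" id="Solution1"> Solution 1 </span></h2>
<p>Solution description.</p>
</div>
HTML

# Hypothetical helper: strip leading and trailing whitespace.
sub trim { my ($s) = @_; $s =~ s/^\s+//; $s =~ s/\s+$//; return $s }

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
my ($div) = $tree->findnodes('//div[@id="bodyContent"]');

# Section 0 is the preface (no <h2> before it);
# section $i holds the paragraphs that follow the $i-th <h2>.
my @h2 = $div->findnodes('./h2');
my @sections;
for my $i ( 0 .. scalar @h2 ) {
    my $xpath = sprintf './p[count(preceding-sibling::h2) = %d]', $i;
    push @sections, join "\n", map { trim($_) } $div->findvalues($xpath);
}

print "--- section $_ ---\n$sections[$_]\n" for 0 .. $#sections;
$tree->delete;
```

Because each paragraph is selected by the number of headings before it, no text has to be subtracted with substr afterwards. If the sections may contain elements other than <p>, `./*[not(self::h2)][count(preceding-sibling::h2) = %d]` would be the analogous expression, though I have not checked it against the real pages.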
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;
# Slurp the sample page from the DATA section.
my $page;
$page .= $_ while <DATA>;
my $p = HTML::TreeBuilder::XPath->new_from_content( $page );
my @page_content = $p->findnodes( '//div[@id="bodyContent"]' );
for my $content ( @page_content )
{
    # Everything before the first <h2> is the preface.
    my @preface = $content->findvalues( './h2[1]/preceding-sibling::*' );
    my $preface_text;
    my ( $keyword, $actualised );
    for my $pref ( @preface )
    {
        # $pref =~ s/^\s*(\S+)/$1/;
        $preface_text .= $pref;
        # print $preface_text, "--\n";
        ( undef, $keyword )    = split /:\s*?/, $pref, 2 if $pref =~ /^\s*?Key words/;
        ( undef, $actualised ) = split /:\s*?/, $pref, 2 if $pref =~ /^Actualised/;
    }
    print $keyword, "\n";
    print $actualised, "\n";

    # Everything before <h2>[2], minus the preface, is the first section
    # (the text of <h2>[1] itself is included).
    my @problems = $content->findvalues( './h2[2]/preceding-sibling::*' );
    my $probl;
    $probl .= $_ for @problems;
    $probl = substr( $probl, length( $preface_text ) );
    print $probl, "\n";

    # Everything before <h2>[3], minus the preface and the first section.
    my @solution_1 = $content->findvalues( './h2[3]/preceding-sibling::*' );
    my $sol;
    $sol .= $_ for @solution_1;
    $sol = substr( $sol, length( $preface_text ) + length( $probl ) );
    print $sol, "\n";

    # Everything before <h2>[4], minus all earlier chunks.
    my @solution_2 = $content->findvalues( './h2[4]/preceding-sibling::*' );
    my $sol_2;
    $sol_2 .= $_ for @solution_2;
    $sol_2 = substr( $sol_2, length( $preface_text ) + length( $probl ) + length( $sol ) );
    print $sol_2, "\n";
}
__DATA__
<head>
</head>
<body>
<div id="bodyContent">
<!-- start content -->
<p>Key words: Some words.
</p><p>Date: 2012-01-16
</p><p>Actualised: 2008-01-08
</p><p>Commented: 05.06.2007
</p><p>Encoded: Some code.
</p>
<h2> <span class="mw-headline" id="Problem"> Problem </span></h2>
<p>Problem description.
</p><p>Another description.
</p>
<h2> <span class="mw-headline" id="Solution1"> Solution 1 </span></h2>
<p>Solution description.
</p>
<h2> <span class="mw-headline" id="Solution2"> Solution 2 </span></h2>
<p>Solution description.
</p>
<h2> <span class="mw-headline" id="Comment"> Comment. </span></h2>
<p>Text of the comment.
</p><p><br />
</p>
</div>
<hr />
</body>
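
For markup shaped like the sample above, an alternative that avoids XPath altogether is to walk the children of the div once and start a new bucket at every <h2>. This is a minimal sketch on a reduced copy of the sample, using plain HTML::TreeBuilder; the `%text` and `@order` names and the `preface` key are illustrative assumptions, and text that sits directly inside the div (outside any element) is deliberately skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Reduced copy of the sample page (assumed markup).
my $html = <<'HTML';
<div id="bodyContent">
<p>Key words: Some words.</p>
<p>Actualised: 2008-01-08</p>
<h2><span class="mw-headline" id="Problem"> Problem </span></h2>
<p>Problem description.</p>
<p>Another description.</p>
<h2><span class="mw-headline" id="Solution1"> Solution 1 </span></h2>
<p>Solution description.</p>
</div>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);
my $div  = $tree->look_down( _tag => 'div', id => 'bodyContent' );

my @order = ('preface');        # section titles in document order
my %text  = ( preface => '' );  # section title => accumulated text

for my $child ( $div->content_list ) {
    next unless ref $child;     # skip bare text nodes between elements
    if ( $child->tag eq 'h2' ) {
        # A heading opens a new section named after its text.
        my $title = $child->as_trimmed_text;
        push @order, $title;
        $text{$title} = '';
    }
    else {
        # Any other element belongs to the most recently opened section.
        $text{ $order[-1] } .= $child->as_trimmed_text . "\n";
    }
}

print "== $_ ==\n$text{$_}" for @order;
$tree->delete;
```

Keying the sections by heading text means pages with a different number of <h2> sections need no extra code, at the cost of assuming that heading titles are unique within a page.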