perlquestion
vagabonding electron
Dear Monks,<br>
I am parsing a set of HTML pages (more than 400) that share the structure shown in the <c><DATA></c> part of the script below.<br>The relevant part of each page begins with <c><div id="bodyContent"></c>, so I have included only that part in the script.<br>What I need is the text between consecutive <c><h2></c> tags.<br>I used HTML::TreeBuilder::XPath, but I could not find a way to express an intersection in XPath (e.g. the nodes that follow <c><h2>[1]</c> <b>and</b> precede <c><h2>[2]</c> at the same time).<br>As a workaround, I take the preceding siblings of each <c><h2>[i]</c> tag in sequence, stringify the output, and use substr to cut off the chunks of text already collected for the earlier sections.<br>This works (after some clean-up), but the code is not pretty.<br>Please give me a hint on how I could improve it.<br>Thank you!<br>VE
<c>
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;
my $page;
$page .= $_ while <DATA>;
my $p = HTML::TreeBuilder::XPath->new_from_content( $page );
my @page_content = $p->findnodes( '//div[@id="bodyContent"]' );
for my $content ( @page_content )
{
    # Everything before the first <h2> is the preface (key words, dates, ...).
    my @preface = $content->findvalues( './h2[1]/preceding-sibling::*' );
    my $preface_text = '';
    my ( $keyword, $actualised );
    for my $pref ( @preface )
    {
        $preface_text .= $pref;
        # Take the value after the first colon, without leading whitespace.
        ( undef, $keyword )    = split /:\s*/, $pref, 2 if $pref =~ /^\s*Key words/;
        ( undef, $actualised ) = split /:\s*/, $pref, 2 if $pref =~ /^\s*Actualised/;
    }
    print $keyword, "\n";
    print $actualised, "\n";

    # All siblings before the second <h2>; cut off the preface collected above.
    my @problems = $content->findvalues( './h2[2]/preceding-sibling::*' );
    my $probl = '';
    $probl .= $_ for @problems;
    $probl = substr( $probl, length( $preface_text ) );
    print $probl, "\n";

    # All siblings before the third <h2>; cut off preface and problem text.
    my @solution_1 = $content->findvalues( './h2[3]/preceding-sibling::*' );
    my $sol = '';
    $sol .= $_ for @solution_1;
    $sol = substr( $sol, length( $preface_text ) + length( $probl ) );
    print $sol, "\n";

    # All siblings before the fourth <h2>; cut off everything collected so far.
    my @solution_2 = $content->findvalues( './h2[4]/preceding-sibling::*' );
    my $sol_2 = '';
    $sol_2 .= $_ for @solution_2;
    $sol_2 = substr( $sol_2, length( $preface_text ) + length( $probl ) + length( $sol ) );
    print $sol_2, "\n";
}
__DATA__
<head>
</head>
<body>
<div id="bodyContent">
<!-- start content -->
<p>Key words: Some words.
</p><p>Date: 2012-01-16
</p><p>Actualised: 2008-01-08
</p><p>Commented: 05.06.2007
</p><p>Encoded: Some code.
</p>
<h2> <span class="mw-headline" id="Problem"> Problem </span></h2>
<p>Problem description.
</p><p>Another description.
</p>
<h2> <span class="mw-headline" id="Solution1"> Solution 1 </span></h2>
<p>Solution description.
</p>
<h2> <span class="mw-headline" id="Solution2"> Solution 2 </span></h2>
<p>Solution description.
</p>
<h2> <span class="mw-headline" id="Comment"> Comment. </span></h2>
<p>Text of the comment.
</p><p><br />
</p>
</div>
<hr />
</body>
</c>