Re: Extracting HTML content between the h tags

perl htmltreexpather.pl fudge.html

------------------------------------------------------------------
HTML::Element=HASH(0xb338e4)    0.1.0.0
Key words: Some words.
/html/body/div/p
//div[@id='bodyContent']/p
//div[@id='bodyContent']/p
------------------------------------------------------------------
HTML::Element=HASH(0xb33944)    0.1.0.1
Date: 2012-01-16
/html/body/div/p[2]
//div[@id='bodyContent']/p[2]
//div[@id='bodyContent']/p[2]
------------------------------------------------------------------
HTML::Element=HASH(0xb339b4)    0.1.0.2
Actualised: 2008-01-08
/html/body/div/p[3]
//div[@id='bodyContent']/p[3]
//div[@id='bodyContent']/p[3]
------------------------------------------------------------------
HTML::Element=HASH(0xb33a24)    0.1.0.3
Commented: 05.06.2007
/html/body/div/p[4]
//div[@id='bodyContent']/p[4]
//div[@id='bodyContent']/p[4]
------------------------------------------------------------------
HTML::Element=HASH(0xb33a94)    0.1.0.4
Encoded: Some code.
/html/body/div/p[5]
//div[@id='bodyContent']/p[5]
//div[@id='bodyContent']/p[5]
------------------------------------------------------------------
HTML::Element=HASH(0xb33b14)    0.1.0.5
Problem
/html/body/div/h2
//div[@id='bodyContent']/h2
//div[@id='bodyContent']/h2
------------------------------------------------------------------
HTML::Element=HASH(0xb33bd4)    0.1.0.5.0
Problem
/html/body/div/h2/span
//span[@id='Problem']
//span[@id='Problem']
------------------------------------------------------------------
HTML::Element=HASH(0xb33bc4)    0.1.0.6
Problem description.
/html/body/div/p[6]
//div[@id='bodyContent']/p[6]
//div[@id='bodyContent']/p[6]
------------------------------------------------------------------
HTML::Element=HASH(0xb33c64)    0.1.0.7
Another description.
/html/body/div/p[7]
//div[@id='bodyContent']/p[7]
//div[@id='bodyContent']/p[7]
------------------------------------------------------------------
HTML::Element=HASH(0xb33ce4)    0.1.0.8
Solution 1
/html/body/div/h2[2]
//div[@id='bodyContent']/h2[2]
//div[@id='bodyContent']/h2[2]
------------------------------------------------------------------
HTML::Element=HASH(0xb33da4)    0.1.0.8.0
Solution 1
/html/body/div/h2[2]/span
//span[@id='Solution1']
//span[@id='Solution1']
------------------------------------------------------------------
HTML::Element=HASH(0xb33d94)    0.1.0.9
Solution description.
/html/body/div/p[8]
//div[@id='bodyContent']/p[8]
//div[@id='bodyContent']/p[8]
------------------------------------------------------------------
HTML::Element=HASH(0xb33e44)    0.1.0.10
Solution 2
/html/body/div/h2[3]
//div[@id='bodyContent']/h2[3]
//div[@id='bodyContent']/h2[3]
------------------------------------------------------------------
HTML::Element=HASH(0xb33f04)    0.1.0.10.0
Solution 2
/html/body/div/h2[3]/span
//span[@id='Solution2']
//span[@id='Solution2']
------------------------------------------------------------------
HTML::Element=HASH(0xb33f44)    0.1.0.11
Solution description.
/html/body/div/p[9]
//div[@id='bodyContent']/p[9]
//div[@id='bodyContent']/p[9]
------------------------------------------------------------------
HTML::Element=HASH(0xb33fb4)    0.1.0.12
Comment.
/html/body/div/h2[4]
//div[@id='bodyContent']/h2[4]
//div[@id='bodyContent']/h2[4]
------------------------------------------------------------------
HTML::Element=HASH(0xb34074)    0.1.0.12.0
Comment.
/html/body/div/h2[4]/span
//span[@id='Comment']
//span[@id='Comment']
------------------------------------------------------------------
HTML::Element=HASH(0xb34064)    0.1.0.13
Text of the comment.
/html/body/div/p[10]
//div[@id='bodyContent']/p[10]
//div[@id='bodyContent']/p[10]
------------------------------------------------------------------
##################################################################

Hmm, so I would use the stack approach, i.e. *find* every child element of the content div:

q{
//div[@id='bodyContent']/*
}

Everything before the first h2 tag is a key/value pair.

After that, each h2 tag is a key, and the non-h2 tags that follow it are its value.
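The grouping described above can be tried without any HTML parsing. The following is only a sketch on plain data: the `[ tag, text ]` pairs and the `group_by_heading` name are made up for illustration and stand in for the real child nodes of the div.

```perl
#!/usr/bin/perl --
use strict; use warnings;

# Hypothetical stand-in for the children of div#bodyContent:
# each item is [ tag, trimmed text ].
my @nodes = (
    [ p  => 'Date: 2012-01-16' ],
    [ h2 => 'Problem' ],
    [ p  => 'Problem description.' ],
    [ p  => 'Another description.' ],
    [ h2 => 'Comment.' ],
    [ p  => 'Text of the comment.' ],
);

sub group_by_heading {
    my @queue = @_;
    my %body;

    # Phase 1: items before the first h2 are "Key: value" lines.
    while ( @queue and $queue[0][0] ne 'h2' ) {
        my ( $key, $val ) = split /:/, shift(@queue)->[1], 2;
        $body{$key} = $val;
    }

    # Phase 2: each h2 is a key; the non-h2 items after it form its value.
    while (@queue) {
        my $key = shift(@queue)->[1];
        $body{$key} = '';
        while ( @queue and $queue[0][0] ne 'h2' ) {
            $body{$key} .= shift(@queue)->[1];
        }
    }
    return \%body;
}

my $body = group_by_heading(@nodes);
print "$_ => $body->{$_}\n" for sort keys %$body;
```

Consuming the list from the front with shift means no index bookkeeping is needed: whatever is at the head of the queue is either the next key or part of the current value.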

#!/usr/bin/perl --
use strict; use warnings;
use HTML::TreeBuilder::XPath;

my $page = q{<html>
  <head></head>
  <body>
    <div id="bodyContent">
      <!-- start content -->
      <p>Key words: Some words.
</p>
      <p>Date:  2012-01-16
</p>
      <p>Actualised: 2008-01-08
</p>
      <p>Commented: 05.06.2007
</p>
      <p>Encoded: Some code.
</p>
      <h2> <span class="mw-headline" id="Problem"> Problem </span></h2>
      <p>Problem description.
</p>
      <p>Another description.
</p>
      <h2> <span class="mw-headline" id="Solution1"> Solution 1 </span></h2>
      <p>Solution description.
</p>
      <h2> <span class="mw-headline" id="Solution2"> Solution 2 </span></h2>
      <p>Solution description.
</p>
      <h2> <span class="mw-headline" id="Comment"> Comment. </span></h2>
      <p>Text of the comment.
</p>
      <p>
        <br/>
      </p>
    </div>
    <hr/>
  </body>
</html>};

my $p = HTML::TreeBuilder::XPath->new_from_content( $page );
{
    # Every element child of the content div, in document order.
    my @nodes = $p->findnodes( q{//div[@id='bodyContent']/*} );

    use List::AllUtils qw( before );
    # Elements ahead of the first h2 hold the "Key: value" paragraphs.
    my @before_h2 = before { $_->tag eq 'h2' } @nodes;
    splice @nodes, 0, scalar( @before_h2 );

    my %body = map { split ':', $_->as_trimmed_text, 2 } @before_h2;

    # Each h2 is a key; the non-h2 elements after it are concatenated as its value.
    while( @nodes ){
        my $key = shift(@nodes)->as_trimmed_text;

        while( @nodes and $nodes[0]->tag ne 'h2' ){
            my $val = shift(@nodes)->as_trimmed_text;
            $body{ $key } .= $val;
        }
    }
    use Data::Dump; dd \%body;
}

__END__
{
  "Actualised" => " 2008-01-08",
  "Comment."   => "Text of the comment.",
  "Commented"  => " 05.06.2007",
  "Date"       => " 2012-01-16",
  "Encoded"    => " Some code.",
  "Key words"  => " Some words.",
  "Problem"    => "Problem description.Another description.",
  "Solution 1" => "Solution description.",
  "Solution 2" => "Solution description.",
}
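
Note the leading space in values such as " 2008-01-08": the split is on the bare colon, so the blank after it stays in the value. If that is unwanted, one possible tweak (a sketch, not tested against the real page) is to let the split pattern swallow the surrounding whitespace. The sample lines below are copied from the page; in the script they would come from as_trimmed_text on each node in @before_h2.

```perl
#!/usr/bin/perl --
use strict; use warnings;

# Sample "Key: value" lines taken from the page above.
my @lines = ( 'Key words: Some words.', 'Date:  2012-01-16' );

my %meta = map {
    # Whitespace around the colon is part of the separator;
    # the limit of 2 keeps any later colons inside the value.
    my ( $key, $val ) = split /\s*:\s*/, $_, 2;
    ( $key => $val );
} @lines;

printf "%s => '%s'\n", $_, $meta{$_} for sort keys %meta;
```

A paragraph without any colon would give an undefined value here, so a real script may want to skip or flag such lines.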