PerlMonks  

Re: Extracting HTML content between the h tags

by Anonymous Monk
on Aug 05, 2012 at 12:37 UTC ( #985521=note )


in reply to Extracting HTML content between the h tags

perl htmltreexpather.pl fudge.html

------------------------------------------------------------------
HTML::Element=HASH(0xb338e4) 0.1.0.0
Key words: Some words.
/html/body/div/p
//div[@id='bodyContent']/p
//div[@id='bodyContent']/p
------------------------------------------------------------------
HTML::Element=HASH(0xb33944) 0.1.0.1
Date: 2012-01-16
/html/body/div/p[2]
//div[@id='bodyContent']/p[2]
//div[@id='bodyContent']/p[2]
------------------------------------------------------------------
HTML::Element=HASH(0xb339b4) 0.1.0.2
Actualised: 2008-01-08
/html/body/div/p[3]
//div[@id='bodyContent']/p[3]
//div[@id='bodyContent']/p[3]
------------------------------------------------------------------
HTML::Element=HASH(0xb33a24) 0.1.0.3
Commented: 05.06.2007
/html/body/div/p[4]
//div[@id='bodyContent']/p[4]
//div[@id='bodyContent']/p[4]
------------------------------------------------------------------
HTML::Element=HASH(0xb33a94) 0.1.0.4
Encoded: Some code.
/html/body/div/p[5]
//div[@id='bodyContent']/p[5]
//div[@id='bodyContent']/p[5]
------------------------------------------------------------------
HTML::Element=HASH(0xb33b14) 0.1.0.5
Problem
/html/body/div/h2
//div[@id='bodyContent']/h2
//div[@id='bodyContent']/h2
------------------------------------------------------------------
HTML::Element=HASH(0xb33bd4) 0.1.0.5.0
Problem
/html/body/div/h2/span
//span[@id='Problem']
//span[@id='Problem']
------------------------------------------------------------------
HTML::Element=HASH(0xb33bc4) 0.1.0.6
Problem description.
/html/body/div/p[6]
//div[@id='bodyContent']/p[6]
//div[@id='bodyContent']/p[6]
------------------------------------------------------------------
HTML::Element=HASH(0xb33c64) 0.1.0.7
Another description.
/html/body/div/p[7]
//div[@id='bodyContent']/p[7]
//div[@id='bodyContent']/p[7]
------------------------------------------------------------------
HTML::Element=HASH(0xb33ce4) 0.1.0.8
Solution 1
/html/body/div/h2[2]
//div[@id='bodyContent']/h2[2]
//div[@id='bodyContent']/h2[2]
------------------------------------------------------------------
HTML::Element=HASH(0xb33da4) 0.1.0.8.0
Solution 1
/html/body/div/h2[2]/span
//span[@id='Solution1']
//span[@id='Solution1']
------------------------------------------------------------------
HTML::Element=HASH(0xb33d94) 0.1.0.9
Solution description.
/html/body/div/p[8]
//div[@id='bodyContent']/p[8]
//div[@id='bodyContent']/p[8]
------------------------------------------------------------------
HTML::Element=HASH(0xb33e44) 0.1.0.10
Solution 2
/html/body/div/h2[3]
//div[@id='bodyContent']/h2[3]
//div[@id='bodyContent']/h2[3]
------------------------------------------------------------------
HTML::Element=HASH(0xb33f04) 0.1.0.10.0
Solution 2
/html/body/div/h2[3]/span
//span[@id='Solution2']
//span[@id='Solution2']
------------------------------------------------------------------
HTML::Element=HASH(0xb33f44) 0.1.0.11
Solution description.
/html/body/div/p[9]
//div[@id='bodyContent']/p[9]
//div[@id='bodyContent']/p[9]
------------------------------------------------------------------
HTML::Element=HASH(0xb33fb4) 0.1.0.12
Comment.
/html/body/div/h2[4]
//div[@id='bodyContent']/h2[4]
//div[@id='bodyContent']/h2[4]
------------------------------------------------------------------
HTML::Element=HASH(0xb34074) 0.1.0.12.0
Comment.
/html/body/div/h2[4]/span
//span[@id='Comment']
//span[@id='Comment']
------------------------------------------------------------------
HTML::Element=HASH(0xb34064) 0.1.0.13
Text of the comment.
/html/body/div/p[10]
//div[@id='bodyContent']/p[10]
//div[@id='bodyContent']/p[10]
------------------------------------------------------------------
##################################################################

Hmm, so I would use the stack approach, i.e. *find* all the element children of the content div with

q{ //div[@id='bodyContent']/* }

Everything before the first h2 tag is a key/value pair, split on the first colon.

After that, each h2 tag is a key, and the non-h2 tags that follow it are its value.

#!/usr/bin/perl --
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use List::AllUtils qw( before );
use Data::Dump;

my $page = q{<html>
<head></head>
<body>
<div id="bodyContent">
<!-- start content -->
<p>Key words: Some words. </p>
<p>Date: 2012-01-16 </p>
<p>Actualised: 2008-01-08 </p>
<p>Commented: 05.06.2007 </p>
<p>Encoded: Some code. </p>
<h2> <span class="mw-headline" id="Problem"> Problem </span></h2>
<p>Problem description. </p>
<p>Another description. </p>
<h2> <span class="mw-headline" id="Solution1"> Solution 1 </span></h2>
<p>Solution description. </p>
<h2> <span class="mw-headline" id="Solution2"> Solution 2 </span></h2>
<p>Solution description. </p>
<h2> <span class="mw-headline" id="Comment"> Comment. </span></h2>
<p>Text of the comment. </p>
<p> <br/> </p>
</div>
<hr/>
</body>
</html>};

my $p = HTML::TreeBuilder::XPath->new_from_content( $page );
{
    # All element children of the content div, in document order
    my @nodes = $p->findnodes( q{//div[@id='bodyContent']/*});

    # Everything before the first h2 is a "key: value" paragraph
    my @before_h2 = before { $_->tag eq 'h2' } @nodes;
    splice @nodes, 0, scalar( @before_h2 );
    my %body = map { split ':', $_->as_trimmed_text, 2 } @before_h2;

    # From here on, each h2 is a key; the non-h2 nodes after it are appended as its value
    while( @nodes ){
        my $key = shift(@nodes)->as_trimmed_text;
        while( @nodes and $nodes[0]->tag ne 'h2' ){
            my $val = shift(@nodes)->as_trimmed_text;
            $body{ $key } .= $val;
        }
    }
    dd\%body;
}
__END__
{
  "Actualised" => " 2008-01-08",
  "Comment."   => "Text of the comment.",
  "Commented"  => " 05.06.2007",
  "Date"       => " 2012-01-16",
  "Encoded"    => " Some code.",
  "Key words"  => " Some words.",
  "Problem"    => "Problem description.Another description.",
  "Solution 1" => "Solution description.",
  "Solution 2" => "Solution description.",
}
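The values of the colon-separated pairs in the dump keep the space that followed the colon (" 2008-01-08"). If that is unwanted, one possible tweak is to split on a pattern that swallows the whitespace around the first colon. The sketch below uses only core Perl; the @lines array is a hypothetical stand-in for the as_trimmed_text of the paragraphs before the first h2, not something from the script above.

```perl
#!/usr/bin/perl --
use strict;
use warnings;

# Hypothetical stand-in for the trimmed text of the <p> tags before the first h2
my @lines = (
    'Key words: Some words.',
    'Date: 2012-01-16',
    'Actualised: 2008-01-08',
    'no colon here',            # would break the key/value pairing if not filtered
);

# Keep only lines that contain a colon, then split on the FIRST colon
# (limit 2) and drop the whitespace on either side of it.
my %body = map { split /\s*:\s*/, $_, 2 } grep { /:/ } @lines;

printf "%-10s => '%s'\n", $_, $body{$_} for sort keys %body;
```

The grep is there because a line without a colon would hand map a single element and shift every later key and value by one.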


Re^2: Extracting HTML content between the h tags
by Anonymous Monk on Aug 05, 2012 at 12:53 UTC
    The "Comment." key stuck out, so a better idea might be to use the @id attribute as the key:
    my $key = shift(@nodes)->findvalue('*[@id]/@id');
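A minimal sketch of how that line could slot into the h2 loop, assuming HTML::TreeBuilder::XPath is installed. The cut-down page and the fallback to the heading text when no child carries an id are assumptions added for illustration, not part of the original suggestion.

```perl
#!/usr/bin/perl --
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical cut-down page; the second h2 deliberately has no id anywhere
my $page = q{<html><body><div id="bodyContent">
<h2><span class="mw-headline" id="Comment"> Comment. </span></h2>
<p>Text of the comment.</p>
<h2>Plain heading</h2>
<p>More text.</p>
</div></body></html>};

my $tree  = HTML::TreeBuilder::XPath->new_from_content( $page );
my @nodes = $tree->findnodes( q{//div[@id='bodyContent']/*} );

my %body;
while ( @nodes ) {
    my $h2 = shift @nodes;
    # Prefer the id of a child element; fall back to the heading text
    # when findvalue returns an empty string (assumed fallback).
    my $key = $h2->findvalue( '*[@id]/@id' ) || $h2->as_trimmed_text;
    while ( @nodes and $nodes[0]->tag ne 'h2' ) {
        $body{ $key } .= shift( @nodes )->as_trimmed_text;
    }
}
$tree->delete;    # free the parse tree

print "$_ => $body{$_}\n" for sort keys %body;
```

With the id as key, the trailing dot of "Comment." no longer ends up in the hash key.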
      Thank you very much!
      Just tried both approaches. It works even if the last h2 tag is missing, which happens in about 10 of more than 400 pages. For those I had been using the following workaround:
      my @solution_2 = $content->findvalues( './h2[4]/preceding-sibling::*' );
      unless ( @solution_2 ) {
          @solution_2 = $content->findvalues( '//hr/preceding-sibling::*' );
      }
      ... with substr as before ...
      Fortunately they have only one hr-tag in the page :-)
      With your approach it is not necessary anymore.
      BTW, the content after the fourth h2 (h2[4]) is not important.
      Thanks again!
