Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Extracting HTML content between the h tags

by vagabonding electron (Curate)
on Aug 05, 2012 at 11:41 UTC ( [id://985516]=perlquestion: print w/replies, xml ) Need Help??

vagabonding electron has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I parse a certain amount of HTML pages ( > 400 ) which have a structure as shown in the <DATA> part of the script below.
The relevant part of the page begins with <div id="bodyContent"> so that I put this part only in the script.
What I need is the text between the certain <h2>-tags.
I used HTML:TreeBuilder:XPath but I did not find how I could formulate an intersection there (e.g. following of <h2>[1] and preceding of <h2>[2] at the same time).
As a workaround I take the preceding-sibling in sequence of <h2>[i] tags, stringify the output and use substr to subtract the preceding chunks of text.
This works (after some clean up) but the code looks no fun to me.
Please give me a hint how I could make it better.
Thank you!
VE
#!/usr/bin/perl use strict; use warnings; use LWP::Simple; use HTML::TreeBuilder::XPath; my $page; $page .= $_ while <DATA>; my $p = HTML::TreeBuilder::XPath->new_from_content( $page ); my @page_content =$p->findnodes( '//div[@id="bodyContent"]' ); for my $content ( @page_content ) { my @preface = $content->findvalues( './h2[1]/preceding-sibling::*' + ); my $preface_text; my ( $keyword, $actualised ); for my $pref ( @preface ) { # $pref =~ s/^\s*(\S+)/$1/; $preface_text .= $pref; # print $preface_text, "--\n"; ( undef, $keyword ) = split /:\s*?/, $pref, 2 if $pref =~ +/^\s*?Key words/; ( undef, $actualised ) = split /:\s*?/, $pref, 2 if $pref +=~ /^Actualised/; } print $keyword, "\n"; print $actualised, "\n"; my @problems = $content->findvalues( './h2[2]/preceding-siblin +g::*' ); my $probl; $probl .= $_ for @problems; $probl = substr( $probl, length( $preface_text) ); print $probl, "\n"; my @solution_1 = $content->findvalues( './h2[3]/preceding-sibl +ing::*' ); my $sol; $sol .= $_ for @solution_1; $sol = substr( $sol, length( $preface_text ) + length( $probl +) ); print $sol, "\n"; my @solution_2 = $content->findvalues( './h2[4]/preceding-sibl +ing::*' ); my $sol_2; $sol_2 .= $_ for @solution_2; $sol_2 = substr( $sol_2, length( $preface_text ) + length( $pr +obl ) + length( $sol ) ); print $sol_2 , "\n"; } __DATA__ <head> </head> <body> <div id="bodyContent"> <!-- start content --> <p>Key words: Some words. </p><p>Date: 2012-01-16 </p><p>Actualised: 2008-01-08 </p><p>Commented: 05.06.2007 </p><p>Encoded: Some code. </p> <h2> <span class="mw-headline" id="Problem"> Problem </span></h2> <p>Problem description. </p><p>Another description. </p> <h2> <span class="mw-headline" id="Solution1"> Solution 1 </span></h2> <p>Solution description. </p> <h2> <span class="mw-headline" id="Solution2"> Solution 2 </span></h2> <p>Solution description. </p> <h2> <span class="mw-headline" id="Comment"> Comment. </span></h2> <p>Text of the comment. </p><p><br /> </p> </div> <hr /> </body>

Replies are listed 'Best First'.
Re: Extracting HTML content between the h tags
by Gangabass (Vicar) on Aug 05, 2012 at 12:55 UTC
    Something like this (just for first ps):
    my @nodes = $p->findnodes('//h2[2]/preceding-sibling::p[preceding-sibl +ing::h2[1]]');
      Thank you a lot! I did not know this syntax.
      One more question if I dare :-)
      In about 10 pages the last h2-tag is missing, so that I used the following workaround:
      my @solution_2 = $content->findvalues( './h2[4]/preceding-sibling::*' +); unless ( @solution_2 ) { @solution_2 = $content->findvalues( '//hr/preceding-sibling::*' ); }
      I tried the same with your syntax as:
      @solution_2 = $content->findvalues( '//hr/preceding-sibling::p[precedi +ng-sibling::h2[3]]' );
      but I get an uninitialized value only.
      I understood the syntax so: "search the siblings but stop if the tag in brackets appears". Is this correct? If so, what am I doing false with the above attempt?
      Spasibo!
        According to your HTML preceding-sibling for hr will be div tag but not p tag... So this code will find all ps after last h2:
        $p->findnodes('//h2[4]/following-sibling::p');
        Or (more flexible):
        $p->findnodes('//h2[last()]/following-sibling::p');
Re: Extracting HTML content between the h tags
by flexvault (Monsignor) on Aug 05, 2012 at 13:39 UTC

    vagabonding electron,

    Just another way to look at the problem:

    #!/usr/bin/perl use strict; use warnings; my $start = 0; my $h2 = 0; my $keyword = ""; my $end = 0; while ( my $Content = <DATA> ) { chomp( $Content ); my $content = lc( $Content ); if ( $start == 0 ) { $start = index ( $content, '<div id="bodyContent">' ); } else { if ( $h2 == 0 ) { my $h = index ( $content, '<h2>' ); if ( $h >= 0 ) { $h2++; $keyword = substr( $Content, $h+4 ); my $tmp = + lc ( $keyword ); $end = index ( $tmp, '</h2>' ); if ( $end >= 0 ) { $keyword = substr( $keyword, 0, $end ); print "$keyword\n\n"; $h2 = 0; $keyword = ""; } } } else { $end = index ( $content, '</h2>' ); if ( $end >= 0 ) { $keyword .= substr( $Content, 0, $end ); print "$keyword\n\n"; $h2 = 0; $keyword = ""; } else { $keyword .= $Content; } } } } __DATA__ <head> </head> <body> <div id="bodyContent"> <!-- start content --> <p>Key words: Some words. </p><p>Date: 2012-01-16 </p><p>Actualised: 2008-01-08 </p><p>Commented: 05.06.2007 </p><p>Encoded: Some code. </p> <h2> <span class="mw-headline" id="Problem"> Problem </span></h2> <p>Problem description. </p><p>Another description. </p> <h2> <span class="mw-headline" id="Solution1"> Solution 1 </span></h2> <p>Solution description. </p> <h2> <span class="mw-headline" id="Solution2"> Solution 2 </span></h2> <p>Solution description. </p> <h2> <span class="mw-headline" id="Comment"> Comment. </span> </h2> <p>Text of the comment. </p><p><br /> </p> </div> <hr /> </body>

    When working with HTML, you can't be sure that everything is lined up correctly. Notice I put some end-of lines before the last </h2> just to make sure it could handle multiple lines.

    Another approach would be to process the HTML and extract the <h2>...</h2> into an array and then process the array to eliminate span, fonts, etc after you have complete information in each element of the array.

    Good Luck!

    "Well done is better than well said." - Benjamin Franklin

      flexvault

      Thank you very much!
      Since I have read a lot that one should not parse a HTML without a module I did not try this before either :-)
      I will certainly check this approach out.
      I think it could be difficult in case that the last hr-tag is missing (described in Re^3: Extracting HTML content between the h tags ).
      Thanks again!
        p>vagabonding electron,

        For the missing hr-tag, just test for $keyword after the 'while' loop:

        if ( $keyword ) ## Same as if ( $keyword ne "" ) { print "$keyword\n"; }

        Whenever I start a new project/gig, I try to think whether this is similar to something I've done before, and if it is, then I use that code or technique as the starting point. If it is totally new (very rare), I still have a bag of tricks ( subroutines ) that I copy ( use ... ) into the new work. Look at every thing you do today as something you may be able to use for the rest of your programming life.

        You're lucky to have Perl, since a lot of the code I did before Perl is worthless today, but knowledge and techniques can be applied to Perl!

        Good Luck!

        "Well done is better than well said." - Benjamin Franklin

Re: Extracting HTML content between the h tags
by Anonymous Monk on Aug 05, 2012 at 12:37 UTC

    perl htmltreexpather.pl fudge.html

    ------------------------------------------------------------------ HTML::Element=HASH(0xb338e4) 0.1.0.0 Key words: Some words. /html/body/div/p //div[@id='bodyContent']/p //div[@id='bodyContent']/p ------------------------------------------------------------------ HTML::Element=HASH(0xb33944) 0.1.0.1 Date: 2012-01-16 /html/body/div/p[2] //div[@id='bodyContent']/p[2] //div[@id='bodyContent']/p[2] ------------------------------------------------------------------ HTML::Element=HASH(0xb339b4) 0.1.0.2 Actualised: 2008-01-08 /html/body/div/p[3] //div[@id='bodyContent']/p[3] //div[@id='bodyContent']/p[3] ------------------------------------------------------------------ HTML::Element=HASH(0xb33a24) 0.1.0.3 Commented: 05.06.2007 /html/body/div/p[4] //div[@id='bodyContent']/p[4] //div[@id='bodyContent']/p[4] ------------------------------------------------------------------ HTML::Element=HASH(0xb33a94) 0.1.0.4 Encoded: Some code. /html/body/div/p[5] //div[@id='bodyContent']/p[5] //div[@id='bodyContent']/p[5] ------------------------------------------------------------------ HTML::Element=HASH(0xb33b14) 0.1.0.5 Problem /html/body/div/h2 //div[@id='bodyContent']/h2 //div[@id='bodyContent']/h2 ------------------------------------------------------------------ HTML::Element=HASH(0xb33bd4) 0.1.0.5.0 Problem /html/body/div/h2/span //span[@id='Problem'] //span[@id='Problem'] ------------------------------------------------------------------ HTML::Element=HASH(0xb33bc4) 0.1.0.6 Problem description. /html/body/div/p[6] //div[@id='bodyContent']/p[6] //div[@id='bodyContent']/p[6] ------------------------------------------------------------------ HTML::Element=HASH(0xb33c64) 0.1.0.7 Another description. /html/body/div/p[7] //div[@id='bodyContent']/p[7] //div[@id='bodyContent']/p[7] ------------------------------------------------------------------ HTML::Element=HASH(0xb33ce4) 0.1.0.8 Solution 1 /html/body/div/h2[2] //div[@id='bodyContent']/h2[2] //div[@id='bodyContent']/h2[2] ------------------------------------------------------------------ HTML::Element=HASH(0xb33da4) 0.1.0.8.0 Solution 1 /html/body/div/h2[2]/span //span[@id='Solution1'] //span[@id='Solution1'] ------------------------------------------------------------------ HTML::Element=HASH(0xb33d94) 0.1.0.9 Solution description. /html/body/div/p[8] //div[@id='bodyContent']/p[8] //div[@id='bodyContent']/p[8] ------------------------------------------------------------------ HTML::Element=HASH(0xb33e44) 0.1.0.10 Solution 2 /html/body/div/h2[3] //div[@id='bodyContent']/h2[3] //div[@id='bodyContent']/h2[3] ------------------------------------------------------------------ HTML::Element=HASH(0xb33f04) 0.1.0.10.0 Solution 2 /html/body/div/h2[3]/span //span[@id='Solution2'] //span[@id='Solution2'] ------------------------------------------------------------------ HTML::Element=HASH(0xb33f44) 0.1.0.11 Solution description. /html/body/div/p[9] //div[@id='bodyContent']/p[9] //div[@id='bodyContent']/p[9] ------------------------------------------------------------------ HTML::Element=HASH(0xb33fb4) 0.1.0.12 Comment. /html/body/div/h2[4] //div[@id='bodyContent']/h2[4] //div[@id='bodyContent']/h2[4] ------------------------------------------------------------------ HTML::Element=HASH(0xb34074) 0.1.0.12.0 Comment. /html/body/div/h2[4]/span //span[@id='Comment'] //span[@id='Comment'] ------------------------------------------------------------------ HTML::Element=HASH(0xb34064) 0.1.0.13 Text of the comment. /html/body/div/p[10] //div[@id='bodyContent']/p[10] //div[@id='bodyContent']/p[10] ------------------------------------------------------------------ ##################################################################

    Hmm, so I would use the stack approach, ie *find*

    q{ //div[@id='bodyContent']/* }

    everything before first h2 tag is key/value pairs

    after that , each h2 tag is the key , and the non-h2 tags that follow are the value

    #!/usr/bin/perl -- use strict; use warnings; use HTML::TreeBuilder::XPath; my $page = q{<html> <head></head> <body> <div id="bodyContent"> <!-- start content --> <p>Key words: Some words. </p> <p>Date: 2012-01-16 </p> <p>Actualised: 2008-01-08 </p> <p>Commented: 05.06.2007 </p> <p>Encoded: Some code. </p> <h2> <span class="mw-headline" id="Problem"> Problem </span></h2 +> <p>Problem description. </p> <p>Another description. </p> <h2> <span class="mw-headline" id="Solution1"> Solution 1 </span +></h2> <p>Solution description. </p> <h2> <span class="mw-headline" id="Solution2"> Solution 2 </span +></h2> <p>Solution description. </p> <h2> <span class="mw-headline" id="Comment"> Comment. </span></h +2> <p>Text of the comment. </p> <p> <br/> </p> </div> <hr/> </body> </html>}; my $p = HTML::TreeBuilder::XPath->new_from_content( $page ); { my @nodes = $p->findnodes( q{//div[@id='bodyContent']/*}); use List::AllUtils qw( before ); my @before_h2 = before { $_->tag eq 'h2' } @nodes; splice @nodes, 0, scalar( @before_h2 ); my %body = map { split ':', $_->as_trimmed_text, 2 } @before_h2; while( @nodes ){ my $key = shift(@nodes)->as_trimmed_text; while( @nodes and $nodes[0]->tag ne 'h2' ){ my $val = shift(@nodes)->as_trimmed_text; $body{ $key } .= $val; } } use Data::Dump; dd\%body; } __END__ { "Actualised" => " 2008-01-08", "Comment." => "Text of the comment.", "Commented" => " 05.06.2007", "Date" => " 2012-01-16", "Encoded" => " Some code.", "Key words" => " Some words.", "Problem" => "Problem description.Another description.", "Solution 1" => "Solution description.", "Solution 2" => "Solution description.", }
      The   "Comment." key stuck out, so a better idea might be to use the @id attribute as key
      my $key = shift(@nodes)->findvalue('*[@id]/@id');
        Thank you very much!
        Just tried the both approaches, it works even if the last h2-tag is missing ( appears in about 10 pages from > 400, for which I used the following workaround:
        my @solution_2 = $content->findvalues( './h2[4]/preceding-sibling::*' +); unless ( @solution_2 ) { @solution_2 = $content->findvalues( '//hr/preceding-sibling::*' ); }
        ... with substr as before ...
        Fortunately they have only one hr-tag in the page :-)
        With your approach it is not necessary anymore.
        BTW the content after the <h2>[4] is not important.
        Thanks again!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://985516]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (5)
As of 2024-04-18 20:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found