
Re: Extracting HTML content between the h tags

by flexvault (Monsignor)
on Aug 05, 2012 at 13:39 UTC

in reply to Extracting HTML content between the h tags

vagabonding electron,

Just another way to look at the problem:

#!/usr/bin/perl
use strict;
use warnings;

my $start   = 0;     # set to 1 once the bodyContent div has been seen
my $h2      = 0;     # 1 while an <h2> is open and its </h2> has not arrived
my $keyword = "";    # text collected between <h2> and </h2>
my $end     = 0;

while ( my $Content = <DATA> ) {
    chomp( $Content );
    my $content = lc( $Content );    # lowercase copy for case-insensitive searches
    if ( $start == 0 ) {
        # $content is lowercased, so the search string must be lowercase too.
        # index returns -1 when nothing is found, so test for >= 0.
        $start = 1 if index( $content, '<div id="bodycontent">' ) >= 0;
    }
    else {
        if ( $h2 == 0 ) {
            my $h = index( $content, '<h2>' );
            if ( $h >= 0 ) {
                $h2++;
                $keyword = substr( $Content, $h + 4 );
                my $tmp = lc( $keyword );
                $end = index( $tmp, '</h2>' );
                if ( $end >= 0 ) {    # heading opens and closes on this line
                    $keyword = substr( $keyword, 0, $end );
                    print "$keyword\n\n";
                    $h2      = 0;
                    $keyword = "";
                }
            }
        }
        else {
            $end = index( $content, '</h2>' );
            if ( $end >= 0 ) {        # closing tag found on a later line
                $keyword .= substr( $Content, 0, $end );
                print "$keyword\n\n";
                $h2      = 0;
                $keyword = "";
            }
            else {
                $keyword .= $Content;    # still inside the heading
            }
        }
    }
}

__DATA__
<head>
</head>
<body>
<div id="bodyContent">
<!-- start content -->
<p>Key words: Some words.
</p><p>Date: 2012-01-16
</p><p>Actualised: 2008-01-08
</p><p>Commented: 05.06.2007
</p><p>Encoded: Some code.
</p>
<h2> <span class="mw-headline" id="Problem"> Problem </span></h2>
<p>Problem description.
</p><p>Another description.
</p>
<h2> <span class="mw-headline" id="Solution1"> Solution 1 </span></h2>
<p>Solution description.
</p>
<h2> <span class="mw-headline" id="Solution2"> Solution 2 </span></h2>
<p>Solution description.
</p>
<h2> <span class="mw-headline" id="Comment"> Comment. </span>

</h2>
<p>Text of the comment.
</p><p><br />
</p>
</div>
<hr />
</body>

When working with HTML, you can't be sure that an opening tag and its closing tag sit on the same line. Notice that I put some line breaks before the last </h2> in the sample data, just to make sure the code can handle a heading that spans multiple lines.

Another approach would be to process the HTML and extract each <h2>...</h2> into an array, and then process the array to strip span, font, and other tags once each element holds its complete content.
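A minimal sketch of that array approach might look like the following. The subroutine name extract_h2_headings and the sample markup are made up for illustration, and the regular expressions assume reasonably tidy HTML with no nested <h2> elements; this is not a substitute for a real HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the code above): return the text of every
# <h2>...</h2> element in $html, with the inner tags removed.
sub extract_h2_headings {
    my ($html) = @_;

    # Step 1: collect the raw content of each <h2> element into an array.
    # /s lets . match newlines, /i ignores tag case, /g finds every match.
    my @headings = $html =~ m{<h2\b[^>]*>(.*?)</h2>}sig;

    # Step 2: clean each element now that it is complete.
    for my $heading (@headings) {
        $heading =~ s/<[^>]+>//g;    # drop span, font, and other tags
        $heading =~ s/\s+/ /g;       # collapse runs of whitespace
        $heading =~ s/^ //;          # trim the leading space, if any
        $heading =~ s/ $//;          # trim the trailing space, if any
    }
    return @headings;
}

# Sample input (an assumption, modelled on the data in this thread).
my $html = <<'HTML';
<div id="bodyContent">
<h2> <span class="mw-headline" id="Problem"> Problem </span></h2>
<p>Problem description.
</p>
<h2> <span class="mw-headline" id="Comment">
Comment.
</span>
</h2>
</div>
HTML

print "$_\n" for extract_h2_headings($html);
```

Because the whole document is held in one string, a heading that is split across lines needs no special handling; the cost is that the entire file has to fit in memory.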

Good Luck!

"Well done is better than well said." - Benjamin Franklin

Re^2: Extracting HTML content between the h tags
by vagabonding electron (Chaplain) on Aug 05, 2012 at 14:34 UTC

    Thank you very much!
    Since I have read so often that one should not parse HTML without a module, I had not tried this before :-)
    I will certainly check this approach out.
    I think it could be difficult if the last hr-tag is missing (described in Re^3: Extracting HTML content between the h tags ).
    Thanks again!
      vagabonding electron,

      For the missing hr-tag, just test for $keyword after the 'while' loop:

      if ( $keyword ) {    ## Nearly the same as if ( $keyword ne "" ); a lone "0" is also false
          print "$keyword\n";
      }
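      To show where that check sits, here is a minimal sketch. The subroutine collect_h2 is hypothetical (it is not part of the code above), it takes the lines as a list instead of reading from DATA, and it handles at most one heading per line. The sample input deliberately leaves the last heading unclosed.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: collect the text after each <h2>,
# up to the matching </h2>, from a list of lines.
sub collect_h2 {
    my @lines = @_;
    my ( $open, $keyword, @found ) = ( 0, "" );

    for my $line (@lines) {
        my $rest = $line;                  # part of the line still to scan
        if ( !$open ) {
            my $h = index( lc($line), '<h2>' );
            next if $h < 0;                # no heading starts on this line
            $open = 1;
            $rest = substr( $line, $h + 4 );
        }
        my $end = index( lc($rest), '</h2>' );
        if ( $end >= 0 ) {                 # heading is complete
            push @found, $keyword . substr( $rest, 0, $end );
            ( $open, $keyword ) = ( 0, "" );
        }
        else {
            $keyword .= $rest;             # keep buffering
        }
    }

    # The check from this reply: anything still buffered here belongs to
    # an <h2> whose closing tag never arrived.
    push @found, $keyword if $keyword ne "";
    return @found;
}

print "$_\n" for collect_h2( '<h2>Problem</h2>', '<h2>Comment', ' continued' );
```

      Without the final push, the unclosed "Comment continued" heading would be silently dropped when the input runs out.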

      Whenever I start a new project/gig, I try to think whether it is similar to something I've done before, and if it is, I use that code or technique as the starting point. If it is totally new (very rare), I still have a bag of tricks ( subroutines ) that I copy ( use ... ) into the new work. Look at everything you do today as something you may be able to use for the rest of your programming life.

      You're lucky to have Perl. A lot of the code I wrote before Perl is worthless today, but the knowledge and techniques can still be applied to Perl!

      Good Luck!

      "Well done is better than well said." - Benjamin Franklin
