Get HTML::TokeParser::Simple to interpret some tags as text

jonnyfolk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm using the HTML::TokeParser::Simple module (heartfelt thanks, ovid!) to extract text from various web pages:

use strict;
use warnings;
use HTML::TokeParser::Simple;

my @html_docs = ( '/home/site/www/content.html' );
my $lookup    = 'search phrase';    # placeholder: put the search phrase here
foreach my $doc ( @html_docs ) {
    my $p = HTML::TokeParser::Simple->new( file => $doc );
    while ( my $token = $p->get_token ) {
        next unless $token->is_text;    # skip everything that is not text
        my $line = $token->as_is;
        if ( $line =~ /$lookup/ ) {
            print qq~$line\n~;
        }
    }
}

This extracts the text for me, but what I would like to do is print certain tags as text, so that a paragraph containing an underline, for example, would appear as a single line instead of two or three separate lines of text: "this item <u>contains an</u> underline."

I suspect that I would need a check such as if ( $token->is_tag ) { ... }, but I'm not sure what would have to go inside it.

Any tips would be gratefully received.


Replies are listed 'Best First'.
Re: Get HTML::TokeParser::Simple to interpret some tags as text by bart (Canon) on May 23, 2006 at 08:27 UTC
Untested:

my %allowed = ( u => 1, i => 1, b => 1 );
my $text = "";
while ( my $token = $p->get_token ) {
    if ( $token->is_text ) {
        $text .= $token->as_is;
    }
    elsif ( $token->is_tag && $allowed{ $token->get_tag } ) {
        $text .= $token->as_is;
    }
}

Now you can filter your text in $text, for example like:

my @match = grep /$lookup/, split /\n/, $text;
print "$_\n" foreach @match;

There is, of course, the chance that you'll match on the tags now.
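
For anyone who wants to try the idea end to end, here is a minimal, self-contained sketch of the same whitelist approach. It has been put together for illustration only, so treat it with the same caution as the snippet above. The helper name text_with_inline_tags, the sample HTML and the sample search phrase are all made up for the example, and it assumes a version of HTML::TokeParser::Simple whose constructor accepts string => ... in addition to file => .... To reduce the chance of matching on the tags themselves, it matches against a copy of each line with the kept tags stripped by a deliberately crude regex.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not part of the module): returns the text of
# $html with only the listed tags left in place as literal markup.
sub text_with_inline_tags {
    my ( $html, @keep ) = @_;
    my %allowed = map { $_ => 1 } @keep;

    # Assumes the constructor accepts string => ...; use file => ... for files.
    my $p    = HTML::TokeParser::Simple->new( string => $html );
    my $text = '';
    while ( my $token = $p->get_token ) {
        if ( $token->is_text ) {
            $text .= $token->as_is;
        }
        elsif ( $token->is_tag && $allowed{ $token->get_tag } ) {
            $text .= $token->as_is;    # keep <u>, </u>, etc. verbatim
        }
    }
    return $text;
}

# Made-up sample input and search phrase.
my $html = qq{<p>this item <u>contains an</u> underline.</p>\n}
         . qq{<p>another <a href="#">line</a></p>\n};
my $lookup = qr/underline/;

for my $line ( split /\n/, text_with_inline_tags( $html, qw(u i b) ) ) {
    # Crude: drop the kept tags from a copy so the match only sees text.
    ( my $plain = $line ) =~ s/<[^>]+>//g;
    print "$line\n" if $plain =~ $lookup;
}
```

Matching on the stripped copy but printing the original line keeps the <u> markup in the output while stopping a search for, say, "u" from hitting the tag names. The regex is only safe here because the kept tags are simple inline ones; it is not a general HTML stripper.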
Re^2: Get HTML::TokeParser::Simple to interpret some tags as text by jonnyfolk (Vicar) on May 23, 2006 at 09:08 UTC
OK, I get the idea now. Thanks bart - much appreciated.
