http://www.perlmonks.org?node_id=551093

jonnyfolk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm using the HTML::TokeParser::Simple module (heartfelt thanks, ovid!) to extract text from various web pages:

use HTML::TokeParser::Simple;

my @html_docs = ( '/home/site/www/content.html' );
my $lookup    = '...';    # search phrase

foreach my $doc ( @html_docs ) {
    my $p = HTML::TokeParser::Simple->new( file => $doc );
    while ( my $token = $p->get_token ) {
        next unless $token->is_text;
        my $line = $token->as_is;
        if ( $line =~ /$lookup/ ) {
            print qq~$line\n~;
        }
    }
}

This extracts the text for me, but I would like to print certain tags as text, so that a paragraph containing an underline, for example, appears as a single line rather than as two or three separate lines of text: "this item <u>contains an</u> underline."

I suspect that I would need to use a call to: if ( $token->is_tag ) { ... } but I'm not sure what the requirement would be.

Any tips would be gratefully received.

Replies are listed 'Best First'.
Re: Get HTML::TokeParser::Simple to interpret some tags as text
by bart (Canon) on May 23, 2006 at 08:27 UTC
    Untested:
    my %allowed = ( u => 1, i => 1, b => 1 );
    my $text = "";
    while ( my $token = $p->get_token ) {
        if ( $token->is_text ) {
            $text .= $token->as_is;
        }
        elsif ( $token->is_tag && $allowed{ $token->get_tag } ) {
            $text .= $token->as_is;
        }
    }
    Now you can filter the text in $text, for example:
    my @match = grep /$lookup/, split /\n/, $text;
    print "$_\n" foreach @match;
    There is, of course, the chance that you'll match on the tags now.
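
One possible way around matching on the tags, sketched here under some assumptions: keep the allowed tags in the output line, but test the search phrase against a copy of the line with the tags removed. The sample HTML, the search phrase and the helper name matching_lines are hypothetical, and the tag-stripping regex is deliberately naive; it only has to cope with the simple tags let through by %allowed.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample input; the thread reads from a file instead.
my $html = <<'HTML';
<p>this item <u>contains an</u> underline.</p>
<p>the <b>unrelated</b> line</p>
HTML

my $lookup  = 'underline';                   # hypothetical search phrase
my %allowed = ( u => 1, i => 1, b => 1 );    # tags to keep as literal text

# Hypothetical helper: returns the lines whose visible text matches $lookup,
# with the allowed tags left in place.
sub matching_lines {
    my ( $html, $lookup, $allowed ) = @_;
    my $p = HTML::TokeParser::Simple->new( string => $html );

    # Collect text plus any allowed start/end tags, as in the reply above.
    my $text = '';
    while ( my $token = $p->get_token ) {
        if ( $token->is_text ) {
            $text .= $token->as_is;
        }
        elsif ( $token->is_tag && $allowed->{ $token->get_tag } ) {
            $text .= $token->as_is;
        }
    }

    my @match;
    for my $line ( split /\n/, $text ) {
        # Naive tag removal on a copy; the copy is used only for matching.
        ( my $plain = $line ) =~ s/<[^>]*>//g;
        push @match, $line if $plain =~ /$lookup/;
    }
    return @match;
}

print "$_\n" for matching_lines( $html, $lookup, \%allowed );
```

With the sample input this should print the single line "this item <u>contains an</u> underline.", while a search phrase such as "b" finds nothing, because the <b> tag is not part of the tag-free copy.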
      OK, I get the idea now. Thanks bart - much appreciated.