jonnyfolk has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
I'm using the HTML::TokeParser::Simple module (heartfelt thanks, ovid!) to extract text from various web pages:
    use HTML::TokeParser::Simple;

    my @html_docs = ( '/home/site/www/content.html' );
    my $lookup = #search phrase

    foreach my $doc ( @html_docs ) {
        my $p = HTML::TokeParser::Simple->new( file => $doc );
        while ( my $token = $p->get_token ) {
            next unless $token->is_text;
            my $line = $token->as_is;
            if ($line =~ /$lookup/) {
                print qq~$line\n~;
            }
        }
    }

This extracts the text for me, but what I would like to do is print certain tags as text, so that a paragraph containing an underline, for example, would appear as a single line rather than as two or three lines of text: "this item <u>contains an</u> underline."
I suspect that I would need to use a call to: if ( $token->is_tag ) { ... } but I'm not sure what the requirement would be.
Any tips would be gratefully received.
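One possible approach to the question above, offered as a sketch rather than the thread's accepted answer: instead of skipping every non-text token, append text tokens and a whitelist of inline tags to a buffer, and treat any other tag as the end of a line. The `%keep` whitelist, the `extract_lines` helper, the sample HTML and the search phrase are all hypothetical names and values chosen for illustration. The sketch also assumes a version of HTML::TokeParser::Simple that accepts `string => $html` in its constructor.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical whitelist: inline tags to keep verbatim in the output.
my %keep = map { $_ => 1 } qw( u b i em strong );

# Returns one string per run of text. Whitelisted tags are kept as-is;
# any other tag ends the current run.
sub extract_lines {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new( string => $html );
    my @lines;
    my $buffer = '';

    # Collapse whitespace, store the buffer if non-empty, then reset it.
    my $flush = sub {
        $buffer =~ s/\s+/ /g;
        $buffer =~ s/^ | $//g;
        push @lines, $buffer if length $buffer;
        $buffer = '';
    };

    while ( my $token = $p->get_token ) {
        if ( $token->is_text ) {
            $buffer .= $token->as_is;
        }
        elsif ( $token->is_tag ) {
            # Strip a leading slash defensively, in case an end tag
            # reports its name as "/u" rather than "u".
            ( my $name = lc $token->get_tag ) =~ s{^/}{};
            if ( $keep{$name} ) { $buffer .= $token->as_is }
            else                { $flush->() }
        }
    }
    $flush->();
    return @lines;
}

# Hypothetical input and search phrase, for illustration only.
my $html   = '<p>this item <u>contains an</u> underline.</p><p>other text</p>';
my $lookup = qr/underline/;
print "$_\n" for grep { /$lookup/ } extract_lines($html);
```

Note that the pattern is matched against the line with its tags still in place, so a search phrase that straddles a kept tag (for example "item contains") would not match unless the tags are stripped from a copy of the line before matching.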
Replies are listed 'Best First'.
Re: Get HTML::TokeParser::Simple to interpret some tags as text
by bart (Canon) on May 23, 2006 at 08:27 UTC
by jonnyfolk (Vicar) on May 23, 2006 at 09:08 UTC