
Get HTML::TokeParser::Simple to interpret some tags as text

by jonnyfolk (Vicar)
on May 23, 2006 at 08:16 UTC (#551093=perlquestion)
jonnyfolk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm using the HTML::TokeParser::Simple module (heartfelt thanks, ovid!) to extract text from various web pages:

    use HTML::TokeParser::Simple;

    my @html_docs = ( '/home/site/www/content.html' );
    my $lookup    = 'search phrase';    # placeholder: put the search phrase here

    foreach my $doc ( @html_docs ) {
        my $p = HTML::TokeParser::Simple->new( file => $doc );
        while ( my $token = $p->get_token ) {
            # only text tokens are examined; every tag is skipped
            next unless $token->is_text;
            my $line = $token->as_is;
            if ( $line =~ /$lookup/ ) {
                print qq~$line\n~;
            }
        }
    }

This extracts the text for me, but I would also like to print certain tags as text, so that a paragraph containing an underline, for example, would appear as a single line instead of two or three separate lines of text: "this item <u>contains an</u> underline."

I suspect that I would need a test such as if ( $token->is_tag ) { ... }, but I'm not sure what it would need to contain.

Any tips would be gratefully received.

Re: Get HTML::TokeParser::Simple to interpret some tags as text
by bart (Canon) on May 23, 2006 at 08:27 UTC
    Untested:
        my %allowed = ( u => 1, i => 1, b => 1 );
        my $text = "";
        while ( my $token = $p->get_token ) {
            if ( $token->is_text ) {
                $text .= $token->as_is;
            }
            elsif ( $token->is_tag && $allowed{ $token->get_tag } ) {
                $text .= $token->as_is;
            }
        }
    Now you can filter your text in $text, for example like:
        my @match = grep /$lookup/, split /\n/, $text;
        print "$_\n" foreach @match;
    There is, of course, the chance that you'll match on the tags now.
      OK, I get the idea now. Thanks bart - much appreciated.
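
For anyone who wants to try the idea end to end, here is a minimal sketch built on bart's suggestion. It is not from the thread: the sample HTML, the search phrase, and the helper name collect_runs are all made up for illustration. It keeps two copies of each run of text, one plain and one with the allowed tags left in, and matches only against the plain copy, which is one possible way around the "match on the tags" caveat. It also treats any tag outside the allowed list as the end of a run, which is an assumption about how the page is laid out rather than something the module does for you.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample input, standing in for the poster's content.html.
my $html = <<'HTML';
<html><body>
<p>this item <u>contains an</u> underline.</p>
<p>an unrelated <b>bold</b> paragraph</p>
</body></html>
HTML

my $lookup  = qr/underline/;                 # assumed search phrase
my %allowed = map { $_ => 1 } qw(u i b);     # inline tags to keep as text

# collect_runs (hypothetical helper): returns a list of [ $plain, $marked ]
# pairs, one per run of text between tags that are not in %allowed.
sub collect_runs {
    my ($source) = @_;
    my $p = HTML::TokeParser::Simple->new( string => $source );
    my @runs;
    my ( $plain, $marked ) = ( '', '' );

    while ( my $token = $p->get_token ) {
        if ( $token->is_text ) {
            $plain  .= $token->as_is;
            $marked .= $token->as_is;
        }
        elsif ( $token->is_tag ) {
            # Defensive: strip a leading slash in case an end tag's
            # name is reported as "/u" rather than "u".
            ( my $name = lc $token->get_tag ) =~ s{^/}{};
            if ( $allowed{$name} ) {
                $marked .= $token->as_is;    # keep the tag verbatim
            }
            else {
                # Any other tag closes the current run.
                push @runs, [ $plain, $marked ] if $plain =~ /\S/;
                ( $plain, $marked ) = ( '', '' );
            }
        }
    }
    push @runs, [ $plain, $marked ] if $plain =~ /\S/;
    return @runs;
}

# Match on the tag-free text, but print the version with tags kept.
my @matches = map { $_->[1] } grep { $_->[0] =~ $lookup } collect_runs($html);
print "$_\n" for @matches;
```

With the sample input above this should print the single line: this item <u>contains an</u> underline. Swapping the string => argument for file => $doc, as in the original post, would be the obvious way to point it at a real page.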
