PerlMonks  

Get HTML::TokeParser::Simple to interpret some tags as text

by jonnyfolk (Vicar)
on May 23, 2006 at 08:16 UTC (#551093)
jonnyfolk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm using the HTML::TokeParser::Simple module (heartfelt thanks, ovid!) to extract text from various web pages:

    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my @html_docs = ( '/home/site/www/content.html' );
    my $lookup    = 'search phrase';    # placeholder: put the search phrase here

    foreach my $doc ( @html_docs ) {
        my $p = HTML::TokeParser::Simple->new( file => $doc );
        while ( my $token = $p->get_token ) {
            next unless $token->is_text;    # only text tokens are examined
            my $line = $token->as_is;
            if ( $line =~ /$lookup/ ) {
                print qq~$line\n~;
            }
        }
    }

This extracts the text for me, but I would also like to print certain tags as text, so that a paragraph containing an underline, for example, appears as a single line rather than as two or three separate lines of text: "this item <u>contains an</u> underline."

I suspect that I would need a check such as if ( $token->is_tag ) { ... }, but I'm not sure what would have to go inside it.

Any tips would be gratefully received.

Re: Get HTML::TokeParser::Simple to interpret some tags as text
by bart (Canon) on May 23, 2006 at 08:27 UTC
    Untested:
        my %allowed = ( u => 1, i => 1, b => 1 );    # tags to keep as text
        my $text = "";
        while ( my $token = $p->get_token ) {
            if ( $token->is_text ) {
                $text .= $token->as_is;
            }
            elsif ( $token->is_tag && $allowed{ $token->get_tag } ) {
                $text .= $token->as_is;
            }
        }
    Now you can filter the text collected in $text, for example:
        my @match = grep /$lookup/, split /\n/, $text;
        print "$_\n" foreach @match;
    There is, of course, the chance that you'll match on the tags now.
      OK, I get the idea now. Thanks bart - much appreciated.
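
Regarding bart's caveat that the pattern may now match on the tags themselves: one possible workaround is to test each line against a copy with the allowed tags stripped out, and still print the original line with its tags. The following is only a minimal sketch in core Perl. The helper name match_visible_text is hypothetical (it is not part of HTML::TokeParser::Simple), and the sketch assumes that $text was built as in bart's reply and that the allowed tags are simple inline tags such as u, i and b.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any module: returns the lines of $text
# whose visible text matches $lookup. The allowed tags are removed from a
# copy of each line before matching, so the pattern cannot match on them.
sub match_visible_text {
    my ( $text, $lookup, @allowed ) = @_;
    my $tags = join '|', map { quotemeta($_) } @allowed;
    my @matches;
    for my $line ( split /\n/, $text ) {
        # Strip opening and closing forms of the allowed tags from a copy.
        ( my $plain = $line ) =~ s{</?(?:$tags)\b[^>]*>}{}gi;
        push @matches, $line if $plain =~ /$lookup/;
    }
    return @matches;
}

# Sample input standing in for the $text accumulated from the token loop.
my $text = "this item <u>contains an</u> underline.\n"
         . "an unrelated line about the letter u\n";

# The phrase spans the <u> tag in the source, but is found in the visible text.
my @matches = match_visible_text( $text, qr/item contains/, qw(u i b) );
print "$_\n" for @matches;
```

With this approach a search for a phrase that straddles a tag boundary still succeeds, and a pattern such as qr/\bu\b/ matches only the line that really contains a standalone "u", not the line containing the <u> tag.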
