PerlMonks  

Re: pulling just text from a url

by graff (Chancellor)
on Mar 19, 2006 at 18:00 UTC ([id://537819])



in reply to pulling just text from a url

I've had success avoiding script and CSS data within web pages using HTML::TokeParser as follows:

    use strict;
    use warnings;
    use HTML::TokeParser;

    # this assumes an html file in @ARGV or on STDIN:
    my $src;
    {
        # read the entire HTML input stream as one contiguous string:
        local $/ = undef;
        $src = <>;
    }
    my $htm = HTML::TokeParser->new( \$src );
    my $inscript = 0;
    my $ignore = join '|', qw/script style cssheader/;
    while ( my $tkn = $htm->get_token ) {
        if ( $$tkn[0] eq 'S' and $$tkn[1] =~ /^(?:$ignore)$/ ) {
            $inscript++;  # skip anything having to do with scripts, styles or css
            next;
        }
        elsif ( $$tkn[0] eq 'E' and $$tkn[1] =~ /^(?:$ignore)$/ ) {
            $inscript--;
            next;
        }
        elsif ( $$tkn[0] eq 'T' and ! $inscript ) {
            # we have text that is not part of scripting or styling,
            # so do something with this text...
        }
    }

This assumes the HTML input is well formed with respect to script, style and cssheader tags: every such start tag must have a matching end tag, or the $inscript counter never returns to zero and the remaining text is skipped. Note that HTML::TokeParser isn't really any more complicated than HTML::TokeParser::Simple -- you just have to know the structure of the tokens that it returns, so that you can set up handlers for the different types (start tags flagged by $$tkn[0] eq 'S', end tags by 'E', text data by 'T', etc., with the tag name or text content stored in $$tkn[1]).
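For what it's worth, here is one way the "do something with this text" branch could be filled in. It is only a sketch: the visible_text() helper, the $sample string and the whitespace cleanup are my own assumptions for illustration, not part of the code above, and I've left cssheader out of the ignore list to keep the sample short. The counter is also guarded so a stray end tag cannot push it below zero.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not from the original post): return the visible
# text of an HTML string, skipping anything inside script or style tags.
sub visible_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new( \$html )
        or die "Cannot create parser for HTML string";
    my $ignore = join '|', qw/script style/;
    my $depth  = 0;    # how many ignored elements we are currently inside
    my @chunks;

    while ( my $tkn = $parser->get_token ) {
        my ( $type, $value ) = @$tkn;    # token type, then tag name or text
        if ( $type eq 'S' and $value =~ /^(?:$ignore)$/ ) {
            $depth++;
        }
        elsif ( $type eq 'E' and $value =~ /^(?:$ignore)$/ ) {
            $depth-- if $depth > 0;      # tolerate a stray end tag
        }
        elsif ( $type eq 'T' and !$depth ) {
            # keep only text chunks that contain something besides whitespace
            push @chunks, $value if $value =~ /\S/;
        }
    }

    # join the chunks and normalize whitespace
    my $text = join ' ', @chunks;
    $text =~ s/\s+/ /g;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

# Made-up sample input, just to exercise the helper:
my $sample = '<html><head><style>p { color: red }</style></head>'
           . '<body><p>Hello</p><script>var x = 1;</script>'
           . '<p>world</p></body></html>';
print visible_text($sample), "\n";
```

Collecting the text into an array and joining at the end keeps the token loop itself as simple as the original; whether to join with spaces or newlines depends on what you want to do with the text afterwards.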
