PerlMonks  

pulling just text from a url

by coldfingertips (Pilgrim)
on Mar 19, 2006 at 21:16 UTC ( #537802=perlquestion )
coldfingertips has asked for the wisdom of the Perl Monks concerning the following question:

I need an accurate way to pull just the readable text from a web page. I was told HTML::TokeParser::Simple would work. The thing is, it's bringing back some CSS and JavaScript too, including Google Ad source code.

On top of this, there are a lot of &nbsp; and li tags in the page dump, too. I suppose I can filter these out with regexes, but there's no way I can account for everything that this module misses.

It also misprints some data. For example, the script below prints '0Items in cart', but there IS a space there on the page.

Is there an accurate way to do this?

#!/usr/bin/perl
use warnings;
use strict;

my $url = "http://www.sensationalscentsonline.com";
my $page_source;

use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new(url => $url);

while ( my $token = $p->get_token ) {
    # This prints all text in an HTML doc (i.e., it strips the HTML)
    next unless $token->is_text;
    $page_source .= $token->as_is if $token->as_is !~ m/^</;
}

print $page_source;

Re: pulling just text from a url
by ghenry (Vicar) on Mar 19, 2006 at 21:26 UTC

    This should be simple with HTML::Parser.

    See the Extract all plain text from an HTML file example to start with, then read the main docs to "fine tune" things.
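
As a rough illustration of the HTML::Parser route, here is a minimal sketch, not the example from the module's documentation. It assumes the page source is already in a string, and the names `extract_text`, `@chunks` and `$skip` are made up for this example. Text inside script and style elements is dropped, and the remaining pieces are joined with a space so that adjacent fragments such as '0' and 'Items in cart' do not run together.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the visible text of an HTML string.
sub extract_text {
    my ($html) = @_;
    my @chunks;       # text pieces collected in document order
    my $skip = 0;     # greater than 0 while inside <script> or <style>

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'tagname' passes the lowercased element name to the handler
        start_h => [ sub { $skip++ if $_[0] =~ /^(?:script|style)$/ }, 'tagname' ],
        end_h   => [ sub { $skip-- if $skip && $_[0] =~ /^(?:script|style)$/ }, 'tagname' ],
        # 'dtext' passes the text with entities such as &nbsp; decoded
        text_h  => [ sub { push @chunks, $_[0] unless $skip }, 'dtext' ],
    );
    $parser->unbroken_text(1);    # report each text run in one piece
    $parser->parse($html);
    $parser->eof;

    # Join with a space, then collapse whitespace (including the
    # non-breaking space that &nbsp; decodes to) and trim the ends.
    my $text = join ' ', @chunks;
    $text =~ s/[\s\xA0]+/ /g;
    $text =~ s/^ //;
    $text =~ s/ $//;
    return $text;
}

print extract_text('<script>var x = 1;</script><b>0</b><span>Items in cart</span>'), "\n";
```

Joining with a space is a trade-off: it fixes cases like the cart counter, but it will also insert a space where inline markup splits a single word.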

    HTH.

    Walking the road to enlightenment... I found a penguin and a camel on the way.....
    Fancy a yourname@perl.me.uk? Just ask!!!
Re: pulling just text from a url
by sulfericacid (Deacon) on Mar 19, 2006 at 21:27 UTC
    I'm not very familiar with this module myself, but to fix your spacing issue you simply need to add the space.
    $page_source = $page_source . " " . $token->as_is if $token->as_is !~ m/^</;


    "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

    sulfericacid
Re: pulling just text from a url
by graff (Chancellor) on Mar 19, 2006 at 23:00 UTC
    I've had success avoiding script and css data within web pages using HTML::TokeParser as follows:
    use strict;
    use warnings;
    use HTML::TokeParser;

    # this assumes an html file in @ARGV or on STDIN:
    my $src;
    {
        # read the entire HTML input stream as one contiguous string:
        local $/ = undef;
        $src = <>;
    }
    my $htm = HTML::TokeParser->new( \$src );
    my $inscript = 0;
    my $ignore = join '|', qw/script style cssheader/;

    while ( my $tkn = $htm->get_token ) {
        if ( $$tkn[0] eq 'S' and $$tkn[1] =~ /^(?:$ignore)$/ ) {
            $inscript++;  # skip anything having to do with scripts, styles or css
            next;
        }
        elsif ( $$tkn[0] eq 'E' and $$tkn[1] =~ /^(?:$ignore)$/ ) {
            $inscript--;
            next;
        }
        elsif ( $$tkn[0] eq 'T' and ! $inscript ) {
            # we have text that is not part of scripting or styling,
            # so do something with this text...
        }
    }
    This assumes the HTML input is well formed with respect to script, style and cssheader tags. Note that HTML::TokeParser isn't really any more complicated than HTML::TokeParser::Simple -- you just have to know the structure of the tokens that it returns, so that you can set up handlers for the different types (start tags flagged by $$tkn[0] eq 'S', end tags by 'E', text data by 'T', etc., with tag name or text content stored in $$tkn[1]).
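
One way the "do something with this text" step might be filled in is sketched below. It is an assumption-laden example rather than part of the reply above: the function name `visible_text` is invented, only script and style are skipped, and entities are decoded with HTML::Entities because the 'T' tokens from get_token carry the raw, undecoded text.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: collect the text tokens that lie outside
# <script> and <style> elements and return them as one string.
sub visible_text {
    my ($src) = @_;
    my $htm    = HTML::TokeParser->new( \$src ) or die "Cannot parse input";
    my $ignore = qr/^(?:script|style)$/;
    my $depth  = 0;    # greater than 0 while inside an ignored element
    my @pieces;

    while ( my $tkn = $htm->get_token ) {
        my ( $type, $value ) = @$tkn;    # for 'S'/'E' $value is the tag name, for 'T' the text
        if ( $type eq 'S' and $value =~ $ignore ) {
            $depth++;
        }
        elsif ( $type eq 'E' and $value =~ $ignore ) {
            $depth-- if $depth;
        }
        elsif ( $type eq 'T' and !$depth ) {
            my $text = decode_entities($value);    # &nbsp; becomes \xA0, &amp; becomes &
            $text =~ s/[\s\xA0]+/ /g;              # collapse runs of whitespace
            push @pieces, $text if $text =~ /\S/;  # ignore whitespace-only tokens
        }
    }

    # Separate tokens with a single space and tidy up the result.
    my $out = join ' ', @pieces;
    $out =~ s/ {2,}/ /g;
    $out =~ s/^ //;
    $out =~ s/ $//;
    return $out;
}

print visible_text('<script>alert("x")</script><p>0</p><p>Items in cart</p>'), "\n";
```

To run this against a live page, the source string could be fetched first with a module such as LWP::Simple and then passed to the function.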
Re: pulling just text from a url
by PodMaster (Abbot) on Mar 20, 2006 at 08:06 UTC
