PerlMonks  

pulling just text from a url

by coldfingertips (Pilgrim)
on Mar 19, 2006 at 21:16 UTC (#537802=perlquestion)
coldfingertips has asked for the wisdom of the Perl Monks concerning the following question:

I need an accurate way to pull just the readable text from a web page. I was told HTML::TokeParser or HTML::TokeParser::Simple would work. The thing is, it's bringing back some CSS and JavaScript too, including Google Ad source code.

On top of this, there are a lot of &nbsp; entities and li tags in the page dump, too. I suppose I can filter these out with regexes, but there's no way I can account for everything that this module misses.

It also misprints some data. For example, the script below prints '0Items in cart', even though there IS a space there on the page.

Is there an accurate way to do this?

#!/usr/bin/perl
use warnings;
use strict;

my $url = "http://www.sensationalscentsonline.com";
my $page_source;

use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new(url => $url);

while ( my $token = $p->get_token ) {
    # This prints all text in an HTML doc (i.e., it strips the HTML)
    next unless $token->is_text;
    $page_source .= $token->as_is if $token->as_is !~ m/^</;
}

print $page_source;

Re: pulling just text from a url
by ghenry (Vicar) on Mar 19, 2006 at 21:26 UTC

    This should be simple with HTML::Parser.

    See the Extract all plain text from an HTML file example to start with, then read the main docs to "fine tune" things.
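As a rough illustration of the HTML::Parser route, here is a minimal sketch. It assumes the HTML is already in a string; the helper name html_to_text and the sample markup are made up for this example and are not taken from the module's documentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the readable text of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my @chunks;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the callback text with entities already decoded
        text_h => [ sub { push @chunks, shift }, 'dtext' ],
    );

    # Suppress <script> and <style> elements together with their content
    $parser->ignore_elements(qw(script style));
    # Report each run of text as one piece rather than in fragments
    $parser->unbroken_text(1);

    $parser->parse($html);
    $parser->eof;

    my $text = join ' ', @chunks;
    $text =~ s/\s+/ /g;      # squeeze whitespace runs into one space
    $text =~ s/^ | $//g;     # trim both ends
    return $text;
}

print html_to_text('<p>0</p><script>var x = 1;</script><b>Items in cart</b>'), "\n";
```

Joining the pieces with a space is a choice made for this sketch; it keeps adjacent elements from running together, at the cost of sometimes adding a space the page does not show.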

    HTH.

    Walking the road to enlightenment... I found a penguin and a camel on the way.....
    Fancy a yourname@perl.me.uk? Just ask!!!
Re: pulling just text from a url
by sulfericacid (Deacon) on Mar 19, 2006 at 21:27 UTC
    I'm not very familiar with this module myself, but to fix your spacing issue you simply need to add the space.
    $page_source = $page_source . " " . $token->as_is if $token->as_is !~ m/^</;
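One possible refinement, sketched below in core Perl only: collect the text pieces in a list, join them with a space, then squeeze the whitespace so the added spaces do not pile up. The helper name join_text_chunks is hypothetical, and the sample strings stand in for the values returned by $token->as_is.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: join text fragments with a space, then
# collapse runs of whitespace into a single space.
sub join_text_chunks {
    my @chunks = @_;
    my $text = join ' ', @chunks;
    $text =~ s/\s+/ /g;          # squeeze whitespace runs into one space
    $text =~ s/^\s+|\s+$//g;     # trim both ends
    return $text;
}

print join_text_chunks('0', "Items in cart\n"), "\n";   # prints "0 Items in cart"
```

Note that a space between every pair of fragments can also split a word that the page renders without a gap, for example when inline markup ends in the middle of a word.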


    "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

    sulfericacid
Re: pulling just text from a url
by graff (Chancellor) on Mar 19, 2006 at 23:00 UTC
    I've had success avoiding script and css data within web pages using HTML::TokeParser as follows:
    use strict;
    use warnings;
    use HTML::TokeParser;

    # this assumes an html file in @ARGV or on STDIN:
    my $src;
    {
        # read the entire HTML input stream as one contiguous string:
        local $/ = undef;
        $src = <>;
    }
    my $htm = HTML::TokeParser->new( \$src );
    my $inscript = 0;
    my $ignore = join '|', qw/script style cssheader/;
    while ( my $tkn = $htm->get_token ) {
        if ( $$tkn[0] eq 'S' and $$tkn[1] =~ /^(?:$ignore)$/ ) {
            $inscript++;  # skip anything having to do with scripts, styles or css
            next;
        }
        elsif ( $$tkn[0] eq 'E' and $$tkn[1] =~ /^(?:$ignore)$/ ) {
            $inscript--;
            next;
        }
        elsif ( $$tkn[0] eq 'T' and ! $inscript ) {
            # we have text that is not part of scripting or styling,
            # so do something with this text...
        }
    }
    This assumes the html input is well formed with respect to script, style and cssheader tags. Note that HTML::TokeParser isn't really any more complicated than HTML::TokeParser::Simple -- you just have to know the structure of the tokens that it returns, so that you can set up handlers for the different types (start tags flagged by $$tkn[0] eq 'S', end tags by 'E', text data by 'T', etc., with tag name or text content stored in $$tkn[1]).
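To make the token layout concrete, here is a minimal sketch that wraps the same idea in a function and fills in the "do something with this text" step. The function name visible_text, the choice to decode entities with HTML::Entities, and the sample input are assumptions made for illustration; only script and style are skipped here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: collect text tokens that sit outside
# <script> and <style> elements.
sub visible_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new( \$html )
        or die "Cannot create parser";
    my %skip  = map { $_ => 1 } qw(script style);
    my $depth = 0;      # > 0 while inside a skipped element
    my @chunks;

    while ( my $token = $parser->get_token ) {
        my ( $type, $value ) = @$token;   # type flag, then tag name or text
        if ( $type eq 'S' && $skip{$value} ) {
            $depth++;
        }
        elsif ( $type eq 'E' && $skip{$value} ) {
            $depth-- if $depth > 0;
        }
        elsif ( $type eq 'T' && !$depth ) {
            my $text = decode_entities($value);  # &nbsp; becomes chr(160)
            $text =~ s/\xA0/ /g;                 # treat it as a plain space
            $text =~ s/\s+/ /g;                  # squeeze whitespace runs
            $text =~ s/^ | $//g;                 # trim both ends
            push @chunks, $text if length $text;
        }
    }
    return join ' ', @chunks;
}

print visible_text('<p>0&nbsp;</p><script>x()</script><li>Items in cart</li>'), "\n";
```

Like the loop above, this relies on the skipped elements being properly closed; an unclosed script tag would leave the counter raised and hide the rest of the page.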
Re: pulling just text from a url
by PodMaster (Abbot) on Mar 20, 2006 at 08:06 UTC
