getting text from HTML

IB2017 has asked for the wisdom of the Perl Monks concerning the following question:

I want to extract the textual parts of HTML pages (i.e. the text that a browser normally shows). I am using the following script, but for some pages it prints out code too. The script and the URL below reproduce the issue. Any ideas?

use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTML::TokeParser::Simple;

my $ua = LWP::UserAgent->new( );
my $text = do_GET_TXT("http://www.spacex.com/webcast");
print $text;

sub do_GET_TXT {
    my ($url)=@_;
    print "Downloading and reading HTML $url...\n";
    my $response = $ua->get($url);
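    if ($response->is_error) {
        # Report the failure and return an empty string instead of
        # implicitly returning the bare status code as the page text
        warn "GET $url failed: ", $response->status_line, "\n";
        return '';
    }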
    else{
        my $HTML = $response->decoded_content();
        my @text;
        require HTML::TokeParser::Simple;
        my $p = HTML::TokeParser::Simple->new(\$HTML);
        while ( my $token = $p->get_token ) {
            next unless $token->is_text;
            my $out = $token->as_is;
            $out =~ s/^\s+/\n/;
            push (@text, $out);
        }
        my $text = join("", @text);
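        # some heuristics
        $text =~ s/\n+/\n/g;                # collapse runs of newlines
        $text =~ s/\n(\d+\.)\n/$1\t/g;      # keep "1." on the same line as its item
        $text =~ s/(\(\d+\))\n/$1\t/g;      # same for "(1)"
        $text =~ s/([a-z]\))\n/$1\t/g;      # same for "a)"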
        return $text;
    }
}

I obtain more or less the same using Mojo::DOM with the following:

        
my $dom = Mojo::DOM->new($HTML);
my $text = $dom->all_text();

The following does a bit better (it restricts the parsing to the body), but snippets of JavaScript still show up:

my $dom = Mojo::DOM->new($HTML);
my $text = $dom->at('body')->all_text();


Replies are listed 'Best First'.
Re: getting text from HTML by haukex (Archbishop) on May 03, 2020 at 21:42 UTC
I believe what you're seeing is the concept of the Document Object Model, where basically "text nodes" are anything that's not an element, including everything between `<script>` tags etc. One easy workaround is to clobber all the tags you don't want:

use Mojo::Base -strict;
use open qw/:std :utf8/;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new( max_redirects => 3 );
my $res = $ua->get('http://www.spacex.com/webcast')->result;
die $res->message unless $res->is_success;
my $dom = $res->dom;
$dom->find('script, style')->map('remove');
my $text = $dom->at('body')->all_text;
1 while $text =~ s/\s{2,}/ /g;
say $text;
__END__
Jump to navigation Falcon 9 Falcon Heavy Dragon Starship Updates About SpaceX Careers Shop You are hereHome STARLINK MISSION On Wednesday, April 22 at 3:30 p.m. EDT, or 19:30 p.m. UTC, SpaceX launched its seventh Starlink mission. Falcon 9 lifted off from Launch Complex 39A (LC-39A) at NASA’s Kennedy Space Center in Florida.Falcon 9’s first stage previously supported Crew Dragon’s first flight to the International Space Station, launch of the RADARSAT Constellation Mission, and the fourth Starlink mission. Following stage separation, SpaceX landed Falcon 9’s first stage on the “Of Course I Still Love You” droneship, which was stationed in the Atlantic Ocean. Falcon 9’s fairing previously supported the AMOS-17 mission. You can watch a replay of the launch below and learn more about the mission here. | Twitter YouTube Flickr Instagram Privacy © 2020 Space Exploration Technologies Corp.
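The same idea can be applied to the HTML::TokeParser::Simple loop from the question: keep a flag while the parser is between a `script` or `style` start tag and its end tag, and ignore text tokens during that time. The following is only a sketch under assumptions: the helper name `visible_text` and the sample HTML are made up for illustration, and it has not been run against the SpaceX page.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not from the thread): collect text tokens,
# skipping everything between <script>/<style> start and end tags.
sub visible_text {
    my ($html) = @_;
    my $p    = HTML::TokeParser::Simple->new(\$html);
    my $skip = 0;    # true while inside a script or style element
    my @text;
    while ( my $token = $p->get_token ) {
        if ( $token->is_start_tag('script') or $token->is_start_tag('style') ) {
            $skip = 1;
        }
        elsif ( $token->is_end_tag('script') or $token->is_end_tag('style') ) {
            $skip = 0;
        }
        elsif ( $token->is_text and not $skip ) {
            push @text, $token->as_is;
        }
    }
    my $text = join ' ', @text;
    $text =~ s/\s+/ /g;         # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;    # trim both ends
    return $text;
}

# Made-up sample document for illustration
my $sample = '<html><head><style>p { color: red }</style></head>'
           . '<body><p>Hello</p><script>var x = 1;</script><p>world</p></body></html>';
print visible_text($sample), "\n";
```

Note that `as_is` returns the text as it appears in the source, so entities such as `&amp;` are not decoded by this sketch.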
Re^2: getting text from HTML by IB2017 (Pilgrim) on May 04, 2020 at 08:43 UTC
I am further experimenting with your great solution. I have an issue with the text being concatenated. Is there a way to separate, let's say with a simple white space, the text snippets the script extracts from the different sections of the page? If you look at the first line of the result you get, you can see `You are hereHome`, which should be separated. I can't see any option for `my $text = $dom->all_text;` (besides the trim `all_text(0);`, which does not apply here). Of course I can go with something like

$text = $res->dom('h1, h2, h3, p')->each(sub { say 'text: ', shift->all_text });

I am starting to love Mojo...
Re^3: getting text from HTML (updated x2) by haukex (Archbishop) on May 04, 2020 at 09:46 UTC
Looking at the code of Mojo::DOM, it doesn't look like it's directly supported. But luckily it's not too difficult to add (you can of course put the `package` into its own `.pm` file):

Update: I've modified the methods so that they return a nested set of Mojo::Collection objects of the callback results, so that `walk` is kind of like a tree-based `map`.

Update 2: For an even more refined version, see here.

use Mojo::Base -strict;
use 5.014;
use Mojo::UserAgent;
use Mojo::DOM;
use Mojo::Util qw/dumper/;

package Mojo::DOM::Role::TreeWalker {
    use Mojo::Base -strict;
    use Role::Tiny;
    use Mojo::Collection;
    sub walk { $_[0]->_walk($_[1], 0) || Mojo::Collection->new }
    sub _walk {
        my ($self, $cb, $depth) = @_;
        my $c = Mojo::Collection->new;
        { local $_ = $self; push @$c, $cb->($self, $depth++); }
        my $rv = $self->child_nodes->map('_walk', $cb, $depth);
        push @$c, $rv if @$rv;
        @$c ? $c : ();
    }
    sub walk_text {
        my ($self, $cb) = @_;
        $self->walk(sub {
            $_->type eq 'cdata' || $_->type eq 'raw' || $_->type eq 'text'
                ? $cb->(@_) : () });
    }
}

my $ua = Mojo::UserAgent->new( max_redirects => 3 );
my $res = $ua->get('http://www.spacex.com/webcast')->result;
die $res->message unless $res->is_success;
my $dom = $res->dom;
$dom->find('script, style')->map('remove');
my $texts = $dom->with_roles('+TreeWalker')->walk_text(sub {
    $_->content=~/\S/ ? $_->content=~s/^\s+|\s+$//gr : () })->flatten;
print dumper $texts;
__END__
bless( [
  "STARLINK MISSION | SpaceX",
  "Jump to navigation",
  "Falcon 9",
  "Falcon Heavy",
  "Dragon",
  "Starship",
  "Updates",
  "About SpaceX",
  "Careers",
  "Shop",
  "You are here",
  "Home",
  "STARLINK MISSION",
  "On Wednesday, April 22 at 3:30 p.m. EDT, or 19:30 p.m. UTC, SpaceX launched its seventh Starlink mission. Falcon 9 lifted off from Launch Complex 39A (LC-39A) at NASA\x{2019}s Kennedy Space Center in Florida.",
  "Falcon 9\x{2019}s first stage previously supported Crew Dragon\x{2019}s first flight to the International Space Station, launch of the RADARSAT Constellation Mission, and the fourth Starlink mission. Following stage separation, SpaceX landed Falcon 9\x{2019}s first stage on the \x{201c}Of Course I Still Love You\x{201d} droneship, which was stationed in the Atlantic Ocean. Falcon 9\x{2019}s fairing previously supported the AMOS-17 mission.",
  "You can watch a replay of the launch below and learn more about the mission",
  "here.",
  "|",
  "Twitter",
  "YouTube",
  "Flickr",
  "Instagram",
  "Privacy",
  "\x{a9} 2020 Space Exploration Technologies Corp."
], 'Mojo::Collection' )
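If a flat, space-separated string is all that is needed, a shorter variant may be enough. The sketch below assumes that Mojo::DOM's `descendant_nodes` together with `type` and `content` is sufficient for the page at hand; the helper name `spaced_text` and the sample HTML are hypothetical, and only nodes of type `text` are kept (not `cdata` or `raw`).

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper (not from the thread): join every non-blank
# text node with a single space after removing script/style elements.
sub spaced_text {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);
    $dom->find('script, style')->map('remove');
    return $dom->descendant_nodes
        ->grep(sub { $_->type eq 'text' and $_->content =~ /\S/ })
        ->map(sub { $_->content =~ s/^\s+|\s+$//gr })    # trim each snippet
        ->join(' ')
        ->to_string;
}

# Made-up sample mirroring the "You are hereHome" case
my $sample = '<body><h2>You are here</h2><a href="/">Home</a>'
           . '<script>var x = 1;</script></body>';
print spaced_text($sample), "\n";
```

Compared with the role above, this loses the tree structure and the depth information, so it is only a convenience for the flat case.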
Re^2: getting text from HTML by IB2017 (Pilgrim) on May 03, 2020 at 22:58 UTC
Very nice, thank you. I am also experimenting with Mojo::UserAgent, which seems more modern than the one I have been using all along.
Re: getting text from HTML by perlfan (Vicar) on May 12, 2020 at 03:26 UTC
Shout out to Web::Scraper. It takes some time to wrap your head around it, but it's pretty good for writing robust scrapers. Relying on the DOM itself makes your scraper very brittle, especially if this is an HTML source that you do not control.
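For readers who have not seen Web::Scraper, a minimal sketch of its declarative style might look like the following. The selector, the `paragraphs` key, and the sample HTML are assumptions chosen for illustration, not something taken from this thread.

```perl
use strict;
use warnings;
use Web::Scraper;

# Declare what to extract: the text of every <p> element,
# collected into an array under the (hypothetical) key 'paragraphs'.
my $paragraphs = scraper {
    process 'p', 'paragraphs[]' => 'TEXT';
};

# Made-up sample document; scrape() also accepts a URI object
my $html = '<html><body><h1>Demo</h1>'
         . '<p>First paragraph.</p><p>Second paragraph.</p></body></html>';

my $result = $paragraphs->scrape($html);
print "$_\n" for @{ $result->{paragraphs} || [] };
```

Because only the selected elements are extracted, unrelated markup elsewhere on the page does not reach the result.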
Re: getting text from HTML by Anonymous Monk on May 04, 2020 at 16:57 UTC
If you have raw HTML and need to get anything-at-all out of it, HTML::Parser is the best way to go. It will process real-world HTML content of arbitrary size and, in this case, call a `text` event handler each time a block of text is found.
Re^2: getting text from HTML by Your Mother (Archbishop) on May 12, 2020 at 17:21 UTC
HTML::Parser is pretty far down on the list of things you should recommend, especially to a newish Perl hacker. The OP was already trying HTML::TokeParser::Simple which has a better, higher level, interface, and does the same things. I’m also going to critique answers that come without code and use language like “raw HTML” and “real-world” and “arbitrary size.” At the very best, it’s unhelpful. At face level, it’s detrimental to wisdom seekers.
Re^2: getting text from HTML by Anonymous Monk on May 12, 2020 at 10:13 UTC
Wrong. HTML::Parser is too low level.
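For comparison, the event-based approach discussed in this sub-thread could be sketched roughly as follows. The helper name `text_via_events` and the sample HTML are hypothetical; the sketch relies on HTML::Parser's `ignore_elements` to drop `script` and `style` content and on the `dtext` argspec to receive entity-decoded text.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not from the thread): gather text through a
# text event handler, ignoring script and style elements entirely.
sub text_via_events {
    my ($html) = @_;
    my @chunks;
    my $p = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the handler the entity-decoded text
        text_h => [ sub { push @chunks, $_[0] if $_[0] =~ /\S/ }, 'dtext' ],
    );
    $p->ignore_elements(qw(script style));    # suppress tags and content
    $p->unbroken_text(1);                     # deliver each text run in one piece
    $p->parse($html);
    $p->eof;
    s/^\s+|\s+$//g for @chunks;               # trim each chunk
    return join ' ', @chunks;
}

# Made-up sample document for illustration
my $sample = '<p>Tom &amp; Jerry</p><script>alert("x")</script><p>The end</p>';
print text_via_events($sample), "\n";
```

The handler wiring is what makes this lower level than the token loop in the question: the caller has to manage the state (`@chunks` here) that the callbacks write into.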