http://www.perlmonks.org?node_id=11116409

IB2017 has asked for the wisdom of the Perl Monks concerning the following question:

I want to extract the textual parts of HTML pages (i.e. the texts that a browser normally shows). I am using the following script, however with some pages it prints out code too. The script and the url can reproduce the issue. Any idea?

use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTML::TokeParser::Simple;

my $ua = LWP::UserAgent->new( );
my $text = do_GET_TXT("http://www.spacex.com/webcast");
print $text;

sub do_GET_TXT {
    my ($url)=@_;
    print "Downloading and reading HTML $url...\n";
    my $response = $ua->get($url);
    if ($response->is_error) {
        $response->code;
    }
    else{
        my $HTML = $response->decoded_content();
        my @text;
        require HTML::TokeParser::Simple;
        my $p = HTML::TokeParser::Simple->new(\$HTML);
        while ( my $token = $p->get_token ) {
            next unless $token->is_text;
            my $out = $token->as_is;
            $out =~ s/^\s+/\n/;
            push (@text, $out);
        }
        my $text = join("", @text);
        #some heuristics
        $text =~ s/\n+/\n/g;
        $text =~ s/\n(\d+\.)\n/$1\t/g;
        $text =~ s/(\(\d+\))\n/$1\t/g;
        $text =~ s/([a-z]\))\n/$1\t/g;
        return $text;
    }
}

I obtain more or less the same using Mojo::DOM with the following:

my $dom = Mojo::DOM->new($HTML);
my $text = $dom->all_text();

I get slightly better results with the following, which restricts the parsing to the body; however, snippets of JavaScript still show up:

my $dom = Mojo::DOM->new($HTML);
my $text = $dom->at('body')->all_text();

Replies are listed 'Best First'.
Re: getting text from HTML
by haukex (Archbishop) on May 03, 2020 at 21:42 UTC

    I believe what you're seeing is the concept of the Document Object Model, where basically "text nodes" are anything that's not an element, including everything between <script> tags etc. One easy workaround is to clobber all the tags you don't want:

    use Mojo::Base -strict;
    use open qw/:std :utf8/;
    use Mojo::UserAgent;

    my $ua = Mojo::UserAgent->new( max_redirects => 3 );
    my $res = $ua->get('http://www.spacex.com/webcast')->result;
    die $res->message unless $res->is_success;
    my $dom = $res->dom;
    $dom->find('script, style')->map('remove');
    my $text = $dom->at('body')->all_text;
    1 while $text =~ s/\s{2,}/ /g;
    say $text;

    __END__
    Jump to navigation Falcon 9 Falcon Heavy Dragon Starship Updates About SpaceX Careers Shop You are hereHome STARLINK MISSION On Wednesday, April 22 at 3:30 p.m. EDT, or 19:30 p.m. UTC, SpaceX launched its seventh Starlink mission. Falcon 9 lifted off from Launch Complex 39A (LC-39A) at NASA’s Kennedy Space Center in Florida.Falcon 9’s first stage previously supported Crew Dragon’s first flight to the International Space Station, launch of the RADARSAT Constellation Mission, and the fourth Starlink mission. Following stage separation, SpaceX landed Falcon 9’s first stage on the “Of Course I Still Love You” droneship, which was stationed in the Atlantic Ocean. Falcon 9’s fairing previously supported the AMOS-17 mission. You can watch a replay of the launch below and learn more about the mission here. | Twitter YouTube Flickr Instagram Privacy © 2020 Space Exploration Technologies Corp.
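    To see the effect without hitting the network, here is a minimal offline sketch of the same idea. The HTML snippet is made up for illustration; only Mojo::DOM calls already used above (at, find, all_text, and map('remove')) are assumed.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical page: a <script> sits between two visible paragraphs.
my $html = <<'HTML';
<html><head><style>p { color: red }</style></head>
<body><p>Hello</p><script>var x = "not for readers";</script><p>World</p></body></html>
HTML

my $dom = Mojo::DOM->new($html);

# Before the cleanup, the script source is part of the extracted text.
my $before = $dom->at('body')->all_text;

# Drop the elements whose content a browser never renders as text.
$dom->find('script, style')->map('remove');
my $after = $dom->at('body')->all_text;

print "before: $before\n";
print "after:  $after\n";
```

    The "before" string should still contain the JavaScript source, while the "after" string should only contain the paragraph text.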

      I am further experimenting with your great solution. I have an issue with the text being concatenated. Is there a way to separate, let's say with a simple white space, the text snippets the script extracts from the different sections of the page? If you look at the result you get, first line, you can see You are hereHome which should be separated. I can't see any option for my $text = $dom->all_text; (besides the trim all_text(0); which does not apply here)

      Of course I can go with something like

      $text = $res->dom('h1, h2, h3, p')->each(sub { say 'text: ', shift->all_text });

      I am starting to love Mojo...

        Looking at the code of Mojo::DOM, it doesn't look like it's directly supported. But luckily it's not too difficult to add (you can of course put the package into its own .pm file):

        Update: I've modified the methods so that they return a nested set of Mojo::Collection objects of the callback results, so that walk is kind of like a tree-based map.

        Update 2: For an even more refined version, see here.

        use Mojo::Base -strict;
        use 5.014;
        use Mojo::UserAgent;
        use Mojo::DOM;
        use Mojo::Util qw/dumper/;

        package Mojo::DOM::Role::TreeWalker {
            use Mojo::Base -strict;
            use Role::Tiny;
            use Mojo::Collection;
            sub walk { $_[0]->_walk($_[1], 0) || Mojo::Collection->new }
            sub _walk {
                my ($self, $cb, $depth) = @_;
                my $c = Mojo::Collection->new;
                {
                    local $_ = $self;
                    push @$c, $cb->($self, $depth++);
                }
                my $rv = $self->child_nodes->map('_walk', $cb, $depth);
                push @$c, $rv if @$rv;
                @$c ? $c : ();
            }
            sub walk_text {
                my ($self, $cb) = @_;
                $self->walk(sub {
                    $_->type eq 'cdata' || $_->type eq 'raw' || $_->type eq 'text'
                        ? $cb->(@_) : () });
            }
        }

        my $ua = Mojo::UserAgent->new( max_redirects => 3 );
        my $res = $ua->get('http://www.spacex.com/webcast')->result;
        die $res->message unless $res->is_success;
        my $dom = $res->dom;
        $dom->find('script, style')->map('remove');
        my $texts = $dom->with_roles('+TreeWalker')->walk_text(sub {
            $_->content=~/\S/ ? $_->content=~s/^\s+|\s+$//gr : () })->flatten;
        print dumper $texts;

        __END__
        bless( [
          "STARLINK MISSION | SpaceX",
          "Jump to navigation",
          "Falcon 9",
          "Falcon Heavy",
          "Dragon",
          "Starship",
          "Updates",
          "About SpaceX",
          "Careers",
          "Shop",
          "You are here",
          "Home",
          "STARLINK MISSION",
          "On Wednesday, April 22 at 3:30 p.m. EDT, or 19:30 p.m. UTC, SpaceX launched its seventh Starlink mission. Falcon 9 lifted off from Launch Complex 39A (LC-39A) at NASA\x{2019}s Kennedy Space Center in Florida.",
          "Falcon 9\x{2019}s first stage previously supported Crew Dragon\x{2019}s first flight to the International Space Station, launch of the RADARSAT Constellation Mission, and the fourth Starlink mission. Following stage separation, SpaceX landed Falcon 9\x{2019}s first stage on the \x{201c}Of Course I Still Love You\x{201d} droneship, which was stationed in the Atlantic Ocean. Falcon 9\x{2019}s fairing previously supported the AMOS-17 mission.",
          "You can watch a replay of the launch below and learn more about the mission",
          "here.",
          "|",
          "Twitter",
          "YouTube",
          "Flickr",
          "Instagram",
          "Privacy",
          "\x{a9} 2020 Space Exploration Technologies Corp."
        ], 'Mojo::Collection' )
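        If a flat, space-separated string is all that is needed, a lighter-weight sketch is possible with Mojo::DOM's descendant_nodes, without adding a role. This is an alternative to the role above, not a replacement for it: it loses the nesting information, and the sample markup below is invented to mimic the "You are hereHome" case.

```perl
use strict;
use warnings;
use 5.014;    # enables say and the s///r modifier
use Mojo::DOM;

# Made-up markup: two adjacent inline elements with no whitespace between them.
my $html = '<div><span>You are here</span><a href="/">Home</a>'
         . '<script>var x = 1;</script><p>STARLINK MISSION</p></div>';

my $dom = Mojo::DOM->new($html);
$dom->find('script, style')->map('remove');

# Collect every text node in document order, drop whitespace-only nodes,
# trim the rest, and join the pieces with a single space.
my $text = $dom->descendant_nodes
    ->grep(sub { $_->type eq 'text' })
    ->map('content')
    ->grep(qr/\S/)
    ->map(sub { s/^\s+|\s+$//gr })
    ->join(' ')
    ->to_string;

say $text;
```

        Because each text node is joined separately, "You are here" and "Home" end up as distinct words instead of running together.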

      Very nice, thank you. I am also experimenting with Mojo::UserAgent, which seems more modern than the one I have been using all this time.

Re: getting text from HTML
by perlfan (Vicar) on May 12, 2020 at 03:26 UTC
    Shout out to Web::Scraper. It takes some time to wrap your head around it, but it's pretty good for writing robust scrapers. Relying on the DOM itself makes your scraper very brittle, especially if this is an HTML source that you do not control.
Re: getting text from HTML
by Anonymous Monk on May 04, 2020 at 16:57 UTC
    If you have raw HTML and need to get anything-at-all out of it, HTML::Parser is the best way to go. It will process real-world HTML content of arbitrary size and, in this case, call a text event handler each time a block of text is found.
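    For reference, a minimal sketch of what such a text handler could look like, assuming the version 3 API of HTML::Parser. The sample HTML and the @chunks array are illustrative only; ignore_elements is used so that script and style content never reaches the handler.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical input with one script and one style block.
my $html = <<'HTML';
<html><head><style>p { color: red }</style></head>
<body><p>Hello</p><script>var x = "hidden";</script><p>World</p></body></html>
HTML

my @chunks;
my $parser = HTML::Parser->new(
    api_version => 3,
    # Called for each block of text; 'dtext' passes entity-decoded text.
    text_h => [
        sub {
            my ($text) = @_;
            return unless $text =~ /\S/;      # skip whitespace-only blocks
            $text =~ s/^\s+|\s+$//g;          # trim both ends
            push @chunks, $text;
        },
        'dtext',
    ],
);
$parser->unbroken_text(1);                    # deliver each text run in one piece
$parser->ignore_elements(qw(script style));   # suppress these elements and their content
$parser->parse($html);
$parser->eof;

print join(' ', @chunks), "\n";
```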

      HTML::Parser is pretty far down on the list of things you should recommend, especially to a newish Perl hacker. The OP was already trying HTML::TokeParser::Simple which has a better, higher level, interface, and does the same things. I’m also going to critique answers that come without code and use language like “raw HTML” and “real-world” and “arbitrary size.” At the very best, it’s unhelpful. At face level, it’s detrimental to wisdom seekers.
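      In that spirit, here is a hedged sketch of how the OP's HTML::TokeParser::Simple loop might be adapted to skip script and style content. The $skip counter and the sample HTML are my own additions for illustration; the token methods (is_start_tag, is_end_tag, is_text, as_is) are the ones the module documents.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical input; the real page would come from $response->decoded_content.
my $html = <<'HTML';
<body><p>Hello</p><script>var x = "hidden";</script>
<style>p { color: red }</style><p>World</p></body>
HTML

my $p = HTML::TokeParser::Simple->new(\$html);
my @text;
my $skip = 0;    # non-zero while inside <script> or <style>

while ( my $token = $p->get_token ) {
    if ( $token->is_start_tag('script') || $token->is_start_tag('style') ) {
        $skip++;
        next;
    }
    if ( $token->is_end_tag('script') || $token->is_end_tag('style') ) {
        $skip-- if $skip;
        next;
    }
    next if $skip or not $token->is_text;

    my $out = $token->as_is;
    next unless $out =~ /\S/;     # ignore whitespace-only text
    $out =~ s/^\s+|\s+$//g;       # trim both ends
    push @text, $out;
}

my $text = join ' ', @text;
print "$text\n";
```

      Note that as_is returns the raw text, so entities such as &amp;amp; are not decoded here.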

      Wrong. HTML::Parser is too low-level.