IB2017 has asked for the wisdom of the Perl Monks concerning the following question:
I want to extract the textual parts of HTML pages (i.e. the text that a browser normally shows). I am using the following script; however, for some pages it also prints out code. The script and URL below reproduce the issue. Any ideas?
    use strict;
    use warnings;
    use LWP::Simple;
    use LWP::UserAgent;
    use HTML::TokeParser::Simple;

    my $ua = LWP::UserAgent->new( );
    my $text = do_GET_TXT("http://www.spacex.com/webcast");
    print $text;

    sub do_GET_TXT {
        my ($url)=@_;
        print "Downloading and reading HTML $url...\n";
        my $response = $ua->get($url);
        if ($response->is_error) {
            $response->code;
        }
        else{
            my $HTML = $response->decoded_content();
            my @text;
            require HTML::TokeParser::Simple;
            my $p = HTML::TokeParser::Simple->new(\$HTML);
            while ( my $token = $p->get_token ) {
                next unless $token->is_text;
                my $out = $token->as_is;
                $out =~ s/^\s+/\n/;
                push (@text, $out);
            }
            my $text = join("", @text);
            #some heuristics
            $text =~ s/\n+/\n/g;
            $text =~ s/\n(\d+\.)\n/$1\t/g;
            $text =~ s/(\(\d+\))\n/$1\t/g;
            $text =~ s/([a-z]\))\n/$1\t/g;
            return $text;
        }
    }
I obtain more or less the same using Mojo::DOM with the following:
    my $dom = Mojo::DOM->new($HTML);
    my $text = $dom->all_text();
I get slightly better results with the following, which restricts the parsing to the body; however, snippets of JavaScript still show up:
    my $dom = Mojo::DOM->new($HTML);
    my $text = $dom->at('body')->all_text();
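One likely reason for the stray code, offered as a sketch rather than a definitive fix: HTML::Parser-based tokenizers hand back the contents of script and style elements as ordinary text tokens, so a loop that keeps every text token keeps the JavaScript and CSS too. The example below tracks whether the parser is currently inside one of those elements and skips text while it is. The helper name `visible_text` and the sample HTML are hypothetical, made up for illustration; they are not part of the original script.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not from the original post): collect text tokens,
# skipping anything between <script>...</script> or <style>...</style>.
sub visible_text {
    my ($html) = @_;
    my $p    = HTML::TokeParser::Simple->new(\$html);
    my $skip = 0;    # greater than 0 while inside a script or style element
    my @text;
    while ( my $token = $p->get_token ) {
        if ( $token->is_start_tag('script') || $token->is_start_tag('style') ) {
            $skip++;
        }
        elsif ( $token->is_end_tag('script') || $token->is_end_tag('style') ) {
            $skip-- if $skip > 0;
        }
        elsif ( $token->is_text && !$skip ) {
            push @text, $token->as_is;
        }
    }
    return join '', @text;
}

# Made-up sample page, used only to exercise the helper.
my $sample = '<html><head><style>p { color: red }</style></head>'
           . '<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>';
print visible_text($sample), "\n";
```

The same filtering could be dropped into the `while` loop of `do_GET_TXT` before the whitespace heuristics run. With Mojo::DOM, the analogous idea would be to remove the script and style elements from the tree before calling `all_text`. Neither approach accounts for text that a page hides with CSS or builds at runtime with JavaScript.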
Replies are listed 'Best First'.
Re: getting text from HTML
by haukex (Archbishop) on May 03, 2020 at 21:42 UTC
by IB2017 (Pilgrim) on May 04, 2020 at 08:43 UTC
by haukex (Archbishop) on May 04, 2020 at 09:46 UTC
by IB2017 (Pilgrim) on May 03, 2020 at 22:58 UTC
Re: getting text from HTML
by perlfan (Vicar) on May 12, 2020 at 03:26 UTC
Re: getting text from HTML
by Anonymous Monk on May 04, 2020 at 16:57 UTC
by Your Mother (Archbishop) on May 12, 2020 at 17:21 UTC
by Anonymous Monk on May 12, 2020 at 10:13 UTC