http://www.perlmonks.org?node_id=11105743


in reply to I want to save web pages as text rather than as HTML.

Hello anautismobserver and welcome to the monastery and to the wonderful world of Perl!

Perl is powerful enough to achieve this with a one-liner (note the double quotes, which the Windows shell requires):

perl -MHTML::TreeBuilder -e "print HTML::TreeBuilder->new_from_url('http://perl.org')->as_text"

The above combines two steps: getting the raw HTML content from the URL (using LWP::UserAgent under the hood) and formatting the output as text.

Web scraping is a dark art and can be done in many different ways. You can follow some of the links in my bibliotheca: web scraping, or visit previous threads like Re: How can I download HTML and save it as txt?

Since you introduced yourself as a beginner, please note that perl's -M switch loads a module, as described in perlrun, and that the chaining of methods ( ->new_from_url(..)->as_text ) is just a shortcut to avoid unnecessary variable declarations.
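
For comparison, here is a minimal sketch of the same job with the two steps written out and an intermediate variable at each stage. The sub names fetch_html and html_to_text and the script name page2text.pl are made up for illustration; the sketch also assumes LWP::UserAgent and HTML::TreeBuilder are installed.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# Step 1: download the raw HTML and return it as a string.
# (fetch_html is a hypothetical helper name, not part of any module.)
sub fetch_html {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($url);
    die "Cannot fetch $url: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Step 2: parse the HTML into a tree and return only its text.
# (html_to_text is a hypothetical helper name as well.)
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;
    $tree->delete;    # release the parse tree
    return $text;
}

# Do the work only when a URL is given on the command line.
if ( my $url = shift @ARGV ) {
    binmode STDOUT, ':encoding(UTF-8)';
    print html_to_text( fetch_html($url) ), "\n";
}
```

Run it as perl page2text.pl http://perl.org > perl_org.txt to save the text to a file. Keeping the two steps in separate subs makes it easy to inspect the raw HTML before it is converted.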

PS: you can also use other modules for the web-scraping part, as suggested by Task::Kensho, which is a fairly good collection of modules from CPAN. Other modules are also worth trying, like Mojo::DOM or Web::Scraper, as suggested in The State of Web spidering in Perl.

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Re^2: I want to save web pages as text rather than as HTML. -- oneliner
by daxim (Curate) on Sep 09, 2019 at 10:46 UTC
    The text method in WWW::Mechanize wraps that TreeBuilder code. This is useful to know because one is often already working with Mechanize or a class derived from it.
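
A minimal sketch of that Mechanize route, assuming WWW::Mechanize (and HTML::TreeBuilder, which its text method relies on) is installed. Taking the URL from the command line is a choice made for this illustration only.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck makes failed requests die with an error message.
my $mech = WWW::Mechanize->new( autocheck => 1 );

# Fetch only when a URL is given on the command line.
if ( my $url = shift @ARGV ) {
    $mech->get($url);
    binmode STDOUT, ':encoding(UTF-8)';
    print $mech->text, "\n";    # text of the current page, markup removed
}
```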
Re^2: I want to save web pages as text rather than as HTML. -- oneliner
by anautismobserver (Sexton) on Sep 10, 2019 at 21:18 UTC

    Thanks for all that info. It's a lot to digest.

    Despite the elegance of a one-liner, I prefer to take one step at a time.

    When I try to run the following code:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use LWP::Simple;
    use HTML::TreeBuilder;

    print HTML::TreeBuilder->new_from_url('http://perl.org')->as_text;

    I get the error message << Can't locate object method "new_from_url" via package "HTML::TreeBuilder" >>

    What else do I need to add to the code to make it work?

      Maybe you have a really old version and need an update. The method was added 2012-06-12 according to its change file. The example as you posted it works fine for me on a relatively current Perl installation with HTML::TreeBuilder version 5.03 on OS X.
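
One way to check which version is installed and whether it already provides the method is a short script like the sketch below. It only reports what it finds; how to upgrade depends on your Perl distribution.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Report the installed module version.
print "HTML::TreeBuilder version: ", HTML::TreeBuilder->VERSION, "\n";

# can() returns a true value only if the class provides the method.
if ( HTML::TreeBuilder->can('new_from_url') ) {
    print "new_from_url is available\n";
}
else {
    print "new_from_url is missing; the module is probably too old\n";
}
```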

        I was using Padre and DWIM Perl, following Gabor Szabo's instructions at https://perlmaven.com/installing-perl-and-getting-started

        I have just uninstalled DWIM Perl and installed Strawberry Perl 5.30.0.1 (64bit) for Windows into C:\Strawberry on my hard drive.

        The README.txt file tells me to run the following commands to manually set some environment variables:

        c:\myperl\relocation.pl.bat ... this is REQUIRED!

        c:\myperl\update_env.pl.bat ... this is OPTIONAL

        When I tried to run c:\Strawberry\relocation.pl.bat I got the error message "The system cannot find the path specified." However, there was a file "relocation.txt" in C:\Strawberry which appeared to run successfully. I can't find any files similar to update_env.pl.bat, however.

        I liked the convenience of Padre, but I don't want to run into more problems because it is not being kept up to date. I also have Notepad++. What do you recommend?

        Thanks for all your help. I'll probably have more questions later.

Re^2: I want to save web pages as text rather than as HTML. -- oneliner
by anautismobserver (Sexton) on Sep 11, 2019 at 19:18 UTC

    Now I have Strawberry Perl up and running and the previous TreeBuilder code example now works (using 'http://perl.org' as input).

    When I change the input to 'https://wordpress.com/read/feeds/94271045' using the following code:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use LWP::Simple;
    use HTML::TreeBuilder;

    print HTML::TreeBuilder->new_from_url('https://wordpress.com/read/feeds/94271045')->as_text;

    The output is << WordPress.comPlease enable JavaScript in your browser to enjoy WordPress.com. >>

    Do you know how to fix this? One complicating factor is that pages like https://wordpress.com/read/feeds/94271045 won't display properly in my browser unless I'm logged into a WordPress account.

    Thanks.