http://www.perlmonks.org?node_id=11105743


in reply to I want to save web pages as text rather than as HTML.

Hello anautismobserver, and welcome to the Monastery and to the wonderful world of Perl!

Perl is powerful enough to do this with a one-liner (note the double quotes around the code, which Windows requires):

perl -MHTML::TreeBuilder -e "print HTML::TreeBuilder->new_from_url('http://perl.org')->as_text"

The above combines two steps: getting the raw HTML content from the URL (using LWP::UserAgent under the hood) and formatting the output as text.
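To make those two steps visible, here is a minimal sketch that performs them separately and writes the result to a file, since saving was the original goal. The sub names html_to_text and save_page_as_text, the script name and the output file name are my own inventions for illustration; new_from_content is used here in place of new_from_url so that the fetching step stays explicit.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# Step 2 on its own: turn an HTML string into plain text.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;
    $tree->delete;    # free the parse tree explicitly
    return $text;
}

# Step 1 plus step 2: fetch the page, convert it, write it to a file.
sub save_page_as_text {
    my ($url, $file) = @_;
    my $response = LWP::UserAgent->new->get($url);
    die "Cannot fetch $url: ", $response->status_line, "\n"
        unless $response->is_success;
    open my $out, '>:encoding(UTF-8)', $file
        or die "Cannot write $file: $!\n";
    print {$out} html_to_text($response->decoded_content), "\n";
    close $out or die "Cannot close $file: $!\n";
    return $file;
}

# Hypothetical usage: perl save_text.pl http://perl.org perl_org.txt
save_page_as_text(@ARGV) if @ARGV == 2;
```

Keeping the conversion in its own sub means you can also feed it HTML you already have on disk, without touching the network.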

Web scraping is a dark art and can be done in many different ways. You can follow some of the links in the web scraping section of my bibliotheca, or visit previous threads such as Re: How can I download HTML and save it as txt?

Since you introduced yourself as a beginner, please note that perl's -M switch loads a module, as described in perlrun, and that chaining method calls ( ->new_from_url(..)->as_text ) is just a shortcut that avoids declaring unnecessary variables.
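If the chaining looks cryptic, this small sketch may help. It compares the chained form with the same calls spelled out using a variable. To keep it runnable without a network connection I assume a tiny HTML string of my own and use new_from_content instead of new_from_url; the variable names are only illustrative.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # in a script, this is what -MHTML::TreeBuilder does

my $html = '<html><body><p>Hello, monks!</p></body></html>';

# Chained: as_text is called directly on the object returned by the constructor.
my $chained = HTML::TreeBuilder->new_from_content($html)->as_text;

# The same thing with the intermediate object stored in a variable.
my $tree     = HTML::TreeBuilder->new_from_content($html);
my $stepwise = $tree->as_text;

print "same result\n" if $chained eq $stepwise;
```

The chained form is handy in a one-liner; the stepwise form is easier to debug, because you can inspect $tree before asking for its text.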

PS: you can also use other modules for the web scraping part, as suggested by Task::Kensho, which is a fairly good collection of modules from CPAN. Other modules such as Mojo::DOM or Web::Scraper are also worth trying, as suggested in The State of Web spidering in Perl.

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.