Re: I want to save web pages as text rather than as HTML. -- oneliner
by Discipulus (Abbot) on Sep 06, 2019 at 19:55 UTC
Hello anautismobserver and welcome to the monastery and to the wonderful world of Perl!
Perl is powerful enough to achieve this with a one-liner (pay attention to the double quotes around the code, which Windows requires):
perl -MHTML::TreeBuilder -e "print HTML::TreeBuilder->new_from_url('http://perl.org')->as_text"
The above combines two steps: getting the raw HTML content from the URL (using LWP::UserAgent under the hood) and formatting the output as text.
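To make the two steps visible, here is a minimal sketch that does them separately. It assumes LWP::UserAgent and HTML::TreeBuilder are installed; the timeout value and the error message are my own choices for illustration, not something the one-liner does.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

my $url = 'http://perl.org';

# Step 1: fetch the raw HTML (the timeout is an illustrative choice).
my $ua       = LWP::UserAgent->new( timeout => 10 );
my $response = $ua->get($url);
die "Cannot fetch $url: ", $response->status_line, "\n"
    unless $response->is_success;

# Step 2: parse the HTML and flatten it to plain text.
my $tree = HTML::TreeBuilder->new_from_content( $response->decoded_content );

binmode STDOUT, ':encoding(UTF-8)';   # avoid "Wide character" warnings
print $tree->as_text, "\n";

$tree->delete;   # release the parse tree explicitly
```

To save the text instead of printing it, redirect the output to a file, for example `perl script.pl > page.txt` (both file names are hypothetical).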
Web scraping is a dark art and can be done in many different ways. You can follow some of the links in my bibliotheca: web scraping or visit previous threads like Re: How can I download HTML and save it as txt?
As you presented yourself as a beginner, please note that the -M switch of perl loads a module (as if you had written use Module; in a script), as described in perlrun, and that chaining the method calls ( ->new_from_url(..)->as_text ) is just a shortcut to avoid an unnecessary variable declaration.
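A small sketch of both points, written as a script rather than a one-liner. To keep it runnable without network access it parses an in-memory HTML string with new_from_content instead of calling new_from_url; that sample string is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # what -MHTML::TreeBuilder does on the command line

# Hypothetical sample markup, used instead of a URL so no network is needed.
my $html = '<html><body><p>Hello <b>monks</b></p></body></html>';

# Chained form, as in the one-liner: no intermediate variable.
my $chained = HTML::TreeBuilder->new_from_content($html)->as_text;

# Step-by-step form: the same calls with the tree kept in a named variable.
my $tree     = HTML::TreeBuilder->new_from_content($html);
my $stepwise = $tree->as_text;
$tree->delete;   # release the parse tree explicitly

print "Both forms give the same text\n" if $chained eq $stepwise;
```

The chained form is handy on the command line; the step-by-step form is easier to extend when you later want to inspect or modify the tree before extracting the text.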
PS you can also use other modules for the web scraping part, as suggested by Task::Kensho, which is a fairly good collection of modules from CPAN. Other modules are also worth trying, like Mojo::DOM or Web::Scraper, as suggested in The State of Web spidering in Perl
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.