in reply to I want to save web pages as text rather than as HTML.
Perl is powerful enough to achieve this with a one-liner (note the double quotes, which the Windows shell requires; on Unix-like shells you would normally use single quotes):
perl -MHTML::TreeBuilder -e "print HTML::TreeBuilder->new_from_url('http://perl.org')->as_text"
The above combines two steps: getting the raw HTML content from the URL (using LWP::UserAgent under the hood) and formatting the output as text.
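If it helps to see those two steps separately, here is a minimal sketch of how they could be written out as a script. The sub names fetch_html and html_to_text, and the optional output file argument, are my own choices for illustration; they are not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# Step 1: fetch the raw HTML for a URL (dies on an HTTP error).
# fetch_html is a hypothetical helper name, not a module function.
sub fetch_html {
    my ($url) = @_;
    my $response = LWP::UserAgent->new->get($url);
    die "Cannot fetch $url: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Step 2: parse the HTML and return only its text.
# html_to_text is a hypothetical helper name as well.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;
    $tree->delete;    # free the parse tree explicitly
    return $text;
}

# Do the work only when a URL is given on the command line.
my ($url, $outfile) = @ARGV;
if (defined $url) {
    my $text = html_to_text(fetch_html($url));
    if (defined $outfile) {
        open my $fh, '>:encoding(UTF-8)', $outfile
            or die "Cannot write $outfile: $!\n";
        print {$fh} $text, "\n";
        close $fh or die "Cannot close $outfile: $!\n";
    }
    else {
        binmode STDOUT, ':encoding(UTF-8)';
        print $text, "\n";
    }
}
```

Saved as, say, save_text.pl (the file name is arbitrary), it could be run as: perl save_text.pl http://perl.org perl.txt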
Web scraping is a dark art and can be achieved in many different ways. You can follow some of the links in my bibliotheca: web scraping, or visit previous threads such as Re: How can I download HTML and save it as txt?
As you presented yourself as a beginner, please note that perl's -M switch imports a module, as described in perlrun, and that the chaining of method calls ( ->new_from_url(..)->as_text ) is just a shortcut to avoid unnecessary variable declarations.
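To make both points concrete, here is a small sketch that needs no network access: it parses an inline HTML string (made up for this example) with new_from_content instead of new_from_url, and compares the chained call with the step-by-step version.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # in a script, this is what -MHTML::TreeBuilder does

# Sample HTML invented for this example.
my $html = '<html><body><h1>Hello</h1><p>World</p></body></html>';

# Chained form, as in the one-liner: no variable holds the tree.
my $chained = HTML::TreeBuilder->new_from_content($html)->as_text;

# The same thing with a named intermediate variable.
my $tree      = HTML::TreeBuilder->new_from_content($html);
my $unchained = $tree->as_text;

print "both forms give the same text\n" if $chained eq $unchained;
```

The chained form is handy in a one-liner; in a longer script the named variable is often easier to debug.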
PS: you can also use other modules for the web scraping part, as suggested by Task::Kensho, which is a fairly good collection of modules from CPAN. Other modules are also worth trying, like Mojo::DOM or Web::Scraper, as suggested in The State of Web spidering in Perl
L*
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.