
Re: How to save a web page directly to plain text?

by jdporter (Chancellor)
on Mar 19, 2003 at 06:35 UTC (#244264)

in reply to How to save a web page directly to plain text?

There is! One very easy way is to use the lynx text-mode browser to retrieve and save the file, using its -dump option. Not only does it print just the text, it also attempts to format it (somewhat crudely) according to the HTML markup. lynx is available for most platforms, but if you don't already have it installed, you might not consider this option "easy".
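From a shell, that amounts to something like lynx -dump URL > page.txt. If you'd rather drive it from Perl, here is a minimal sketch, assuming lynx is on your PATH. The sub names lynx_dump_command and page_as_text are made up for illustration, and the list form of a piped open may not work on older Windows perls.

```perl
use strict;
use warnings;

# Build the lynx command as a list so the URL never passes through a shell.
# (lynx_dump_command is a hypothetical helper name, not part of any module.)
sub lynx_dump_command {
    my ($url) = @_;
    return ('lynx', '-dump', $url);
}

# Run lynx and return the rendered plain text of the page.
# Dies if lynx cannot be started or exits with an error.
sub page_as_text {
    my ($url) = @_;
    open my $lynx, '-|', lynx_dump_command($url)
        or die "Cannot run lynx: $!\n";
    my $text = do { local $/; <$lynx> };    # slurp all of lynx's output
    close $lynx or die "lynx exited with status $?\n";
    return $text;
}

# Dump every URL given on the command line to STDOUT.
print page_as_text($_) for @ARGV;
```

Saved as, say, dump.pl, this would be run as perl dump.pl URL > page.txt.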

Another way is to use LWP (or LWP::Simple) to retrieve the file, and one of the HTML parsing modules (such as HTML::TreeBuilder) to extract the text from it. For example:

    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TreeBuilder;

    my $URL = shift or die "Usage: $0 URL\n";

    # get() returns undef on failure.
    my $html = get($URL);
    defined $html or die "Error getting $URL\n";

    # Parse the HTML and print only its text content.
    my $tree = HTML::TreeBuilder->new_from_content($html);
    print $tree->as_trimmed_text, "\n";

The 6th Rule of Perl Club is -- There is no Rule #6.
