PerlMonks  

How to save a web page directly to plain text?

by tsumay (Initiate)
on Mar 19, 2003 at 05:55 UTC ( #244253=perlquestion )

tsumay has asked for the wisdom of the Perl Monks concerning the following question:

Is there any way to save web pages directly as plain text? Right now I save each page as HTML and then strip out the tags, but there has to be a more efficient way of doing this. Is there?

2006-10-25 Retitled by planetscape, as per Monastery guidelines: one-word (or module-only) titles hinder site navigation

Original title: 'HTMLtoText'

Re: How to save a web page directly to plain text?
by jdporter (Chancellor) on Mar 19, 2003 at 06:35 UTC
    There is! One very easy way is to use the lynx text-mode browser to retrieve and save the file, using its -dump option. Not only does it print just the text, it also attempts to format it (somewhat crudely) according to the HTML markup. lynx is available for most platforms, but unless you already have it, you might not consider this option "easy".
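
    If you want to drive lynx from a Perl script rather than from the shell, something like the following might do. It is only a sketch: it assumes lynx is installed and on your PATH, and the sub names lynx_dump_command and fetch_as_text are made up for illustration. -nolist is an optional extra that leaves the numbered link list out of the dump.

```perl
use strict;
use warnings;

# Build the lynx argument list (hypothetical helper name).
# -dump prints the rendered page to stdout; -nolist omits the link list.
sub lynx_dump_command {
    my ($url) = @_;
    return ( 'lynx', '-dump', '-nolist', $url );
}

# Run lynx and return the rendered text (hypothetical helper name).
# Uses the list form of a piped open, so the URL never passes through a shell.
sub fetch_as_text {
    my ($url) = @_;
    open my $fh, '-|', lynx_dump_command($url)
        or die "Cannot run lynx: $!\n";
    local $/;    # slurp the whole output at once
    my $text = <$fh>;
    close $fh or die "lynx exited with status $?\n";
    return $text;
}

# Only fetch when run as a script with a URL argument.
if ( !caller && @ARGV ) {
    print fetch_as_text( $ARGV[0] );
}
```

    Redirect the output to a file to save it.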

    Another way is to use LWP (or LWP::Simple) to retrieve the file, and one of the HTML parsing modules (such as HTML::TreeBuilder) to parse the text out of it. For example:
        use LWP::Simple;
        use HTML::TreeBuilder;

        my $URL = shift or die "Usage: $0 URL\n";

        # Fetch the page, parse it into a tree, and print only its text.
        print HTML::TreeBuilder
            ->new_from_content( get( $URL ) or die "Error getting $URL\n" )
            ->as_trimmed_text;

    jdporter
    The 6th Rule of Perl Club is -- There is no Rule #6.

Re: How to save a web page directly to plain text?
by allolex (Curate) on Mar 19, 2003 at 06:42 UTC

    How many pages? All browsers that I have ever used allow saving to plain text format. I frequently use lynx to do this sort of thing, and even better for getting pages that have frames (and no alternative) is links. The --dump option is the same.

    --
    Allolex

Re: How to save a web page directly to plain text?
by drfrog (Deacon) on Mar 19, 2003 at 18:51 UTC
    depending on how much work you have to do,
    you might want HTML::Parser to step up for yah.

    more on it at CPAN, of course
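
    For what it's worth, here is a rough sketch of what that could look like with HTML::Parser's version 3 handler API. The sub name html_to_text is made up for illustration, and skipping the contents of script and style elements is my own assumption about what "plain text" should mean here.

```perl
use strict;
use warnings;
use HTML::Parser;

# Return the text of an HTML string with tags removed
# (hypothetical helper name, not part of HTML::Parser).
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $skip = 0;    # non-zero while inside <script> or <style>

    my $p = HTML::Parser->new(
        api_version => 3,
        # Track entry into and exit from elements whose content is not page text.
        start_h => [ sub { $skip++ if $_[0] =~ /^(?:script|style)$/ }, 'tagname' ],
        end_h   => [ sub { $skip-- if $skip && $_[0] =~ /^(?:script|style)$/ }, 'tagname' ],
        # 'dtext' is the text with entities such as &amp; already decoded.
        text_h  => [ sub { $text .= $_[0] unless $skip }, 'dtext' ],
    );
    $p->parse($html);
    $p->eof;

    # Collapse runs of whitespace and trim both ends.
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}
```

    Feed it the content returned by LWP::Simple's get and print the result.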
