PerlMonks  

How can I download HTML and save it as txt?

by tassex (Initiate)
on Aug 30, 2005 at 21:14 UTC (#487945=perlquestion)
tassex has asked for the wisdom of the Perl Monks concerning the following question:

Hello. Can someone tell me how to download an HTML page but save it as a txt file? I do NOT want to convert a page that already exists on my PC (e.g. open the HTML file, open an empty txt file, copy the contents across, save the txt file). I want something like a "Save as txt" feature: the user supplies the source, and the result is a txt file with the contents of the HTML page. I assume that the initial HTML page will contain only text. Note: if you would rather not post code for my query, please at least point me to the exact tutorials I should read.

Re: How can I download HTML and save it as txt?
by InfiniteSilence (Curate) on Aug 30, 2005 at 21:19 UTC
    perl -e "use LWP::Simple; getprint('http://myfoo.com')" >> myfile.txt

    Celebrate Intellectual Diversity

      This would be nicer, IMHO:

      perl -MLWP::Simple -e"getstore('http://myfoo.com', 'myfile.txt')"

      Not sure what tassex really wants, though. Do you (tassex) want to store the HTML in a .txt file (like above), or do you want to strip the HTML and save the text?

      --
      b10m

      All code is usually tested, but rarely trusted.
Re: How can I download HTML and save it as txt?
by jeffa (Chancellor) on Aug 30, 2005 at 21:22 UTC

    You can always use your browser of choice: they all have a 'Save As' and you can choose 'As Text'. Something tells me this is not sufficient for you, however. The Perl Cookbook has a recipe devoted to converting HTML to ASCII. This is straight from the first edition, Recipe 20.5:

    use HTML::FormatText;
    use HTML::Parse;
    $html = parse_htmlfile($filename);
    $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
    $ascii = $formatter->format($html);

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: How can I download HTML and save it as txt?
by ikegami (Pope) on Aug 30, 2005 at 21:23 UTC

    I have problems understanding your question, but at least one of the following modules should help you.

    Any of LWP::Simple, LWP::UserAgent and WWW::Mechanize will help you download a web page.

    As for converting the HTML to text, HTML::FormatText and possibly HTML::FormatText::WithLinks should be of interest.

    Update: I see others have already posted answers. InfiniteSilence posted an example of downloading a web page and saving it as HTML in a file with the extension .txt. jeffa posted an example of converting HTML to text. Pick and choose what you want.
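
A minimal sketch of how the two halves mentioned above (downloading with LWP::UserAgent, converting with HTML::FormatText) might be combined. The helper names `html_to_text` and `save_page_as_text`, the 30-second timeout, and the UTF-8 output encoding are assumptions made for illustration; the margins are the ones from the Cookbook recipe quoted earlier in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTML::FormatText;

# Convert an HTML string to plain text (hypothetical helper name).
sub html_to_text {
    my ($html) = @_;
    my $tree      = HTML::TreeBuilder->new_from_content($html);
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
    my $text      = $formatter->format($tree);
    $tree->delete;    # release the parse tree
    return $text;
}

# Download $url and write its text rendering to $file (hypothetical helper name).
sub save_page_as_text {
    my ($url, $file) = @_;

    my $ua       = LWP::UserAgent->new(timeout => 30);    # timeout is an assumption
    my $response = $ua->get($url);
    die "Could not fetch $url: ", $response->status_line, "\n"
        unless $response->is_success;

    # Writing UTF-8 is an assumption; pick whatever encoding suits the target.
    open my $out, '>:encoding(UTF-8)', $file
        or die "Cannot write $file: $!\n";
    print {$out} html_to_text($response->decoded_content);
    close $out or die "Cannot close $file: $!\n";
}

# Runs only when both arguments are supplied, e.g.
#   perl save_as_text.pl http://myfoo.com myfile.txt
save_page_as_text(@ARGV) if @ARGV == 2;
```

Keeping the conversion in its own sub means it can be tried on a literal HTML string without touching the network.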

Re: How can I download HTML and save it as txt?
by chanio (Priest) on Aug 30, 2005 at 21:27 UTC
    If you don't want to save it as text from your browser, you could do it like this...

    Create a new printer, but choose a plain text printer, and add the option to save to a file instead of printing. Then, when you choose Print in any application, you will be prompted for the name and location of your text file. And that's it! It will write the page as if you had an old plain text matrix printer (without any graphics)...

    But it is better to have Firefox and its incredible extensions (Copy to...).

    { \ ( ' v ' ) / }
    ( \ _ / ) _ _ _ _ ` ( ) ' _ _ _ _
    ( = ( ^ Y ^ ) = ( _ _ ^ ^ ^ ^
    _ _ _ _ \ _ ( m _ _ _ m ) _ _ _ _ _ _ _ _ _ ) c h i a n o , a l b e r t o
    Wherever I lay my KNOPPIX disk, a new FREE LINUX nation could be established
Re: How can I download HTML and save it as txt?
by trammell (Priest) on Aug 30, 2005 at 21:40 UTC
    % lynx -dump www.google.com > google.txt
Re: How can I download HTML and save it as txt?
by tassex (Initiate) on Aug 30, 2005 at 22:06 UTC
    Cheers to ALL of you!! More than one way to do so.. :)
Re: How can I download HTML and save it as txt?
by jdporter (Canon) on Aug 31, 2005 at 04:48 UTC
    Here's one way I can think of:
    use LWP::Simple;
    use HTML::TreeBuilder;
    use IO::File;

    IO::File
        ->new( "> $file" )
        ->print( HTML::TreeBuilder
            ->new_from_content( get $url )
            ->as_text );
    Short and sweet. But lacking any kind of error handling. :-(
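
One way the missing error handling could be added, as a sketch rather than a definitive version. It keeps the same modules (LWP::Simple and HTML::TreeBuilder) but swaps IO::File for a plain three-argument `open`; the helper name `page_text`, the command-line arguments, and the UTF-8 output layer are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder;

# Return the text content of an HTML string (hypothetical helper name).
sub page_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;
    $tree->delete;    # release the parse tree
    return $text;
}

my ($url, $file) = @ARGV;

# Do the download only when both arguments were given, e.g.
#   perl html2txt.pl http://myfoo.com myfile.txt
if (defined $url && defined $file) {
    # LWP::Simple's get() returns undef on failure.
    my $html = get($url);
    defined $html or die "Could not download $url\n";

    # Writing UTF-8 is an assumption.
    open my $out, '>:encoding(UTF-8)', $file
        or die "Cannot write $file: $!\n";
    print {$out} page_text($html), "\n";
    close $out or die "Cannot close $file: $!\n";
}
```

Note that `as_text` simply concatenates the text of the document, so paragraph breaks are not preserved; HTML::FormatText, mentioned elsewhere in the thread, is the better fit when layout matters.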
Re: How can I download HTML and save it as txt?
by CountZero (Bishop) on Aug 31, 2005 at 05:58 UTC
    To let you in on a big secret: HTML files are already text! You will be hard pressed to save them as anything other than text.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: How can I download HTML and save it as txt?
by Anonymous Monk on Aug 31, 2005 at 07:03 UTC
    lwp-request -m get -o text http://myfoo.com >myfile.txt
