Capturing web pages and making them static

by Biff (Acolyte)
on Aug 04, 2004 at 21:03 UTC

Biff has asked for the wisdom of the Perl Monks concerning the following question:

Before I ask my question, here's a little background in case I'm just being stupid in my attempted implementation. We have reports that we need to make available to manglers :), but the reports are big and slow, etc. The poor manglers get impatient and unhappy.

My mission is to generate these reports at night and save static copies of them for the manglers' viewing pleasure.

What I'm trying to emulate is the capability most browsers have of saving a web page as a complete static entity. When you do this in a browser, you get the page saved as HTML plus a folder full of goodies and other artifacts that let you view the page statically, just as you would have viewed it off the server.

Will a simple LWP get do this? I kinda doubt it and was hoping to avoid blind alleys.

thanks,

Biff


Re: Capturing web pages and making them static
by perrin (Chancellor) on Aug 04, 2004 at 21:24 UTC
    You can use LWP, probably with the lwp-mirror tool, or you can use wget or curl.
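    For a single page, the mirror() function in LWP::Simple (which is what the lwp-mirror script wraps) is about as small as it gets. A minimal sketch, with a made-up report URL and target file:

        use LWP::Simple qw(mirror);

        # Hypothetical report URL and output path -- adjust for your site.
        my $url  = 'http://reports.example.com/nightly.html';
        my $file = '/var/www/html/static/nightly.html';

        # mirror() saves $url into $file; on later runs it sends
        # If-Modified-Since and only re-downloads when the page has changed.
        my $status = mirror($url, $file);

        # 200 means freshly fetched, 304 means the local copy was already current.
        print "mirror() returned HTTP status $status\n";

    Note that this saves only the HTML itself, not images or stylesheets; wget's --page-requisites and --convert-links options handle those.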
Re: Capturing web pages and making them static
by Aristotle (Chancellor) on Aug 04, 2004 at 22:20 UTC

    This is not exactly trivial. In addition to having to fetch the pages, you need to parse them, find all the links to additional resources, download these resources, and change the links to point to the local copies.

    You probably don't want to write this yourself. Of the tools available on most any Unix box, wget is capable of doing this for you. If you want something written in Perl, try w3mir.
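
    A rough sketch of that fetch-parse-rewrite loop, using LWP::UserAgent and HTML::LinkExtor. The starting URL and output directory are invented for illustration; a real tool also needs error handling, duplicate-name handling, and more parsing than this (frames, CSS-referenced images, etc.):

        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTML::LinkExtor;
        use URI;
        use File::Basename qw(basename);

        my $page_url = 'http://reports.example.com/nightly.html';   # made up
        my $out_dir  = './static_report';
        mkdir $out_dir;

        my $ua   = LWP::UserAgent->new;
        my $resp = $ua->get($page_url);
        die "GET $page_url failed: ", $resp->status_line unless $resp->is_success;
        my $html = $resp->content;

        # Pull out the page requisites: <img src>, <script src>, <link href>.
        my @links;
        my $extor = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            push @links, $attr{src}  if defined $attr{src};
            push @links, $attr{href} if $tag eq 'link' && defined $attr{href};
        });
        $extor->parse($html);
        $extor->eof;

        # Fetch each resource and point the page at the local copy.
        for my $link (@links) {
            my $abs   = URI->new_abs($link, $page_url);
            my $local = basename($abs->path) or next;
            my $res   = $ua->get($abs, ':content_file' => "$out_dir/$local");
            next unless $res->is_success;
            $html =~ s/\Q$link\E/$local/g;   # crude textual rewrite; fine for a sketch
        }

        open my $fh, '>', "$out_dir/index.html" or die "can't write index.html: $!";
        print {$fh} $html;
        close $fh;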

    Makeshifts last the longest.

Re: Capturing web pages and making them static
by MidLifeXis (Monsignor) on Aug 04, 2004 at 21:05 UTC

    If you use Apache, look at mod_proxy. IIRC, there is an example in the documentation of how to solve exactly your problem.

    --MidLifeXis

Re: Capturing web pages and making them static
by Fletch (Bishop) on Aug 04, 2004 at 23:33 UTC

    Erm, if you're generating the pages yourself, just rewrite things to output the HTML to a file. Or, if you don't want to change the CGIs, write a wrapper script that sets the appropriate environment variables ($ENV{QUERY_STRING}=q{foo=bar&zagnork=wubble};), then runs the CGI and redirects the output into a file. Then you run your wrapper from cron and save the results somewhere under your document root.
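
    A bare-bones version of such a wrapper (the CGI path, query string, and output file are hypothetical). The CGI prints its own header block first, so the sketch strips everything up to the first blank line before saving:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Fake the environment the report CGI expects (values are made up).
        $ENV{REQUEST_METHOD} = 'GET';
        $ENV{QUERY_STRING}   = 'foo=bar&zagnork=wubble';

        my $cgi = '/var/www/cgi-bin/report.cgi';        # hypothetical script
        my $out = '/var/www/html/static/report.html';   # somewhere under the doc root

        my $output = qx($cgi);
        die "$cgi exited with status $?" if $?;

        # Drop the "Content-type: ..." header block the CGI printed.
        (my $body = $output) =~ s/\A.*?\r?\n\r?\n//s;

        open my $fh, '>', $out or die "can't write $out: $!";
        print {$fh} $body;
        close $fh;

    Run that from cron each night and the manglers get the pre-generated copy.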

Re: Capturing web pages and making them static
by GaijinPunch (Pilgrim) on Aug 04, 2004 at 23:41 UTC
    I am actually working on something similar to this myself (except I parse the info and email what I want to myself). If you just want to grab a web page from somewhere on the net and save it, LWP will work perfectly.
        use LWP::Simple;
        my $url     = "http://www.perlmonks.org";
        my $webpage = get( $url );
    Then just write $webpage somewhere, and you've got a "saved" html file. I think this is what you're asking for. EDIT: I don't think LWP::Simple will save header information though, which may or may not be an issue for you.
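    One way to write $webpage out to disk (the filename is arbitrary):

        open my $fh, '>', 'perlmonks.html' or die "can't write: $!";
        print {$fh} $webpage;
        close $fh;

    LWP::Simple's getstore($url, $file) does the fetch and the save in a single call, if you'd rather skip the intermediate variable.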
Re: Capturing web pages and making them static
by johnnywang (Priest) on Aug 05, 2004 at 00:03 UTC
    A few other CPAN modules might be useful: WWW::Mechanize (sketched below) and WWW::WebRobot. I've only used these for automated tests. I guess in your case you'd still need to change the links in the original page (to images, say) to point to local copies. w3mir can be used for downloading a whole subtree, but it most likely can't do exactly what you want, since your page can (theoretically) include an image from somewhere else.
      Interesting -- do any of the above modules do the image-name changing for you? A favorite game developer of mine is reworking their entire website, and I was thinking of backing it up before it gets too restructured... they leave a lot of good information out sometimes.
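    A minimal WWW::Mechanize sketch of the extraction half of this: fetch a page and print the absolute URLs of its images, which is the list you would then download and rewrite (the URL is only an example):

        use strict;
        use warnings;
        use WWW::Mechanize;
        use URI;

        my $mech = WWW::Mechanize->new;
        $mech->get('http://www.example.com/report.html');   # example page

        # images() returns WWW::Mechanize::Image objects for the <img> tags.
        for my $img ($mech->images) {
            print URI->new_abs($img->url, $mech->base), "\n";
        }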
Re: Capturing web pages and making them static
by Wassercrats (Initiate) on Aug 05, 2004 at 01:14 UTC
    I read part of the synopsis for w3mir at http://search.cpan.org/dist/w3mir/w3mir.PL. I wish people who don't write English well would have someone edit the documentation for them.

    This part makes setup sound complicated:

    For authentication and passwords, multiple site retrievals and such you will have to resort to a "CONFIGURATION-FILE". If browsing from a filesystem references ending in '/' needs to be rewritten to end in '/index.html', and in any case, if there are URLs that are redirected will need to be changed to make the mirror browseable, see the documentation of Fixup in the "CONFIGURATION-FILE" secton.

    w3mirs default behavior is to do as little as possible and to be as nice as possible to the server(s) it is getting documents from. You will need to read through the options list to make w3mir do more complex, and, useful things.

    People are quick to say not to reinvent the wheel, but they'll still tell you to deal with modules when there are complete solutions available. Look at http://www.httrack.com/.
