PerlMonks  

Re: Download web page including css files, images, etc.

by starX (Chaplain)
on Jan 25, 2007 at 14:54 UTC (#596501)


in reply to Download web page including css files, images, etc.

I would take a look at HTTP::Lite. It should be easy enough for Perl to download a web page, save it to disk as index.html, do a regexp scan of the page for the files you're looking for, and then download each of those items. Something like...

use HTTP::Lite;

my $http = new HTTP::Lite;
my $req = $http->request("http://www.something.com")
    or die "Unable to get document: $!";
my $mirror_home = '/home/user/mirror_home/';
my (@javascript, @css, @jpg);
my $i = 0;

while ($http->body()){
    if ($_ =~ m/*.jpg/){ push $_, @jpg;}
    else if ($_ =~ m/*.js/){ push $_, @javascript;}
    else if ($_ =~ m/*.css/){ push $_, @css;}
}

open FILE, "> $mirror_home/index.html"
    or die "Couldn't open $mirror_home/index.html : $!";
print FILE $http->body();
close FILE;

while ($i <= $#css){
    $req = $http->request("http://www.something.com/$css[$i]")
        or die "Unable to get document: $!";
    open FILE, "> $mirror_home/$css[$i]";
    print FILE $http->body();
    close FILE;
    $i++
}
$i = 0;
# Then repeat for other extensions.

As a fair warning, the above is definitely untested and probably horribly over-simplified, but the basic idea seems sound to me.
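
For the scanning step on its own, here is a minimal sketch, assuming the links appear as quoted src or href attributes. The helper name classify_links and the sample HTML are made up for illustration and are not part of HTTP::Lite. It walks the page text once with a global match and buckets each link by extension.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): scan an HTML string once
# and bucket src/href values by file extension.
sub classify_links {
    my ($html) = @_;
    my %found = (jpg => [], js => [], css => []);

    # Crude attribute match; a real HTML parser would be more robust.
    while ($html =~ /(?:src|href)\s*=\s*["']([^"']+)["']/gi) {
        my $link = $1;
        if ($link =~ /\.(jpg|js|css)$/i) {
            push @{ $found{ lc $1 } }, $link;
        }
    }
    return \%found;
}

# Made-up sample page, just to show the call.
my $sample = '<link href="style.css"><script src="app.js"></script><img src="logo.jpg">';
my $found  = classify_links($sample);
print "$_: @{ $found->{$_} }\n" for sort keys %$found;
```

The same function could be fed the string returned by $http->body() after a successful request.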


Re^2: Download web page including css files, images, etc.
by trendle (Novice) on Feb 08, 2012 at 03:25 UTC
    Yes, just in case anyone tries it: it is untested, and unfortunately it has some bugs. Apart from the syntax errors, which are quickly fixed (e.g. it should be 'push @x, $_', not the other way round), there's one HUGE problem. The while loop, as written, never terminates! There's no end condition, since $http->body() returns the whole page every time it is called. So I think this is a good starting point, but you then need to take the HTML returned by $http->body() and use an HTML parser to get the bits you want. Sorry, but I don't have the code for this at present. If I get something working I'll post it. But I thought it wise to warn the unwary.
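
    One way the parser approach could look, as an untested-against-a-real-site sketch: it assumes HTML::LinkExtor (from the HTML-Parser distribution on CPAN) is installed, and that asset links are relative with no subdirectories, as in the original post. The helper names asset_links and mirror_assets are made up for illustration. HTTP::Lite is only loaded inside the download helper, so the parsing part can be tried without any network access.

```perl
use strict;
use warnings;
use HTML::LinkExtor;    # from the HTML-Parser distribution on CPAN

# Hypothetical helper: return the src/href values of img, script and link tags.
sub asset_links {
    my ($html) = @_;
    my @assets;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        return unless $tag =~ /^(?:img|script|link)$/;
        push @assets, grep { defined } @attr{qw(src href)};
    });
    $parser->parse($html);
    $parser->eof;
    return @assets;
}

# Hypothetical helper: fetch each asset once and save it under $dir.
# Assumes relative links with no subdirectories, like the original post.
sub mirror_assets {
    my ($base, $dir, @assets) = @_;
    require HTTP::Lite;
    my $http = HTTP::Lite->new;
    for my $file (@assets) {
        $http->reset;    # the HTTP::Lite docs say to reset a handle before reuse
        $http->request("$base/$file") or die "Unable to get $file: $!";
        open my $out, '>', "$dir/$file" or die "Couldn't open $dir/$file: $!";
        binmode $out;    # images are binary data
        print {$out} $http->body;
        close $out;
    }
    return;
}

# Parsing only; the HTML here is a made-up sample.
print "$_\n" for asset_links('<link rel="stylesheet" href="style.css"><img src="logo.jpg">');
```

    The loop ends because it iterates over a fixed list of links rather than testing $http->body(). A call such as mirror_assets('http://www.something.com', '/home/user/mirror_home', asset_links($html)) would then do the downloads.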
