Re: Download web page including css files, images, etc.

by starX (Chaplain)
on Jan 25, 2007

in reply to Download web page including css files, images, etc.

I would take a look at HTTP::Lite. It should be easy enough in Perl to download a web page, do a regexp scan for the files you're looking for, save the page itself to disk as index.html, and then start downloading all the other items you found. Something like...
    use HTTP::Lite;

    my $http = new HTTP::Lite;
    my $req = $http->request("") or die "Unable to get document: $!";
    my $mirror_home = '/home/user/mirror_home/';
    my (@javascript, @css, @jpg);
    my $i = 0;

    while ($http->body()){
        if ($_ =~ m/*.jpg/){ push $_, @jpg;}
        else if ($_ =~ m/*.js/){ push $_, @javascript;}
        else if ($_ =~ m/*.css/){ push $_, @css;}
    }

    open FILE, "> $mirror_home/index.html" or die "Couldn't open $mirror_home/index.html : $!";
    print FILE $http->body();
    close FILE;

    while ($i <= $#css){
        $req = $http->request("$css[$i]") or die "Unable to get document: $!";
        open FILE, "> $mirror_home/$css[$i]";
        print FILE $http->body();
        close FILE;
        $i++
    }
    $i = 0;
    # Then repeat for other extensions.
As a fair warning, the above is definitely untested and probably horribly over-simplified, but the basic idea seems sound to me.
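
To make the "regexp scan" step a bit more concrete, here is a minimal sketch that works on an HTML string rather than a live download. The helper name collect_assets and the sample page are made up for illustration. It only looks at quoted src/href attributes, so treat it as a rough starting point, not a robust link extractor.

```perl
use strict;
use warnings;

# Hypothetical helper: scan an HTML string once and bucket the linked
# files by extension. A regex scan is fragile; this only shows the idea.
sub collect_assets {
    my ($html) = @_;
    my %assets = (css => [], js => [], jpg => []);

    # Visit every quoted src="..." or href="..." value exactly once.
    # The /g match advances through the string, so the loop terminates.
    while ($html =~ /\b(?:src|href)\s*=\s*["']([^"']+)["']/gi) {
        my $url = $1;
        if    ($url =~ /\.jpe?g$/i) { push @{ $assets{jpg} }, $url }
        elsif ($url =~ /\.js$/i)    { push @{ $assets{js} },  $url }
        elsif ($url =~ /\.css$/i)   { push @{ $assets{css} }, $url }
    }
    return \%assets;
}

# Made-up sample page for demonstration.
my $page = <<'HTML';
<html><head>
<link rel="stylesheet" href="style/main.css">
<script src="js/app.js"></script>
</head><body>
<img src="img/logo.jpg"> <a href="about.html">About</a>
</body></html>
HTML

my $found = collect_assets($page);
print "$_: @{ $found->{$_} }\n" for sort keys %$found;
```

In the real script the string would come from a single call to $http->body() after a successful request, and each collected URL would then be fetched and written under $mirror_home.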

Re^2: Download web page including css files, images, etc.
on Feb 08, 2012
    Yes (just in case anyone tries it), it is untested... and unfortunately has some bugs. Apart from the syntax errors that are quickly fixed (e.g. it should be 'push @x, $_', not the other order used), there's one HUGE problem. The WHILE statement, as written, will loop forever! There's no end condition, since $http->body() returns the whole page every time it is called. So I think this is a good starting point... but you then need to take the 'html' returned by $http->body() and use an HTML parser to get the bits you want. Sorry, but I don't have the code for this at present. If I get something working I'll post it. But I thought it wise to warn the unwary.
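
For anyone landing here later, a minimal sketch of the parser step could use HTML::LinkExtor. That module is an assumption on my part: it comes with the HTML-Parser distribution on CPAN, it is not core Perl, and the thread never names a specific parser. The helper name asset_links and the sample HTML are made up for illustration.

```perl
use strict;
use warnings;
use HTML::LinkExtor;   # from the HTML-Parser distribution on CPAN (not core Perl)

# Hypothetical helper: return the URLs of images, scripts and stylesheets
# referenced by an HTML string, in document order.
sub asset_links {
    my ($html) = @_;
    my $parser = HTML::LinkExtor->new;   # no callback: links are collected internally
    $parser->parse($html);
    $parser->eof;

    my @assets;
    for my $link ($parser->links) {
        # Each entry is [ $tag, $attr_name => $url, ... ]
        my ($tag, %attr) = @$link;
        push @assets, $attr{src}
            if ($tag eq 'img' || $tag eq 'script') && defined $attr{src};
        push @assets, $attr{href}
            if $tag eq 'link' && defined $attr{href};
    }
    return @assets;
}

# Made-up sample input; in the real script this would be the string
# returned by one call to $http->body().
my $html = '<link rel="stylesheet" href="main.css"><img src="logo.jpg"><a href="x.html">x</a>';
print "$_\n" for asset_links($html);
```

The page is parsed once, so there is no loop that can run forever. Relative URLs would still need to be resolved against the page address before they are requested.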

