Re: Download web page including css files, images, etc.
by jhourcle (Prior) on Jan 25, 2007 at 14:39 UTC
I think I can accomplish this with wget, but not directly:
1. Download the single file.
2. Figure out what wget called the file (should be only one text file in the directory structure).
3. Tell wget to do a full mirror of the file.
4. Link index.html to the file found in step #2.
Obviously, this wouldn't be unique to wget -- you could use the logic with anything that can get all of the dependencies.
Update: bah ... you probably can't just symlink it, as if it has relative links it'll crap out ... you might have to re-adjust the directory structure instead (there's a flag to wget (--cut-dirs) to reduce the number of directories deep it goes ... you could figure out what to pass it in step #2, I guess)
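A hedged sketch of those steps in Perl (untested; the URL, directory, and choice of wget flags are just examples):
use strict;
use warnings;
use File::Find;

my $url = 'http://www.example.com/some/page';
my $dir = 'mirror';

# Step 1: download just the single file into its own tree.
system('wget', '--quiet', '-P', $dir, $url) == 0
    or die "wget failed: $?";

# Step 2: figure out what wget called it (the only file so far).
my $file;
find(sub { $file = $File::Find::name if -f }, $dir);
die "nothing downloaded\n" unless defined $file;

# Step 3: fetch the page plus its dependencies into the same tree
# (-nc so the file from step 1 isn't clobbered or duplicated).
system('wget', '--quiet', '-nc', '-P', $dir, '--page-requisites', $url) == 0
    or die "wget failed: $?";

# Step 4: link index.html to the file found in step 2 (see the
# update above: relative links may still break).
(my $rel = $file) =~ s{^\Q$dir\E/}{};
symlink $rel, "$dir/index.html" or die "symlink: $!";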
Re: Download web page including css files, images, etc.
by Arunbear (Prior) on Jan 25, 2007 at 14:32 UTC
httrack mostly does what you want, e.g. when http://example.com/ is mirrored, the initial page will be saved as http://example.com/index.html, and this is the case even if the start url was /index.asp or /index.php or /index.whatever.
However, a start url like http://example.com/home.php will be saved as http://example.com/home.html, and I don't think there is an option for overriding that behaviour.
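For reference, the basic invocation looks like this (the output directory here is just an example):
httrack "http://example.com/" -O ./mirror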
Re: Download web page including css files, images, etc.
by starX (Chaplain) on Jan 25, 2007 at 14:54 UTC
I would take a look at HTTP::Lite. It should be easy enough for Perl to download a web page, do a regexp scan for the files you're looking for, save that file to disk as index.html, and then start downloading all the other items you're looking for. Something like...
use HTTP::Lite;

my $http = new HTTP::Lite;
my $req = $http->request("http://www.something.com")
    or die "Unable to get document: $!";
my $mirror_home = '/home/user/mirror_home/';
my (@javascript, @css, @jpg);
my $i = 0;

while ($http->body()) {
    if ($_ =~ m/*.jpg/)      { push $_, @jpg; }
    else if ($_ =~ m/*.js/)  { push $_, @javascript; }
    else if ($_ =~ m/*.css/) { push $_, @css; }
}

open FILE, "> $mirror_home/index.html"
    or die "Couldn't open $mirror_home/index.html : $!";
print FILE $http->body();
close FILE;

while ($i <= $#css) {
    $req = $http->request("http://www.something.com/$css[$i]")
        or die "Unable to get document: $!";
    open FILE, "> $mirror_home/$css[$i]";
    print FILE $http->body();
    close FILE;
    $i++
}
$i = 0;
# Then repeat for other extensions.
As a fair warning the above is definitely untested and probably horribly over-simplified, but the basic idea seems sound to me.
Yes (just in case anyone tries it), it is untested... and unfortunately has some bugs. Apart from the syntax errors that are quickly fixed (e.g. it should be 'push @x, $_', not the other order used), there's one HUGE problem.
The while statement, as written, will continue to download from the web page forever! There's no end condition, since $http->body() grabs the whole page over and over.
So I think this is a good starting point... but you then need to take the HTML returned by $http->body() and use an HTML parser to get the bits you want.
Sorry, but I don't have the code for this at present. If I get something working I'll post it. But I thought it wise to warn the unwary.
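In the meantime, a rough, untested sketch of the parser-based version (HTML::LinkExtor is just one possible choice of parser; the URL and paths are placeholders):
use strict;
use warnings;
use HTTP::Lite;
use HTML::LinkExtor;
use URI;

my $base        = 'http://www.example.com/';
my $mirror_home = '/home/user/mirror_home';

my $http = HTTP::Lite->new;
defined $http->request($base)
    or die "Unable to get document: $!";
my $html = $http->body();        # grab the body exactly once

open my $fh, '>', "$mirror_home/index.html"
    or die "Couldn't open $mirror_home/index.html: $!";
print $fh $html;
close $fh;

# Let a parser collect the script/stylesheet/image URLs instead of
# regexp-scanning body() in a loop.
my @resources;
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @resources, $attr{src} if $attr{src};
    push @resources, $attr{href} if $tag eq 'link' and $attr{href};
});
$extor->parse($html);
$extor->eof;

for my $link (@resources) {
    my $uri = URI->new_abs($link, $base);
    my $get = HTTP::Lite->new;
    next unless defined $get->request($uri->as_string);
    (my $local = $uri->path) =~ s{^/}{};
    # A real mirror would create intermediate directories here.
    open my $out, '>', "$mirror_home/$local" or next;
    print $out $get->body();
    close $out;
}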
Re: Download web page including css files, images, etc.
by gaal (Parson) on Jan 25, 2007 at 14:14 UTC
Can't you download to a temporary area with wget and rename the html to "index.html"?
wget --page-requisites http://en.wikipedia.org/
The output produced is:
en.wikipedia.org/
|-- robots.txt
`-- wiki
`-- Main_Page
Determining that wiki/Main_Page should be transformed to index.html is hard.
Try
wget --server-response http://en.wikipedia.org/
and you can parse out the redirects:
--13:13:55-- http://en.wikipedia.org/
=> `index.html'
Resolving en.wikipedia.org... 66.230.200.100
Connecting to en.wikipedia.org|66.230.200.100|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.0 301 Moved Permanently
Date: Thu, 25 Jan 2007 18:13:41 GMT
Server: Apache
X-Powered-By: PHP/5.1.4
Vary: Accept-Encoding,Cookie
Cache-Control: s-maxage=1200, must-revalidate, max-age=0
Last-Modified: Thu, 25 Jan 2007 18:13:41 GMT
Location: http://en.wikipedia.org/wiki/Main_Page
Content-Type: text/html
X-Cache: HIT from sq28.wikimedia.org
X-Cache-Lookup: HIT from sq28.wikimedia.org:80
Age: 14
X-Cache: HIT from sq26.wikimedia.org
X-Cache-Lookup: HIT from sq26.wikimedia.org:80
Via: 1.0 sq28.wikimedia.org:80 (squid/2.6.STABLE9), 1.0 sq26.wikimedia.org:80 (squid/2.6.STABLE9)
Connection: close
Location: http://en.wikipedia.org/wiki/Main_Page [following]
--13:13:55-- http://en.wikipedia.org/wiki/Main_Page
=> `Main_Page'
Connecting to en.wikipedia.org|66.230.200.100|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.0 200 OK
Date: Thu, 25 Jan 2007 18:13:44 GMT
Server: Apache
X-Powered-By: PHP/5.1.4
Content-Language: en
Vary: Accept-Encoding,Cookie
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Last-Modified: Thu, 25 Jan 2007 17:28:15 GMT
Content-Type: text/html; charset=utf-8
Age: 11
X-Cache: HIT from sq30.wikimedia.org
X-Cache-Lookup: HIT from sq30.wikimedia.org:80
Via: 1.0 sq30.wikimedia.org:80 (squid/2.6.STABLE9)
Connection: close
Length: unspecified [text/html]
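If you'd rather script that, here's a hedged sketch (untested; the URL is just the example above) that pulls the final Location out of wget's header dump:
use strict;
use warnings;

my $url = 'http://en.wikipedia.org/';

# wget prints the response headers on stderr; fold them into stdout.
my $log = qx{wget --server-response -O /dev/null "$url" 2>&1};

my $final = $url;
while ($log =~ /^\s*Location:\s*(\S+)/mg) {
    $final = $1;               # keep the last redirect target
}
print "final URL: $final\n";   # e.g. http://en.wikipedia.org/wiki/Main_Page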
Re: Download web page including css files, images, etc.
by davis (Vicar) on Jan 25, 2007 at 14:18 UTC
wget --mirror
davis
Kids, you tried your hardest, and you failed miserably. The lesson is: Never try.
wget --page-requisites http://example.com/
However, if you try that, you will see that it fails if you also try to specify the output filename (with -O).
Re: Download web page including css files, images, etc.
by Anonymous Monk on Jan 25, 2007 at 14:55 UTC
I don't think wget will work in all situations.
1) It doesn't seem to handle the BASE element correctly (which I believe has been part of the HTML specification for a very long time).
2) "-k" won't translate links in a CSS file to local links; consider: #someid { background: url(folder/picture.jpg) center center; }
Johannes
|
True, I just thought I'd point this out to the original poster: wget won't do the job all the time. If he needs something that works every time, he'd need to use wget and do some of the work manually in case a BASE element is involved or CSS is being used for images (maybe there are other problems there I haven't thought of?), or write it from scratch ...
The trick would be going through the HTML and CSS specs and finding every different way objects can be referenced/included/linked to, etc. I'm sure there's plenty!
Johannes
Re: Download web page including css files, images, etc.
by Scott7477 (Chaplain) on Jul 18, 2007 at 20:59 UTC
I don't know if it is available, but perhaps a look at Microsoft's specification for their "web archive" .mht file format might be helpful. If you are viewing a webpage in Internet Explorer, you have the option of saving the whole page as a single file ending with the .mht extension. I certainly wish that Firefox had this capability. Writing an extension for Firefox that provides it is on the part of my project list that's labeled "pie in the sky" :)...
For the record, there is now MHT functionality for Firefox:
UnMHT