http://www.perlmonks.org?node_id=596482

skx has asked for the wisdom of the Perl Monks concerning the following question:

I would like to download a complete webpage including any referenced .css, .js, and images - and have the page rewritten to reference the local copies.

However, to complicate matters, I wish to mandate that the initial file be saved as “index.html” - regardless of what it was originally called.

This appears to rule wget out, as the --output-document=index.html option trumps the --page-requisites flag (which is used to download the referenced images, etc.).

Now, using LWP I can download the remote URI, and I assume I could parse the links out with HTML::Parser or HTML::TreeBuilder - however, this seems like such a simple request that I wondered whether there are any existing libraries to do this kind of thing?

Searching CPAN for "http rewrite" and "http mirror" didn't find anything that seemed suitable, but any pointers gratefully received in case I'm searching for the wrong terms.

(Similarly if curl, wget, httrack, etc, can do this with clever options I'm not 100% committed to using perl!)
Steve
--
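For reference, a minimal sketch of the LWP + HTML::TreeBuilder route mentioned in the question - fetch the page, rewrite image/script/stylesheet references to bare local filenames, and write the result out as index.html. The URL and output directory are placeholders, and fetching the requisites themselves is left out:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TreeBuilder;
    use URI;

    my $url = 'http://example.com/some/page.php';   # placeholder start URL
    my $dir = 'mirror';                             # placeholder output directory

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get($url);
    die "Fetch failed: ", $res->status_line, "\n" unless $res->is_success;

    my $tree = HTML::TreeBuilder->new_from_content($res->decoded_content);

    # Point img/script src and stylesheet href attributes at bare local filenames.
    for my $el ($tree->look_down(_tag => qr/^(?:img|script|link)$/)) {
        for my $attr (qw(src href)) {
            my $ref = $el->attr($attr) or next;
            (my $local = URI->new_abs($ref, $url)->path) =~ s{.*/}{};
            $el->attr($attr, $local) if length $local;
        }
    }

    mkdir $dir;
    open my $out, '>', "$dir/index.html" or die "Can't write index.html: $!";
    print $out $tree->as_HTML('<>&', '  ');
    close $out;
    $tree->delete;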

Replies are listed 'Best First'.
Re: Download web page including css files, images, etc.
by jhourcle (Prior) on Jan 25, 2007 at 14:39 UTC

    I think I can accomplish this with wget, but not directly:

    1. Download the single file.
    2. Figure out what wget called the file (should be only one text file in the directory structure)
    3. Tell wget to do a full mirror of the file
    4. link index.html to the file found in step #2

    Obviously, this wouldn't be unique to wget -- you could use the logic with anything that can get all of the dependencies.

    Update: bah ... you probably can't just symlink it, as if it has relative links it'll crap out ... you might have to then re-adjust the directory structure (wget's --cut-dirs option reduces how many directories deep it goes ... you could work out what to pass to it from what you find in step #2, I guess)
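
    A minimal Perl sketch of those four steps, shelling out to wget (the URL and output directory are placeholders, and step 2's "only one text file" assumption is taken at face value):

        use strict;
        use warnings;
        use File::Find;
        use File::Copy qw(copy);

        my $url = 'http://example.com/';   # placeholder start URL
        my $dir = 'mirror';                # placeholder output directory

        # Steps 1 and 3: fetch the page and its requisites, converting
        # links so they refer to the local copies.
        system('wget', '--page-requisites', '--convert-links',
               '--directory-prefix', $dir, $url) == 0
            or die "wget failed: $?";

        # Step 2: work out which file is the page itself - assume it is
        # the only text file that is not an obvious requisite.
        my @pages;
        find({ no_chdir => 1, wanted => sub {
            return unless -f $_ && -T $_;
            return if /\.(?:css|js|jpe?g|gif|png|ico)$/i;
            push @pages, $_;
        }}, $dir);
        die "Expected one page, found: @pages\n" unless @pages == 1;

        # Step 4: copy rather than symlink, though relative links may
        # still need adjusting, as the update above says.
        copy($pages[0], "$dir/index.html") or die "copy failed: $!";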

      Thanks, I think your approach is interesting. I will give it a try and only fall back to mangling and parsing myself if it doesn't work out.

      Steve
      --
Re: Download web page including css files, images, etc.
by Arunbear (Prior) on Jan 25, 2007 at 14:32 UTC
    httrack mostly does what you want, e.g. when http://example.com/ is mirrored, the initial page will be saved as http://example.com/index.html, and this is the case even if the start URL was /index.asp or /index.php or /index.whatever.

    However, a start URL like http://example.com/home.php will be saved as http://example.com/home.html, and I don't think there is an option to override that behaviour.

Re: Download web page including css files, images, etc.
by starX (Chaplain) on Jan 25, 2007 at 14:54 UTC
    I would take a look at HTTP::Lite. It should be easy enough in Perl to download a web page, do a regexp scan for the files you're looking for, save that page to disk as index.html, and then start downloading all the other items you're looking for. Something like...
    use HTTP::Lite;

    my $http = new HTTP::Lite;
    my $req = $http->request("http://www.something.com")
        or die "Unable to get document: $!";
    my $mirror_home = '/home/user/mirror_home/';
    my (@javascript, @css, @jpg);
    my $i = 0;

    while ($http->body()){
        if ($_ =~ m/*.jpg/)      { push $_, @jpg; }
        else if ($_ =~ m/*.js/)  { push $_, @javascript; }
        else if ($_ =~ m/*.css/) { push $_, @css; }
    }

    open FILE, "> $mirror_home/index.html"
        or die "Couldn't open $mirror_home/index.html : $!";
    print FILE $http->body();
    close FILE;

    while ($i <= $#css){
        $req = $http->request("http://www.something.com/$css[$i]")
            or die "Unable to get document: $!";
        open FILE, "> $mirror_home/$css[$i]";
        print FILE $http->body();
        close FILE;
        $i++
    }
    $i = 0;
    # Then repeat for other extensions.
    As a fair warning the above is definitely untested and probably horribly over-simplified, but the basic idea seems sound to me.
      Yes (just in case anyone tries it) it is untested... and unfortunately has some bugs. Apart from the syntax errors that are quickly fixed (e.g. it should be 'push @x, $_', not the other order used), there's one HUGE problem: the while statement, as written, will continue to download from the web page forever! There's no end condition, since $http->body() returns the whole page over and over. So I think this is a good starting point... but you then need to take the HTML returned by $http->body() and use an HTML parser to get the bits you want. Sorry, but I don't have the code for this at present. If I get something working I'll post it. But I thought it wise to warn the unwary.
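
      A minimal sketch of that parser-based approach, swapping the regex loop for LWP::Simple plus HTML::TreeBuilder (the URL and directory are the same placeholders as above, and rewriting the links inside index.html is still left out):

          use strict;
          use warnings;
          use LWP::Simple qw(get getstore);
          use HTML::TreeBuilder;
          use URI;

          my $base        = 'http://www.something.com/';   # placeholder URL
          my $mirror_home = '/home/user/mirror_home';

          my $html = get($base) or die "Unable to get document\n";
          my $tree = HTML::TreeBuilder->new_from_content($html);

          # Collect src/href values that look like page requisites.
          my @requisites;
          for my $el ($tree->look_down(_tag => qr/^(?:img|script|link)$/)) {
              my $ref = $el->attr('src') || $el->attr('href') or next;
              push @requisites, $ref if $ref =~ /\.(?:jpe?g|gif|png|js|css)(?:\?|$)/i;
          }

          open my $fh, '>', "$mirror_home/index.html"
              or die "Couldn't open $mirror_home/index.html: $!";
          print $fh $html;
          close $fh;

          # Fetch each requisite next to index.html (paths flattened for simplicity).
          for my $ref (@requisites) {
              my $abs = URI->new_abs($ref, $base);
              (my $file = $abs->path) =~ s{.*/}{};
              getstore($abs, "$mirror_home/$file") if length $file;
          }

          $tree->delete;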
Re: Download web page including css files, images, etc.
by gaal (Parson) on Jan 25, 2007 at 14:14 UTC
    Can't you download to a temporary area with wget and rename the html to "index.html"?

      I thought about this, but couldn't see the obvious way of determining the "main" file.

      For example if you run:

      wget --page-requisites http://en.wikipedia.org/
      

      The output produced is:

      en.wikipedia.org/
      |-- robots.txt
      `-- wiki
          `-- Main_Page
      

      Determining that wiki/Main_Page should be transformed to index.html is hard.

      Steve
      --
        Just do:
        wget --server-response http://en.wikipedia.org/
        and you can parse out the redirects:
        --13:13:55--  http://en.wikipedia.org/
                   => `index.html'
        Resolving en.wikipedia.org... 66.230.200.100
        Connecting to en.wikipedia.org|66.230.200.100|:80... connected.
        HTTP request sent, awaiting response...
          HTTP/1.0 301 Moved Permanently
          Date: Thu, 25 Jan 2007 18:13:41 GMT
          Server: Apache
          X-Powered-By: PHP/5.1.4
          Vary: Accept-Encoding,Cookie
          Cache-Control: s-maxage=1200, must-revalidate, max-age=0
          Last-Modified: Thu, 25 Jan 2007 18:13:41 GMT
          Location: http://en.wikipedia.org/wiki/Main_Page
          Content-Type: text/html
          X-Cache: HIT from sq28.wikimedia.org
          X-Cache-Lookup: HIT from sq28.wikimedia.org:80
          Age: 14
          X-Cache: HIT from sq26.wikimedia.org
          X-Cache-Lookup: HIT from sq26.wikimedia.org:80
          Via: 1.0 sq28.wikimedia.org:80 (squid/2.6.STABLE9), 1.0 sq26.wikimedia.org:80 (squid/2.6.STABLE9)
          Connection: close
        Location: http://en.wikipedia.org/wiki/Main_Page [following]
        --13:13:55--  http://en.wikipedia.org/wiki/Main_Page
                   => `Main_Page'
        Connecting to en.wikipedia.org|66.230.200.100|:80... connected.
        HTTP request sent, awaiting response...
          HTTP/1.0 200 OK
          Date: Thu, 25 Jan 2007 18:13:44 GMT
          Server: Apache
          X-Powered-By: PHP/5.1.4
          Content-Language: en
          Vary: Accept-Encoding,Cookie
          Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
          Last-Modified: Thu, 25 Jan 2007 17:28:15 GMT
          Content-Type: text/html; charset=utf-8
          Age: 11
          X-Cache: HIT from sq30.wikimedia.org
          X-Cache-Lookup: HIT from sq30.wikimedia.org:80
          Via: 1.0 sq30.wikimedia.org:80 (squid/2.6.STABLE9)
          Connection: close
        Length: unspecified [text/html]
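
        If you would rather stay in Perl, a small sketch that follows the redirects with LWP::UserAgent and reports the final URL, which tells you which saved file to rename to index.html (the URL is just the example above):

            use strict;
            use warnings;
            use LWP::UserAgent;

            my $ua  = LWP::UserAgent->new;
            my $res = $ua->get('http://en.wikipedia.org/');
            die "Request failed: ", $res->status_line, "\n" unless $res->is_success;

            # After redirects, the request object holds the final URL;
            # its last path segment is the name wget would save it under.
            my $final = $res->request->uri;
            (my $leaf = $final->path) =~ s{.*/}{};
            print "final URL: $final\n";
            print "saved as:  ", ($leaf || 'index.html'), "\n";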
Re: Download web page including css files, images, etc.
by davis (Vicar) on Jan 25, 2007 at 14:18 UTC
    wget --mirror

    davis
    Kids, you tried your hardest, and you failed miserably. The lesson is: Never try.

      Whilst your response is appreciated, that doesn't do what I want.

      Indeed the obvious response might have been:

      wget --page-requisites http://example.com/
      

      However if you try that you will see that it fails if you also try to specify the output filename.

      Steve
      --
Re: Download web page including css files, images, etc.
by Anonymous Monk on Jan 25, 2007 at 14:55 UTC
    I don't think wget will work in all situations.

    1) It doesn't seem to handle the BASE element correctly (which I believe has been part of the HTML specification for a very long time).
    2) "-k" won't translate links in CSS files to local links; consider: #someid { background: url(folder/picture.jpg) center center; } (a way of handling this is sketched below).

    Johannes
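
    A rough sketch of the post-processing point 2 implies - rewriting url() references inside a fetched CSS file to point at local copies, which wget's -k does not do (the file path is a placeholder, and the regex only covers the plain url(...) form):

        use strict;
        use warnings;
        use File::Basename qw(basename);

        my $css_file = 'mirror/style.css';   # placeholder path to a downloaded stylesheet

        open my $in, '<', $css_file or die "Can't read $css_file: $!";
        my $css = do { local $/; <$in> };
        close $in;

        # Rewrite url(...) references to bare local filenames, remembering
        # the original URLs so they can be fetched separately if needed.
        my @assets;
        $css =~ s{url\(\s*['"]?([^'")]+)['"]?\s*\)}{
            push @assets, $1;
            'url(' . basename($1) . ')';
        }ge;

        open my $out, '>', $css_file or die "Can't write $css_file: $!";
        print $out $css;
        close $out;

        print "referenced asset: $_\n" for @assets;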

      True, but I think it is the most "standard" tool for the job - short of doing the parsing and rewriting myself.

      Steve
      --
        True, I just thought I'd point this out to the original poster: wget won't do the job all the time. If he needs something that works every time, he'd need to use wget and do some of the work manually in case the BASE element is involved or CSS is being used for images (maybe there are other problems I haven't thought of?) - or write it from scratch ...

        The trick would be going through the HTML and CSS specs and finding every different way objects can be referenced/included/linked to, etc. I'm sure there are plenty!

        Johannes
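
        Much of the HTML side of that list is already collected in HTML::Tagset's %linkElements table (tag => URL-carrying attributes), which modules like HTML::LinkExtor build on; a tiny sketch that just prints it, assuming the module is installed (CSS url() references still need separate handling):

            use strict;
            use warnings;
            use HTML::Tagset;

            # Print each tag alongside the attributes that may hold a URL.
            for my $tag (sort keys %HTML::Tagset::linkElements) {
                my $attrs = $HTML::Tagset::linkElements{$tag};
                my @attrs = ref $attrs ? @$attrs : ($attrs);
                printf "%-10s %s\n", $tag, join(', ', @attrs);
            }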
Re: Download web page including css files, images, etc.
by Scott7477 (Chaplain) on Jul 18, 2007 at 20:59 UTC
    I don't know if it is available, but perhaps a look at Microsoft's specification for their "web archive" .mht file format might be helpful. If you are looking at a webpage in Internet Explorer, you have the option of saving the whole page as a single file ending with the .mht extension. I certainly wish that Firefox had this capability. Writing an extension for Firefox that provides it is on the part of my project list that's labeled "pie in the sky" :)...