PerlMonks  

Re: Download web page including css files, images, etc.

by gaal (Parson)
on Jan 25, 2007 at 14:14 UTC (#596484)


in reply to Download web page including css files, images, etc.

Can't you download to a temporary area with wget and rename the HTML file to "index.html"?


Re^2: Download web page including css files, images, etc.
by skx (Parson) on Jan 25, 2007 at 14:23 UTC

    I thought about this, but couldn't see an obvious way of determining the "main" file.

    For example, if you run:

    wget --page-requisites http://en.wikipedia.org/
    

    The output produced is:

    en.wikipedia.org/
    |-- robots.txt
    `-- wiki
        `-- Main_Page
    

    Determining that wiki/Main_Page should be transformed to index.html is hard.

    Steve
    --
      Just do:
      wget --server-response http://en.wikipedia.org/
      and you can parse out the redirects:
      --13:13:55-- http://en.wikipedia.org/ => `index.html'
      Resolving en.wikipedia.org... 66.230.200.100
      Connecting to en.wikipedia.org|66.230.200.100|:80... connected.
      HTTP request sent, awaiting response...
        HTTP/1.0 301 Moved Permanently
        Date: Thu, 25 Jan 2007 18:13:41 GMT
        Server: Apache
        X-Powered-By: PHP/5.1.4
        Vary: Accept-Encoding,Cookie
        Cache-Control: s-maxage=1200, must-revalidate, max-age=0
        Last-Modified: Thu, 25 Jan 2007 18:13:41 GMT
        Location: http://en.wikipedia.org/wiki/Main_Page
        Content-Type: text/html
        X-Cache: HIT from sq28.wikimedia.org
        X-Cache-Lookup: HIT from sq28.wikimedia.org:80
        Age: 14
        X-Cache: HIT from sq26.wikimedia.org
        X-Cache-Lookup: HIT from sq26.wikimedia.org:80
        Via: 1.0 sq28.wikimedia.org:80 (squid/2.6.STABLE9), 1.0 sq26.wikimedia.org:80 (squid/2.6.STABLE9)
        Connection: close
      ---> Location: http://en.wikipedia.org/wiki/Main_Page [following]
      --13:13:55-- http://en.wikipedia.org/wiki/Main_Page => `Main_Page'
      Connecting to en.wikipedia.org|66.230.200.100|:80... connected.
      HTTP request sent, awaiting response...
        HTTP/1.0 200 OK
        Date: Thu, 25 Jan 2007 18:13:44 GMT
        Server: Apache
        X-Powered-By: PHP/5.1.4
        Content-Language: en
        Vary: Accept-Encoding,Cookie
        Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
        Last-Modified: Thu, 25 Jan 2007 17:28:15 GMT
        Content-Type: text/html; charset=utf-8
        Age: 11
        X-Cache: HIT from sq30.wikimedia.org
        X-Cache-Lookup: HIT from sq30.wikimedia.org:80
        Via: 1.0 sq30.wikimedia.org:80 (squid/2.6.STABLE9)
        Connection: close
      Length: unspecified [text/html]
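
For what it's worth, the "parse out the redirects" step might look something like the rough shell sketch below. The function names `last_location` and `url_to_local_path` are made up for illustration, not part of wget. The sketch assumes wget prints each `Location:` header as its own whitespace-separated field, as in the log above, and that wget saves pages under its default host-directory layout (as in the `en.wikipedia.org/wiki/Main_Page` tree earlier in the thread).

```shell
#!/bin/sh
# Hypothetical helper: read `wget --server-response` output on stdin and
# print the URL from the last "Location:" line, or nothing if there was
# no redirect. Comparing the whole field skips "Content-Location:".
last_location() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "Location:") loc = $(i + 1) }
         END { if (loc != "") print loc }'
}

# Hypothetical helper: map a URL to the path wget would save it under.
# Assumes wget's default host-directory layout and a URL with no query
# string and no trailing slash; it only strips the scheme prefix.
url_to_local_path() {
    printf '%s\n' "${1#*://}"
}

# Example use (needs network access, so it is left commented out):
#   url=$(wget --server-response -O /dev/null http://en.wikipedia.org/ 2>&1 | last_location)
#   wget --page-requisites http://en.wikipedia.org/
#   cp "$(url_to_local_path "$url")" index.html
```

wget writes its log to stderr, hence the `2>&1` before the pipe. If `last_location` prints nothing, there was no redirect and the originally requested URL is presumably the "main" file already.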
