Re: Download web page including css files, images, etc.

by gaal (Parson)
on Jan 25, 2007 at 14:14 UTC (#596484)


in reply to Download web page including css files, images, etc.

Can't you download to a temporary area with wget and rename the HTML file to "index.html"?


Re^2: Download web page including css files, images, etc.
by skx (Parson) on Jan 25, 2007 at 14:23 UTC

    I thought about this, but couldn't see an obvious way of determining the "main" file.

    For example, if you run:

    wget --page-requisites http://en.wikipedia.org/
    

    The output produced is:

    en.wikipedia.org/
    |-- robots.txt
    `-- wiki
        `-- Main_Page
    

    Determining that wiki/Main_Page should be transformed to index.html is hard.

    Steve
    --
      Just do:
      wget --server-response http://en.wikipedia.org/
      and you can parse out the redirects:
      --13:13:55--  http://en.wikipedia.org/
                 => `index.html'
      Resolving en.wikipedia.org... 66.230.200.100
      Connecting to en.wikipedia.org|66.230.200.100|:80... connected.
      HTTP request sent, awaiting response...
        HTTP/1.0 301 Moved Permanently
        Date: Thu, 25 Jan 2007 18:13:41 GMT
        Server: Apache
        X-Powered-By: PHP/5.1.4
        Vary: Accept-Encoding,Cookie
        Cache-Control: s-maxage=1200, must-revalidate, max-age=0
        Last-Modified: Thu, 25 Jan 2007 18:13:41 GMT
        Location: http://en.wikipedia.org/wiki/Main_Page
        Content-Type: text/html
        X-Cache: HIT from sq28.wikimedia.org
        X-Cache-Lookup: HIT from sq28.wikimedia.org:80
        Age: 14
        X-Cache: HIT from sq26.wikimedia.org
        X-Cache-Lookup: HIT from sq26.wikimedia.org:80
        Via: 1.0 sq28.wikimedia.org:80 (squid/2.6.STABLE9), 1.0 sq26.wikimedia.org:80 (squid/2.6.STABLE9)
        Connection: close
      ---> Location: http://en.wikipedia.org/wiki/Main_Page [following]
      --13:13:55--  http://en.wikipedia.org/wiki/Main_Page
                 => `Main_Page'
      Connecting to en.wikipedia.org|66.230.200.100|:80... connected.
      HTTP request sent, awaiting response...
        HTTP/1.0 200 OK
        Date: Thu, 25 Jan 2007 18:13:44 GMT
        Server: Apache
        X-Powered-By: PHP/5.1.4
        Content-Language: en
        Vary: Accept-Encoding,Cookie
        Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
        Last-Modified: Thu, 25 Jan 2007 17:28:15 GMT
        Content-Type: text/html; charset=utf-8
        Age: 11
        X-Cache: HIT from sq30.wikimedia.org
        X-Cache-Lookup: HIT from sq30.wikimedia.org:80
        Via: 1.0 sq30.wikimedia.org:80 (squid/2.6.STABLE9)
        Connection: close
      Length: unspecified [text/html]
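
A rough sketch of that parsing step, in case it helps. The function names final_location and local_path are made up for this example, and the sketch assumes two things not stated above: that the response headers appear one per line as wget prints them, and that wget stores the page under host/path, as the directory listing earlier in the thread suggests. Query strings and wget's own file-name escaping are not handled.

```shell
# Hypothetical helper: read `wget --server-response` output on stdin and
# print the target of the last "Location:" header seen, i.e. the URL the
# redirect chain ended on. Prints nothing if there was no redirect.
final_location() {
    awk '$1 == "Location:" { url = $2 } END { if (url != "") print url }'
}

# Hypothetical helper: map a URL to the relative path wget is assumed to
# use for it (scheme stripped; a trailing "/" becomes "index.html").
local_path() {
    target=${1#*://}
    case $target in
        */) target=${target}index.html ;;
    esac
    printf '%s\n' "$target"
}

# Possible usage (wget writes its log to stderr, hence 2>&1):
#   url=$(wget --server-response http://en.wikipedia.org/ 2>&1 | final_location)
#   main=$(local_path "${url:-http://en.wikipedia.org/}")
```

Taking the last Location header only makes sense when the log covers a single page fetch. With --page-requisites in the same run, redirects for images or stylesheets would show up in the log as well.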
