PerlMonks  

Best way to recursively grab a website

by ghenry (Vicar)
on Mar 29, 2005 at 10:41 UTC ( [id://443092]=perlquestion )

ghenry has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I have a two-part program that sits in the hooks folder of a Subversion repository. After a commit, the first part needs to check out or download the latest version, either with the svn command or via the standard Subversion Apache browse interface. The second part uploads the newly grabbed files to aberdeen.pm.org via perldav.

The second part is done.

For the first part, I create a temporary folder with File::Temp and then move on. Should I use LWP for recursive downloads? I don't really want to have a load of system commands etc.
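For reference, the svn-command route can be sketched as below, assuming a single external call is acceptable. The `file://` repository URL form, the `site` directory name, and the `export_command` helper are assumptions made for illustration; only `tempdir` from File::Temp and the standard `svn export` options are taken as given.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Build the argument list for `svn export`.
# Assumption: the repository is reachable locally through a file:// URL.
sub export_command {
    my ($repos, $rev, $target) = @_;
    return ('svn', 'export', '--quiet', '--revision', $rev,
            "file://$repos", $target);
}

# Subversion runs post-commit with the repository path and the new revision.
if (@ARGV == 2) {
    my ($repos, $rev) = @ARGV;
    my $tmp    = tempdir(CLEANUP => 1);             # removed when the script exits
    my $target = File::Spec->catdir($tmp, 'site');  # created by svn export
    system(export_command($repos, $rev, $target)) == 0
        or die "svn export failed: $?\n";
    # ... hand $target to the upload step here ...
}
```

The list form of `system` avoids the shell entirely, so this stays a single, contained external call rather than a load of them.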

What module/feature have I blindly missed?

Thanks,

Gavin.

Walking the road to enlightenment... I found a penguin and a camel on the way..... Fancy a yourname@perl.me.uk? Just ask!!!

Replies are listed 'Best First'.
Re: Best way to recursively grab a website
by gjb (Vicar) on Mar 29, 2005 at 11:53 UTC

    If you don't mind one system call, you could go with wget, an excellent tool for downloading an entire website. Command-line options let you restrict downloads to a single site, to a certain depth, and so on. All in all, a very valuable tool. It can be found at http://www.gnu.org/software/wget/wget.html.
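A minimal sketch of that single system call from Perl might look like this. The `wget_command` helper and the depth of 5 are illustrative assumptions; the options themselves are standard wget long options.

```perl
use strict;
use warnings;

# Assemble wget arguments for a depth-limited mirror of one site.
sub wget_command {
    my ($url, $dir, $depth) = @_;
    return (
        'wget',
        '--recursive',              # follow links
        "--level=$depth",           # ... but only this deep
        '--no-parent',              # never climb above the start URL
        '--no-host-directories',    # do not create a hostname folder
        "--directory-prefix=$dir",  # save everything under $dir
        $url,
    );
}

if (@ARGV == 2) {
    my ($url, $dir) = @ARGV;
    # List-form system() bypasses the shell, so the URL needs no quoting.
    system(wget_command($url, $dir, 5)) == 0
        or die "wget failed with status ", $? >> 8, "\n";
}
```

Recursive wget stays on the starting host unless told otherwise, which covers the "single site" restriction without an extra option.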

    Did I mention it's free software (a GNU project to be precise)?

    Hope this helps, -gjb-

      I think that will be the easiest method.

      Thanks.

Re: Best way to recursively grab a website
by tlm (Prior) on Mar 29, 2005 at 10:52 UTC

    WWW::Mechanize is my tool of choice for that sort of thing.
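As a rough sketch of how a recursive grab could look with WWW::Mechanize: the `crawl` and `local_path` helpers and the `index.html` naming are assumptions for illustration, not part of the module. The crawl only follows links that begin with the start URL, so the start URL should end in a slash.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(mkpath);
use File::Spec;

# Map a URL path to a local file name; directory-style URLs get index.html.
sub local_path {
    my ($dir, $url_path) = @_;
    $url_path .= 'index.html' if $url_path eq '' || $url_path =~ m{/$};
    # Drop empty, "." and ".." segments so nothing lands outside $dir.
    my @parts = grep { length && $_ ne '.' && $_ ne '..' } split m{/}, $url_path;
    return File::Spec->catfile($dir, @parts);
}

# Breadth-first crawl that stays below the start URL.
sub crawl {
    my ($start, $dir) = @_;
    require WWW::Mechanize;                  # loaded on first use
    my $mech  = WWW::Mechanize->new(autocheck => 0);
    my @queue = ($start);
    my %seen  = ($start => 1);
    while (defined(my $url = shift @queue)) {
        $mech->get($url);
        next unless $mech->success;          # skip pages that failed to load
        my $file = local_path($dir, $mech->uri->path);
        mkpath(dirname($file));
        $mech->save_content($file);
        next unless $mech->is_html;          # only HTML pages carry links
        for my $link ($mech->links) {
            my $abs = $link->url_abs;
            $abs->fragment(undef);           # "#section" is the same page
            my $next = $abs->as_string;
            next unless index($next, $start) == 0;
            push @queue, $next unless $seen{$next}++;
        }
    }
}

crawl(@ARGV) if @ARGV == 2;
```

This does not handle a URL that is both a file and a directory prefix (`/foo` and `/foo/bar`); a listing where directories end in a slash avoids that case.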

    the lowliest monk

      Of course. I remember that now. I have read loads of articles about it, but it slipped my mind.

      Cheers!

      I'll post my code later.

Re: Best way to recursively grab a website
by webchalkboard (Scribe) on Mar 29, 2005 at 10:45 UTC

    maybe i'm just not as techy as you guys, but that first bit was double dutch to me :)

    you could try using wget, that's a program i've used in the past for mirroring websites. otherwise there is a mirror routine in the LWP::Simple module on CPAN which might do it.

    http://search.cpan.org/~gaas/libwww-perl-5.803/lib/LWP/Simple.pm
    mirror($url, $file)
    Get and store a document identified by a URL, using If-modified-since, and checking the Content-Length. Returns the HTTP response code.
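A small sketch of calling `mirror` for one URL and interpreting the response code it returns. The `describe_status` helper is an assumption for illustration; 304 is the standard HTTP "Not Modified" status, which is what comes back when the local copy is already current.

```perl
use strict;
use warnings;

# Turn mirror()'s HTTP status code into a short description.
sub describe_status {
    my ($code) = @_;
    return 'unchanged' if $code == 304;                # Not Modified
    return 'fetched'   if $code >= 200 && $code < 300;
    return "failed ($code)";
}

if (@ARGV == 2) {
    require LWP::Simple;                               # loaded on first use
    my ($url, $file) = @ARGV;
    my $code = LWP::Simple::mirror($url, $file);
    print "$url: ", describe_status($code), "\n";
}
```

Each call handles exactly one document, so mirroring a whole site this way still needs something to discover the URLs.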

    Hope that's some help.

    Learning without thought is labor lost; thought without learning is perilous. - Confucius
    WebChalkboard.com | For the love of art...

      Yeah, sorry. I added the background of the problem just so that people who are familiar with Subversion understand where I am coming from.

      I have looked at LWP::Simple, but the mirror function looks like it is per-URL, not per-website?

      I could use wget, but I want to do it all in perl, in case wget is not available.

      Thanks.

        LWP ships with a number of example applications / utilities such as GET, POST etc. One of these utilities is lwp-mirror.

        From the documentation:

        This program can be used to mirror a document from a WWW server. The document is only transferred if the remote copy is newer than the local copy. If the local copy is newer nothing happens.

Node Type: perlquestion [id://443092]
Approved by ysth