Spidering websites

by Whitchman (Novice)
on Apr 09, 2002 at 02:38 UTC [id://157635]

Whitchman has asked for the wisdom of the Perl Monks concerning the following question:

Is there any way (stupid question, there's always a way) to have a Perl script start at one URL, follow the links on each page, and download certain files (like JPEGs over 16 KB)? I need it to be very specific about which pages it gets, like how far from the original URL it will go. I also need it to then sort the files into directories with the same structure it downloaded them from. Example: if a file came from "original_url/images/set1/image6.jpg", I would want it to go to something like "C:/images/set1/image6.jpg", and do that for all the images. Get it?
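
    (A minimal sketch of the kind of spider being asked for, using the
    LWP-family modules LWP::UserAgent, HTML::LinkExtor, and URI. The start
    URL, depth limit, size threshold, and C: target root are placeholder
    assumptions taken from the question, not a tested program.)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;
    use File::Basename qw(dirname);
    use File::Path qw(mkpath);

    my $start     = 'http://example.com/';  # placeholder start URL
    my $max_depth = 2;                      # how far from the original URL to go
    my $min_size  = 16 * 1024;              # only save JPEGs over 16 KB
    my $dest      = 'C:';                   # local root; the URL path is appended

    my $ua = LWP::UserAgent->new(agent => 'image-spider/0.1');
    my (%seen, @queue);
    push @queue, [ $start, 0 ];

    # breadth-first crawl, so each URL is first seen at its shallowest depth
    while (my $item = shift @queue) {
        my ($url, $depth) = @$item;
        next if $seen{$url}++ or $depth > $max_depth;

        my $res = $ua->get($url);
        next unless $res->is_success;

        if ($res->content_type eq 'image/jpeg') {
            next if length($res->content) < $min_size;
            # mirror the URL's path under $dest, creating directories as needed,
            # e.g. /images/set1/image6.jpg becomes C:/images/set1/image6.jpg
            my $file = $dest . URI->new($url)->path;
            mkpath(dirname($file));
            open my $fh, '>', $file or next;
            binmode $fh;
            print $fh $res->content;
            close $fh;
        }
        elsif ($res->content_type eq 'text/html') {
            # queue every link on the page, one level deeper,
            # as long as it stays under the start URL
            my $p = HTML::LinkExtor->new(sub {
                my ($tag, %attr) = @_;
                for my $link (values %attr) {
                    my $abs = URI->new_abs($link, $url)->as_string;
                    push @queue, [ $abs, $depth + 1 ] if $abs =~ /^\Q$start\E/;
                }
            });
            $p->parse($res->decoded_content);
            $p->eof;
        }
    }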

Replies are listed 'Best First'.
Re: Spidering websites
by tachyon (Chancellor) on Apr 09, 2002 at 03:28 UTC

    Link Checker is a web spider script and at this node the illustrious merlyn adds links to 4 spiders of his own. You should have little trouble modifying these scripts.

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Spidering websites
by Chmrr (Vicar) on Apr 09, 2002 at 04:23 UTC

    If you're looking at writing a spider, the WWW::Robot module should get some airtime. I used it with great success a while back to slurp the contents of a rather large and complex zoo of static pages into a dynamic engine. It's especially cool in my eyes because it uses HTML::TreeBuilder, which I also happen to like. (A rough sketch follows below.)

    perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'
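
    (A rough sketch of the WWW::Robot setup described above. The constructor
    keys, addHook method, and run call are as documented on CPAN, but the
    hook names, callback argument order, and example.com URLs here are
    assumptions; check the module's POD before relying on them.)

    use strict;
    use warnings;
    use WWW::Robot;

    # WWW::Robot requires the robot to identify itself
    my $robot = WWW::Robot->new(
        NAME    => 'site-slurper',        # assumed name
        VERSION => '0.01',
        EMAIL   => 'you@example.com',     # assumed contact address
    );

    # hook functions are called as ($robot, $hook_name, @args);
    # only follow links that stay on the site being slurped
    $robot->addHook('follow-url-test', sub {
        my ($robot, $hook, $url) = @_;
        return "$url" =~ m{^http://example\.com/};
    });

    # invoked with each fetched page; hand the contents to whatever
    # engine is ingesting them (here we just print a summary line)
    $robot->addHook('invoke-on-contents', sub {
        my ($robot, $hook, $url, $response) = @_;
        printf "%s (%d bytes)\n", $url, length $response->content;
    });

    $robot->run('http://example.com/');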

Re: Spidering websites
by CukiMnstr (Deacon) on Apr 09, 2002 at 03:00 UTC
    ...and if you *really* want to do it in Perl, you can check LWP::RobotUA (and the other LWP:: modules), then do a quick search here in the monastery to find some scripts that might guide you. (A minimal sketch follows below.)

    hope this helps,

    Update: Changed LWP::UserAgent to LWP::RobotUA, thanks belg4mit.

      Umm, better than that: LWP::RobotUA behaves itself (it honors robots.txt and throttles its own requests).

      --
      perl -pe "s/\b;([mnst])/'\1/mg"

Re: Spidering websites
by premchai21 (Curate) on Apr 09, 2002 at 02:53 UTC
