PerlMonks  

Re: Validating Links

by swiftone (Curate)
on Jan 16, 2003 at 16:32 UTC (#227410)


in reply to Validating Links

If you are seeking to create a link-checker, just search here for "Link check" and enjoy the many responses. merlyn has written at least two articles on link checkers.

One way of doing it is to take an HTML page:

  • Run it through HTML::LinkExtor to get the links (you may want to filter out the image links and mailto: references). If the page uses relative links, pass the page's base URL to the constructor so they come back as absolute URLs (see the docs).
  • Use either LWP::Simple or LWP::UserAgent to check those links. If you are just checking for "liveness", a HEAD request will suffice (and be kinder to your bandwidth). If you are spidering, you'd want to do a GET request.
  • If you are spidering, you then do other steps such as adding new HTML pages to your queue of pages to check, watching the depth (how far from your original page you are), watching what server you're on so that you aren't trying to index the entire Internet, checking that you only index a given page once, respecting the rules given in robots.txt, etc.
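
As a rough sketch of the first two steps (extract, then HEAD-check), something like the following should work. The sub names extract_links and check_page are my own inventions, not part of any module, and the filtering choices (only <a href>, only http/https, fragments dropped) are assumptions you may want to change.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use LWP::UserAgent;

# Return the unique absolute http(s) URLs found in <a href="..."> tags.
# $base is the URL the page came from; when it is supplied, HTML::LinkExtor
# hands the callback absolute URI objects instead of raw attribute values.
sub extract_links {
    my ($html, $base) = @_;
    my %seen;
    my $extor = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            return unless $tag eq 'a';             # drops <img>, <link>, ...
            my $uri = $attr{href};
            return unless defined $uri;
            my $scheme = $uri->scheme || '';
            return unless $scheme =~ /^https?$/;   # drops mailto:, ftp:, ...
            $uri->fragment(undef);                 # page.html#top is page.html
            $seen{ $uri->as_string } = 1;
        },
        $base,
    );
    $extor->parse($html);
    $extor->eof;
    return sort keys %seen;
}

# Fetch one page, then send a HEAD request to every link on it.
sub check_page {
    my ($page_url) = @_;
    my $ua = LWP::UserAgent->new(timeout => 10);

    my $page = $ua->get($page_url);
    die "Cannot fetch $page_url: ", $page->status_line, "\n"
        unless $page->is_success;

    for my $link (extract_links($page->decoded_content, $page->base)) {
        my $res = $ua->head($link);
        printf "%-6s %s\n", ($res->is_success ? 'ok' : $res->code), $link;
    }
}

# Only touch the network when a URL is given on the command line.
check_page($ARGV[0]) if @ARGV;
```

Save it under any name you like and run it with the page to check as the only argument. Some servers refuse HEAD requests, so a link reported as failed may deserve a second try with GET before you call it dead.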
By and large, if you just want a simple link-checker, go ahead and roll your own; it's a good, simple learning experience. If you are trying to spider more than a page or two, you should probably not reinvent the wheel, so start with someone else's work.
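
To give a feel for what that existing work has to handle, here is a minimal sketch of the spider bookkeeping listed above: a queue, a depth limit, a same-server check, and a "seen" set. The crawl sub and its $get_links argument are hypothetical. $get_links stands in for whatever you use to fetch a page and pull out its absolute links, so the sketch itself never touches the network.

```perl
use strict;
use warnings;
use URI;

# Breadth-first crawl bookkeeping. $get_links is a caller-supplied code ref
# (hypothetical) that takes a URL and returns the absolute links on that page.
sub crawl {
    my ($start, $max_depth, $get_links) = @_;
    my $first = URI->new($start)->canonical;
    my $home  = lc $first->host;
    my %seen  = ($first->as_string => 1);     # index each page only once
    my @queue = ([$first->as_string, 0]);     # [url, distance from start page]
    my @visited;

    while (my $item = shift @queue) {
        my ($url, $depth) = @$item;
        push @visited, $url;
        next if $depth >= $max_depth;         # do not follow links past the limit

        for my $link ($get_links->($url)) {
            my $uri = URI->new($link);
            next unless ($uri->scheme || '') =~ /^https?$/;
            next unless lc($uri->host // '') eq $home;   # stay on one server
            $uri->fragment(undef);
            my $key = $uri->canonical->as_string;
            next if $seen{$key}++;
            push @queue, [$key, $depth + 1];
        }
    }
    return @visited;
}
```

This leaves out robots.txt entirely. If you fetch pages with LWP::RobotUA instead of LWP::UserAgent, it consults robots.txt and paces requests for you, which is one less thing to get wrong.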

The lwpcook document (perldoc lwpcook) has some basics, but it's best to figure out what you are trying to do and then look up how to do it, just as you don't cook by reading the cookbook cover to cover.

Be sure to read the threads that turn up in the search; this is territory that's been well covered.

