If you want to write a link checker, just search here for "Link check" and enjoy the many responses. merlyn has written at least two articles on link checkers.
One way to do it, starting from an HTML page:
- Run it through HTML::LinkExtor to get the links (you may want to filter out the image links and mailto: references). Don't forget to supply the base URL (server and base path) if the page needs it to resolve relative links; see the HTML::LinkExtor docs.
- Use either LWP::Simple or LWP::UserAgent to check those links. If you are just checking for "liveness", a HEAD request will suffice (and be kinder to your bandwidth). If you are spidering, you'd want to do a GET request.
- If you are spidering, you then do other steps such as adding new HTML pages to your queue of pages to check, watching the depth (how far from your original page you are), watching what server you're on so that you aren't trying to index the entire Internet, checking that you only index a given page once, respecting the rules given in robots.txt, etc.
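The first two steps might look something like the sketch below. Treat it as a rough starting point rather than a finished checker: the sub names extract_links and check_links are my own, the 10-second timeout is arbitrary, and it only looks at <a href> links over http/https. The page to check is taken from the command line.

```perl
use strict;
use warnings;
use HTML::LinkExtor;
use LWP::UserAgent;

# Return the unique absolute http(s) URLs found in <a href="..."> attributes.
# $base is the URL the page came from; LinkExtor uses it to resolve
# relative links and hands back URI objects.
sub extract_links {
    my ($html, $base) = @_;
    my $parser = HTML::LinkExtor->new(undef, $base);   # no callback: links accumulate
    $parser->parse($html);
    $parser->eof;

    my (%seen, @urls);
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' && defined $attr{href};          # skips <img>, <link>, ...
        my $url = $attr{href};
        next unless $url->scheme && $url->scheme =~ /^https?$/;  # drops mailto: and friends
        $url->fragment(undef);                                   # page#top is the same page
        push @urls, "$url" unless $seen{"$url"}++;
    }
    return @urls;
}

# HEAD each URL and return a hash of URL => status line.
sub check_links {
    my ($ua, @urls) = @_;
    my %status;
    for my $url (@urls) {
        my $res = $ua->head($url);
        $status{$url} = $res->status_line;
    }
    return %status;
}

# Only touch the network when a page URL is given on the command line.
if (my $page = shift @ARGV) {
    my $ua  = LWP::UserAgent->new(timeout => 10);
    my $res = $ua->get($page);
    die "Can't fetch $page: ", $res->status_line, "\n" unless $res->is_success;

    my %status = check_links($ua, extract_links($res->decoded_content, $res->base));
    print "$status{$_}  $_\n" for sort keys %status;
}
```

Some servers answer a HEAD with an error even though a GET would work, so you may want to retry failures with a GET before reporting a link as dead.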
By and large, if you just want a simple link checker, go ahead and roll your own; it's a good, simple learning experience. If you are trying to spider more than a page or two, you should probably not reinvent the wheel, so start with someone else's work.
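To give a feel for how much bookkeeping the spider case adds, here is a rough sketch of just the queue handling: depth limit, staying on one server, and visiting each page once. The sub name crawl and the $fetch_links callback are hypothetical; the callback is assumed to take a URL and return the absolute links found on that page. Fetching, parsing, and robots.txt are left out entirely.

```perl
use strict;
use warnings;
use URI;

# Breadth-first crawl bookkeeping. Returns the URLs visited, in order.
sub crawl {
    my ($start, $max_depth, $fetch_links) = @_;
    my $host  = lc URI->new($start)->host;   # the one server we stay on
    my %seen  = ($start => 1);               # pages already queued
    my @queue = ([$start, 0]);               # [url, depth] pairs
    my @visited;

    while (my $item = shift @queue) {
        my ($url, $depth) = @$item;
        push @visited, $url;
        next if $depth >= $max_depth;        # too far from the start page to expand

        for my $link ($fetch_links->($url)) {
            my $uri = URI->new($link);
            # Skip schemes without a host (mailto: etc.) and other servers.
            next unless $uri->can('host') && defined $uri->host
                     && lc $uri->host eq $host;
            next if $seen{$link}++;          # queue each page only once
            push @queue, [$link, $depth + 1];
        }
    }
    return @visited;
}
```

For the robots.txt part, LWP::RobotUA is a subclass of LWP::UserAgent that fetches and honors robots.txt for you, so it is worth a look before writing that yourself.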
perldoc lwpcook has some basics, but it's best to figure out what you are trying to do and then look up how to do it, just as you don't cook by reading the cookbook cover to cover.
Be sure to read the threads that turn up in the search; this is territory that's been well covered.