vishi83 has asked for the wisdom of the Perl Monks concerning the following question:
I want to detect broken links by crawling a site with a Perl script. Which module can I use for this? I need help.
Thanks,
Vishy
Replies are listed 'Best First'.
Re: Detect Broken links
by CountZero (Bishop) on Oct 15, 2009 at 09:56 UTC
Do read the docs for HTML::SimpleLinkExtor, as it has a lot more functionality; for instance, you can choose which kinds of links to extract.
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
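The code from this reply is not preserved here. As a minimal sketch of the idea, and not CountZero's original, HTML::SimpleLinkExtor can list every link in a page or only those from a given tag. The sample HTML and the example.com URLs are assumptions made up for illustration.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical sample page; in practice the HTML would come from a fetch.
my $html = <<'HTML';
<html><body>
<a href="/about.html">About</a>
<a href="http://www.example.com/contact">Contact</a>
<img src="images/logo.png">
</body></html>
HTML

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse($html);

my @all_links = $extor->links;    # every link the parser found
my @anchors   = $extor->a;        # only links from <a> tags
my @images    = $extor->img;      # only links from <img> tags

print "all:    @all_links\n";
print "a:      @anchors\n";
print "images: @images\n";
```

The extractor only reports what is in the markup; it does not request the links, so it cannot say whether they are broken.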
by vishi83 (Pilgrim) on Oct 15, 2009 at 16:29 UTC
Using HTML::SimpleLinkExtor, can I get the HTTP status of the URL that was hit? Thanks, Vishy
A Perl script without 'strict' is like a house without a roof; both are not safe.
by CountZero (Bishop) on Oct 15, 2009 at 18:03 UTC
If you need more detailed information, you have to use LWP::UserAgent. Calling ->code on the response gives you the HTTP status code. If you replace ->code with ->message, you get a human-readable message instead of the three-digit code.
CountZero
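The snippet that accompanied this reply is missing, so here is a hedged sketch of what such a check could look like. The helper name check_url, the 10-second timeout, and the HEAD-then-GET fallback are my assumptions, not part of the original reply.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( timeout => 10 );

# Hypothetical helper: return the status code and message for one URL.
sub check_url {
    my ($url) = @_;
    my $response = $ua->head($url);

    # Some servers refuse HEAD, so retry with GET before reporting a failure.
    $response = $ua->get($url) unless $response->is_success;

    return ( $response->code, $response->message );
}

# Check every URL given on the command line.
for my $url (@ARGV) {
    my ( $code, $message ) = check_url($url);
    print "$code $message $url\n";
}
```

Run it as `perl check.pl http://www.example.com/` (the script name is arbitrary). Anything outside the 2xx range after redirects are followed is a candidate broken link.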
Re: Detect Broken links
by marto (Cardinal) on Oct 15, 2009 at 08:19 UTC
Depending on how complex the site in question is, you could use either LWP or WWW::Mechanize. If the site uses a lot of JavaScript, see "Using WWW::Selenium To Test Or Automate An Ajax Website". Martin
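For the WWW::Mechanize route, a rough sketch under some assumptions: the helper name broken_links is hypothetical, only the links of a single page are checked, and mailto: and javascript: links are skipped because they cannot be fetched.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 0 stops Mechanize from dying on a failed request.
my $mech = WWW::Mechanize->new( autocheck => 0 );

# Hypothetical helper: return [status, url] pairs for the links on one
# page that fail to load.
sub broken_links {
    my ($page) = @_;
    $mech->get($page);
    return unless $mech->success;

    # Collect the links first; later requests change the current page.
    my @links = $mech->links;

    my @broken;
    for my $link (@links) {
        my $url = $link->url_abs;
        next if $url =~ /^(?:mailto|javascript):/i;
        $mech->get($url);
        push @broken, [ $mech->status, "$url" ] unless $mech->success;
    }
    return @broken;
}

printf "%s %s\n", @$_ for map { broken_links($_) } @ARGV;
```

This fetches each target in full, which is simple but wasteful for large files.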
Re: Detect Broken links
by merlyn (Sage) on Oct 15, 2009 at 14:53 UTC
-- Randal L. Schwartz, Perl hacker
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Re: Detect Broken links
by hawtin (Prior) on Oct 15, 2009 at 08:36 UTC
I have a very old script that does this. It uses LWP to fetch the pages and parses the HTML with regexes (yes, I know better now, but it works). At the time I failed to identify any module that did what I needed, and later I added various bits and pieces to count words, list external links and so on, so rolling my own turned out to be the best way to go. Essentially it was:
Of course, that code is chopped out of a much larger script and is untested, but I think it should give you all the bits you need (well, except for extracting href values with a proper parser rather than a regex). Hope it helps.
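hawtin's script is not reproduced here. The following is a separate sketch of the same roll-your-own approach, with the regex swapped for HTML::SimpleLinkExtor as the post itself suggests. The names page_links and crawl are hypothetical, and the sketch assumes an absolute http(s) start URL and a site small enough to crawl in one run.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::SimpleLinkExtor;
use URI;

my $ua = LWP::UserAgent->new( timeout => 10 );

# Return the unique absolute http(s) links in a page, minus #fragments.
sub page_links {
    my ( $html, $base ) = @_;
    my $extor = HTML::SimpleLinkExtor->new;
    $extor->parse($html);
    my ( %seen, @links );
    for my $href ( $extor->a ) {
        my $uri = URI->new_abs( $href, $base );
        next unless $uri->scheme && $uri->scheme =~ /^https?$/;
        $uri->fragment(undef);
        push @links, $uri->as_string unless $seen{ $uri->as_string }++;
    }
    return @links;
}

# Crawl pages on the start host and report every link that fails,
# together with the page it was found on.
sub crawl {
    my ($start) = @_;
    my $host    = URI->new($start)->host;
    my @queue   = ( [ $start, '(start)' ] );
    my %checked;
    while ( my $item = shift @queue ) {
        my ( $url, $from ) = @$item;
        next if $checked{$url}++;
        my $response = $ua->get($url);
        unless ( $response->is_success ) {
            print "BROKEN ", $response->code, " $url (linked from $from)\n";
            next;
        }

        # External links are checked but not followed any further.
        next unless URI->new($url)->host eq $host;
        next unless $response->content_type eq 'text/html';
        push @queue, map { [ $_, $url ] }
            page_links( $response->decoded_content, $response->base );
    }
}

crawl( $ARGV[0] ) if @ARGV;
```

The %checked hash keeps the crawl from looping on pages that link to each other. A real run should also add a delay between requests and respect robots.txt.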