Any spider framework?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Any spider framework? by tobyink (Canon) on Jan 06, 2012 at 09:56 UTC
WWW::Crawler::Lite should probably do the job. Its HTML parsing is somewhat naive, but should work in the majority of cases.
Re^2: Any spider framework? by bart (Canon) on Jan 06, 2012 at 12:29 UTC
You're right. I looked at the source and found this abomination: `s{<a\s+.*?href\=(.*?)>(.*?)</a>}{ ... }isgxe;` Ouch. There are so many ways that this can go wrong: "a" tags with a "name" and no href attribute, whitespace around the "=", ... There are modules made especially to extract links from HTML, for example HTML::LinkExtor and HTML::SimpleLinkExtor. Using one of those would have been a much safer approach. But at least this module takes "robots.txt" files into consideration, which is the polite thing to do, and probably one of the first things to go in a more naive approach. So that is good.
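For illustration, here is a minimal sketch of link extraction with HTML::LinkExtor, one of the modules suggested above. The sample HTML and the helper name `extract_links` are made up for this example and are not part of WWW::Crawler::Lite. The callback interface (tag name followed by attribute/value pairs) is the one documented for HTML::LinkExtor.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample page, not taken from the thread.
my $html = <<'HTML';
<a name="top">Top</a>
<a href = "/about">About</a>
<img src="logo.png">
HTML

# extract_links is a made-up helper: it returns the href values of
# all <a> tags found in the given HTML string.
sub extract_links {
    my ($html) = @_;
    my @links;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        # The callback also fires for img, link, etc.; keep only <a href>.
        push @links, "$attr{href}" if $tag eq 'a' && defined $attr{href};
    });
    $parser->parse($html);
    $parser->eof;
    return @links;
}

print "$_\n" for extract_links($html);
```

Because the parsing is done by HTML::Parser, the anchor with only a "name" attribute is skipped and the whitespace around the "=" is not a problem. Passing a base URL as the second argument to `new` should give absolute URLs instead of the raw attribute values.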
Re^3: Any spider framework? by tobyink (Canon) on Jan 06, 2012 at 12:51 UTC
In the case of `<a name="foo">` it simply won't match, as the regexp includes href. And you wouldn't want it to match, as it's not a link. Whitespace around the equals sign (which is rare, but valid) is more problematic. There are other edge cases which behave differently to how you might want them to as well - note that the first subcapture allows ">" to occur within it. But in practice, it's probably good enough to work for the majority of people. The author may well accept a patch to parse the page properly using HTML::Parser, given that the module already has a dependency on that module (indirectly, via LWP::UserAgent). Or if you can't wait for a new fixed version to be released, just subclass it - it's only really that one method that's in major need of fixing.
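A rough sketch of how some of these cases behave, using only core Perl. The pattern below has the same shape as the one quoted from the module, but its exact quantifiers are an assumption, and the replacement part is reduced to a plain capture. `naive_hrefs` and the test strings are made up for this example.

```perl
use strict;
use warnings;

# Assumed reconstruction of the quoted pattern; only the match side is used.
my $link_re = qr{<a\s+.*?href\=(.*?)>(.*?)</a>}isx;

# naive_hrefs is a hypothetical helper: it returns whatever the first
# capture group grabs for each match in the given HTML string.
sub naive_hrefs {
    my ($html) = @_;
    my @found;
    while ($html =~ /$link_re/g) {
        push @found, $1;
    }
    return @found;
}

my @cases = (
    '<a name="foo">anchor</a>',        # no href at all
    '<a href = "/about">About</a>',    # whitespace around "="
    '<a href="/x" class="c">X</a>',    # attribute after href
);

for my $html (@cases) {
    my @hrefs = naive_hrefs($html);
    my $shown = @hrefs ? join(', ', map { "[$_]" } @hrefs) : '(no match)';
    printf "%-32s => %s\n", $html, $shown;
}
```

The first two cases do not match at all, so the valid link with spaces around the "=" is silently dropped. In the third case the capture is `"/x" class="c"`, quotes and trailing attribute included, so the caller still has to clean up the value before it can be used as a URL.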
Re^4: Any spider framework? by jdrago999 (Pilgrim) on Jan 08, 2012 at 04:54 UTC
Re^4: Any spider framework? by bart (Canon) on Jan 10, 2012 at 08:07 UTC
Re^4: Any spider framework? by jdrago999 (Pilgrim) on Jan 08, 2012 at 06:40 UTC
Re: Any spider framework? by Anonymous Monk on Jan 07, 2012 at 04:51 UTC
You want WWW::Mechanize.
Re: Any spider framework? by spx2 (Deacon) on Jan 09, 2012 at 11:14 UTC
Also check out Gungho and Web::Scraper (have a look at this presentation).

