perlmonkey2 has asked for the wisdom of the Perl Monks concerning the following question:
This isn't so much about Mechanize as it is about being gentle with web servers. If I'm just spidering a domain looking for text, and examining links to decide where to go next, would this be the best way to fetch only HTML while being gentle to the web server:
1. Regex the links for non-html extensions, keeping everything else as a possibility.
2. HEAD $url and check the Content-Type for text/html and charset info; proceed only if it is HTML with an acceptable charset (or no charset).
3. GET the URL.
Instead of doing the HEAD, I could just do a GET and check the header info and body for good HTML, but I thought fetching just the headers would be easier on the web server, even if it meant two separate connections had to be made.
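The three steps above can be sketched with WWW::Mechanize roughly as follows. This is only a sketch: the `GentleSpider` agent string, the helper names, and the extension/charset lists are illustrative choices, not anything prescribed by the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Step 1: cheap regex filter -- skip links whose path ends in an obviously
# non-HTML extension before touching the network at all.
# (The extension list is illustrative, not exhaustive.)
sub looks_non_html {
    my ($url) = @_;
    return $url =~ m{\.(?:jpe?g|png|gif|pdf|zip|gz|css|js|ico)(?:[?#]|$)}i;
}

my $mech = WWW::Mechanize->new( autocheck => 0 );
$mech->agent('GentleSpider/0.1');    # identify yourself to the server

# Steps 2-3: HEAD first, so only the headers cross the wire; GET only if
# the response says text/html with an acceptable (or absent) charset.
sub fetch_if_html {
    my ($url) = @_;
    return if looks_non_html($url);

    my $head = $mech->head($url);    # head() is inherited from LWP::UserAgent
    return unless $head->is_success;

    my $ct = $head->header('Content-Type') || '';
    return unless $ct =~ m{^text/html\b}i;

    my ($charset) = $ct =~ /charset=["']?([\w.:-]+)/i;
    return if defined $charset
           && $charset !~ /^(?:utf-8|iso-8859-1|us-ascii)$/i;

    sleep 1;                         # be gentle: pause between requests
    return $mech->get($url);
}
```

Passing `autocheck => 0` keeps Mechanize from dying on failed requests, so a bad HEAD just skips the link rather than killing the spider.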
Replies are listed 'Best First'.
Re: WWW::Mechanize treading lightly
by davidrw (Prior) on Aug 03, 2006 at 16:13 UTC
by andyford (Curate) on Aug 03, 2006 at 16:22 UTC
by perlmonkey2 (Beadle) on Aug 03, 2006 at 17:09 UTC
Re: WWW::Mechanize treading lightly
by andyford (Curate) on Aug 03, 2006 at 16:04 UTC