PerlMonks  

WWW::Mechanize treading lightly

by perlmonkey2 (Beadle)
on Aug 03, 2006 at 14:41 UTC (#565470, perlquestion)
perlmonkey2 has asked for the wisdom of the Perl Monks concerning the following question:

This isn't so much about Mechanize as it is about being gentle with webservers. If I'm just spidering a domain looking for text, and I'm examining links to decide where to go next, would this be the best way to fetch only HTML while being gentle on the webserver?

1. Regex the links for non-HTML extensions, keeping everything else as a possibility.
2. HEAD $url to check for text/html and the charset info.
3. GET, if is_html and the charset is good (or there is no charset).

Instead of doing the HEAD, I could just do a GET and check the header info and body for good HTML, but I thought just getting the header would be easier on the webserver, even if it meant two separate connections had to be made.
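For illustration only, steps 2 and 3 of the question might be sketched as below. This is not the poster's code: it uses the core HTTP::Tiny module in place of WWW::Mechanize so it runs without CPAN dependencies, and the names `parse_content_type`, `looks_like_wanted_html`, `fetch_if_html` and the `%OK_CHARSET` list are hypothetical choices, not anything from the thread.

```perl
use strict;
use warnings;
use HTTP::Tiny;    # core since Perl 5.14; stands in for WWW::Mechanize here

# Split a Content-Type header into (media type, charset); either may be undef.
sub parse_content_type {
    my ($header) = @_;
    # HTTP::Tiny gives an array ref when a header is repeated; use the first.
    $header = $header->[0] if ref $header eq 'ARRAY';
    return (undef, undef) unless defined $header;
    my ($type)    = $header =~ m{^\s*([^;\s]+)};
    my ($charset) = $header =~ m{;\s*charset\s*=\s*"?([^";\s]+)"?}i;
    return (
        defined $type    ? lc($type)    : undef,
        defined $charset ? lc($charset) : undef,
    );
}

# Assumed policy: "good" charsets are whichever ones the spider can decode.
my %OK_CHARSET = map { $_ => 1 } qw(utf-8 iso-8859-1 us-ascii windows-1252);

# True for text/html with no charset, or with a charset on the list above.
sub looks_like_wanted_html {
    my ($header) = @_;
    my ($type, $charset) = parse_content_type($header);
    return 0 unless defined $type && $type eq 'text/html';
    return 1 unless defined $charset;
    return $OK_CHARSET{$charset} ? 1 : 0;
}

# HEAD first; GET only when the headers describe HTML we want.
# Returns the body on success, or nothing otherwise.
sub fetch_if_html {
    my ($http, $url) = @_;
    my $head = $http->head($url);
    return unless $head->{success};
    return unless looks_like_wanted_html($head->{headers}{'content-type'});

    my $res = $http->get($url);
    return unless $res->{success};
    # Re-check, since a server may answer HEAD and GET differently.
    return unless looks_like_wanted_html($res->{headers}{'content-type'});
    return $res->{content};
}

# Usage (the agent string is a placeholder):
#   my $http = HTTP::Tiny->new(agent => 'my-spider/0.1', timeout => 10);
#   my $html = fetch_if_html($http, 'http://example.com/');
```

The trade-off raised in the question still applies: every accepted page costs two requests, so this only saves the server work when a fair share of candidate URLs turn out not to be HTML.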

Re: WWW::Mechanize treading lightly
by andyford (Curate) on Aug 03, 2006 at 16:04 UTC

    Since the decision between GET and HEAD would be very website-dependent, looking for non-HTML extensions would seem the way to go.

    Judicious use of sleep between requests would also be a simple way to be gentle.
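A minimal sketch of the sleep idea, assuming a fixed minimum gap between requests. The helper name `make_throttle` and the two-second figure in the usage comment are made up for illustration; a real spider might also honor robots.txt and any crawl delay the site requests.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);    # core; allows fractional seconds

# Returns a closure that blocks until at least $min_gap seconds have
# passed since the previous call. The first call never sleeps.
sub make_throttle {
    my ($min_gap) = @_;
    my $last = 0;
    return sub {
        my $wait = $last + $min_gap - time();
        sleep($wait) if $wait > 0;
        $last = time();
        return;
    };
}

# Usage: call the closure before every request.
#   my $pause = make_throttle(2);
#   for my $url (@queue) {
#       $pause->();
#       # ... fetch $url here ...
#   }
```

Measuring from the previous request, rather than sleeping a flat amount, means time spent parsing a page counts toward the gap.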

Re: WWW::Mechanize treading lightly
by davidrw (Prior) on Aug 03, 2006 at 16:13 UTC
    How can you make _any_ decision based upon the extension of the URL?! Any of these (and many others) could produce HTML; you really need to look at the Content-Type.
    http://perlmonks.org/?parent=565470;node_id=3333
    http://example.com
    http://example.com/blah.html
    http://example.com/blah.foo
    http://example.com/blah.htm
    http://example.com/blah.php
    http://example.com/blah.cgi
    http://example.com/blah.pl
    http://example.com/blah.asp
    http://example.com/blah/foo/
    http://example.com/blah.exe   # even this, if someone so configured the web server
      Right, but I was thinking that you could at least drop .mp3, .gif, .jpeg just for an easy first cut, no?
        That is exactly what I was thinking. Any site whose webserver is configured to spit out HTML from .jpg extensions isn't one I want to bother with. Extensions serve a purpose, and while they can be abused, that abuse would negate my need to see their text. Thanks for the input.
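The "easy first cut" could look something like the sketch below. It only drops the extensions named in the thread (.mp3, .gif, .jpg, .jpeg) and keeps everything else, including extensionless URLs and oddities like .exe, for the Content-Type check to decide. The function name `worth_checking` and the `%SKIP_EXT` table are hypothetical.

```perl
use strict;
use warnings;

# Assumed blocklist, taken from the extensions mentioned in the thread.
my %SKIP_EXT = map { $_ => 1 } qw(mp3 gif jpg jpeg);

# Returns 0 for URLs whose path ends in a blocklisted extension, 1 otherwise.
sub worth_checking {
    my ($url) = @_;
    # Drop the query string and fragment so "?x=a.gif" does not confuse us.
    (my $path = $url) =~ s{[?#].*}{}s;
    # Drop scheme and host so "example.com" is not read as an extension.
    $path =~ s{^[a-z][a-z0-9+.-]*://[^/]*}{}i;
    my ($ext) = $path =~ m{\.([A-Za-z0-9]+)$};
    return 1 unless defined $ext;
    return $SKIP_EXT{ lc($ext) } ? 0 : 1;
}

# Usage:
#   my @candidates = grep { worth_checking($_) } @links;
```

A blocklist errs on the side of keeping links, which matches the question's step 1: anything not obviously binary stays a possibility.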
