
WWW::Mechanize treading lightly

by perlmonkey2 (Beadle)
on Aug 03, 2006 at 14:41 UTC ( #565470=perlquestion )
perlmonkey2 has asked for the wisdom of the Perl Monks concerning the following question:

This isn't so much about Mechanize as it is about being gentle with webservers. If I'm just spidering a domain looking for text, then when examining links to decide where to go next, would this be the best way to fetch only HTML and be gentle to the webserver?

1. Regex the links for non-HTML extensions, keeping everything else as a possibility.
2. HEAD $url and check for text/html and the charset info.
3. GET the page if is_html and the charset is good (or no charset is given).

Instead of doing the HEAD, I could just do a GET and check the header info and body for good HTML, but I thought just getting the header would be easier on the webserver, even if it meant two separate connections had to be made.
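Steps 2 and 3 of the question could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain LWP::UserAgent (WWW::Mechanize is a subclass, so the same `head` and `get` calls should work on a `$mech` object), and the charset whitelist, the agent string, and the helper names `wanted_by_headers` and `fetch_if_html` are made up for the example.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical whitelist of charsets the spider is prepared to handle.
my %OK_CHARSET = map { $_ => 1 } qw(UTF-8 ISO-8859-1 US-ASCII);

# Decide from the response headers alone whether the body is worth having.
sub wanted_by_headers {
    my ($response) = @_;
    return 0 unless $response->is_success;
    return 0 unless $response->content_is_html;      # text/html or XHTML
    my $charset = $response->content_type_charset;   # undef when not declared
    return 1 unless defined $charset;                # "no charset" is acceptable
    return $OK_CHARSET{ uc $charset } ? 1 : 0;
}

# HEAD first; GET only when the headers look promising.
sub fetch_if_html {
    my ($ua, $url) = @_;
    my $head = $ua->head($url);
    return unless wanted_by_headers($head);
    my $page = $ua->get($url);
    return unless wanted_by_headers($page);          # HEAD and GET can disagree
    return $page->decoded_content;
}

# Assumed agent string and timeout; identify your spider honestly.
my $ua = LWP::UserAgent->new(agent => 'polite-spider/0.1', timeout => 15);
# my $html = fetch_if_html($ua, 'http://example.com/');
```

One caveat with this design: some servers answer HEAD with an error even though GET would succeed, so a spider built this way will silently skip those pages unless it falls back to GET.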

Replies are listed 'Best First'.
Re: WWW::Mechanize treading lightly
by davidrw (Prior) on Aug 03, 2006 at 16:13 UTC
    How can you make _any_ decision based upon the extension of the url?!?!? Any of these (and many others) could produce html ... you really need to look for the Content-Type.
    ;node_id=3333 # even this, if someone so configured the web server
      Right, but I was thinking that you could at least drop .mp3, .gif, .jpeg just for an easy first cut, no?
        That is exactly what I was thinking. Anyone who has their webserver configured to spit out HTML from .jpg extensions isn't running a site I want to bother with. Extensions serve a purpose, and while they can be abused, that abuse would negate my need to see their text. Thanks for the input.
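The "easy first cut" on extensions discussed in this sub-thread might look something like the sketch below. The skip list and the function name `obviously_not_html` are assumptions for illustration; anything not on the list, including extensionless URLs and CGI-style ones, is kept as a candidate for a later Content-Type check.

```perl
use strict;
use warnings;

# Hypothetical skip list; extend it to suit the sites being crawled.
my @SKIP_EXT = qw(mp3 gif jpg jpeg png pdf zip gz exe css js ico avi mov);
my $SKIP_RE  = do {
    my $alt = join '|', @SKIP_EXT;
    qr/\.(?:$alt)$/i;
};

# Returns 1 when the URL ends in an extension we never want, else 0.
sub obviously_not_html {
    my ($url) = @_;
    (my $path = $url) =~ s/[?#].*//s;   # drop query string and fragment
    return $path =~ $SKIP_RE ? 1 : 0;
}

my @links = qw(
    http://example.com/
    http://example.com/about.html
    http://example.com/song.mp3
    http://example.com/logo.GIF
    http://example.com/index.pl?node_id=3333
);

# Keep everything that is not obviously a media or binary file.
my @candidates = grep { !obviously_not_html($_) } @links;
print "$_\n" for @candidates;
```

With the sample links above, the .mp3 and .GIF URLs are dropped and the other three survive. This is only a filter against wasted requests; it says nothing about whether the survivors are really HTML.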
Re: WWW::Mechanize treading lightly
by andyford (Curate) on Aug 03, 2006 at 16:04 UTC

    Since the decision between GET/HEAD would be very website-dependent, looking for non-html extensions would seem the way to go.

    Judicious use of sleep between requests would also be a simple way to be gentle.
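The sleep suggestion could be sketched as a small per-host throttle. The two-second gap, the `%last_hit` table, and the helper names are assumptions for the example, not a recommendation for any particular site.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical minimum gap between requests to the same host, in seconds.
my $MIN_GAP = 2;
my %last_hit;   # host => epoch time of the previous request

# How many seconds we still need to wait before hitting $host at time $now.
sub seconds_to_wait {
    my ($host, $now) = @_;
    my $last = $last_hit{$host};
    return 0 unless defined $last;          # first request to this host
    my $remaining = $MIN_GAP - ($now - $last);
    return $remaining > 0 ? $remaining : 0;
}

# Sleep if needed, then record the time of this request.
sub wait_politely {
    my ($host) = @_;
    my $pause = seconds_to_wait($host, time);
    sleep $pause if $pause > 0;
    $last_hit{$host} = time;
    return $pause;
}

# Usage: call wait_politely('example.com') just before each request to that host.
```

If a packaged solution is preferred, LWP::RobotUA enforces a delay between requests and consults robots.txt; note that its `delay` setting is expressed in minutes.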
