Any spider framework?

by Anonymous Monk
on Jan 06, 2012 at 08:03 UTC
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, all.
I want to get all URLs like 'http://site/fixed_string/random_string.html' from one site. Where should I start?

Is there a spider framework that supports proxies, caching, and so on, that would suit my needs?

Or, if I start from scratch using LWP, is there a guide to writing a spider?

Thanks.

Re: Any spider framework?
by tobyink (Abbot) on Jan 06, 2012 at 09:56 UTC

    WWW::Crawler::Lite should probably do the job. Its HTML parsing is somewhat naive, but should work in the majority of cases.
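    An untested sketch of how WWW::Crawler::Lite might be wired up for this question. The option names (url_pattern, on_link, on_response, delay_seconds) and the callback signatures are taken from the module's documentation as I recall it, so verify them against the version you install:

    use strict;
    use warnings;
    use WWW::Crawler::Lite;

    my %seen;
    my $crawler;
    $crawler = WWW::Crawler::Lite->new(
        agent         => 'MySpider/0.01',
        url_pattern   => qr{^http://site/},   # stay on the target site
        on_link       => sub {
            my ($from, $to, $text) = @_;
            return if $seen{$to}++;
            # Collect only URLs of the shape the question asks for.
            print "$to\n" if $to =~ m{^http://site/fixed_string/[^/]+\.html$};
        },
        on_response   => sub {
            my ($url, $res) = @_;
            # Stop after a sanity limit so a run cannot go on forever.
            $crawler->stop if keys(%seen) > 1_000;
        },
        delay_seconds => 1,    # be polite to the server
    );

    $crawler->crawl( url => 'http://site/' );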

      You're right. I looked at the source and found this abomination:
      s{<a\s+.*?href\=(.*?)>(.*?)</a>}{ ... }isgxe;
      Ouch. There are so many ways that this can go wrong: "a" tags with a "name" and no href attribute, whitespace around the "=", ...

      There are modules made especially to extract links from HTML, for example HTML::LinkExtor and HTML::SimpleLinkExtor. Using one of those would have been a much safer approach (a short sketch follows after this post).

      But at least this module takes "robots.txt" files into consideration, which is the polite thing to do, and probably one of the first things to go in a more naive approach. So that is good.
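      A minimal, untested sketch of the HTML::LinkExtor approach mentioned above, filtering for the URL shape from the original question (the start URL and the pattern are placeholders):

      use strict;
      use warnings;
      use LWP::UserAgent;
      use HTML::LinkExtor;
      use URI;

      my $start = 'http://site/';   # placeholder start page
      my $want  = qr{^http://site/fixed_string/[^/]+\.html$};

      my $ua  = LWP::UserAgent->new( agent => 'MySpider/0.01' );
      my $res = $ua->get($start);
      die $res->status_line unless $res->is_success;

      my @links;
      my $parser = HTML::LinkExtor->new(sub {
          my ($tag, %attr) = @_;
          return unless $tag eq 'a' and defined $attr{href};
          # Resolve relative links against the response base URL.
          my $abs = URI->new_abs($attr{href}, $res->base)->as_string;
          push @links, $abs if $abs =~ $want;
      });
      $parser->parse($res->decoded_content);

      print "$_\n" for @links;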

        In the case of <a name="foo"> it simply won't match, as the regexp includes href. And you wouldn't want it to match, as it's not a link. Whitespace around the equals sign (which is rare, but valid) is more problematic. There are other edge cases which behave differently to how you might want them to as well - note that the first subcapture allows ">" to occur within it.

        But in practice, it's probably good enough to work for the majority of people.

        The author may well accept a patch to parse the page properly using HTML::Parser, given that the module already has a dependency on that module (indirectly, via LWP::UserAgent); a rough sketch of that idea follows below.

        Or if you can't wait for a new fixed version to be released, just subclass it - it's only really that one method that's in major need of fixing.
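        An untested sketch of what parsing with HTML::Parser instead of a regexp might look like; the extract_links helper is purely illustrative and is not a patch against WWW::Crawler::Lite's actual internals:

        use strict;
        use warnings;
        use HTML::Parser;

        # Extract href attributes from <a> tags with a real parser
        # instead of a hand-rolled regexp.
        sub extract_links {
            my ($html) = @_;
            my @hrefs;
            my $p = HTML::Parser->new(
                api_version => 3,
                start_h     => [
                    sub {
                        my ($tagname, $attr) = @_;
                        push @hrefs, $attr->{href}
                            if $tagname eq 'a' and defined $attr->{href};
                    },
                    'tagname, attr',
                ],
            );
            $p->parse($html);
            $p->eof;
            return @hrefs;
        }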

Re: Any spider framework?
by Anonymous Monk on Jan 07, 2012 at 04:51 UTC
Re: Any spider framework?
by spx2 (Chaplain) on Jan 09, 2012 at 11:14 UTC
