
Any spider framework?

by Anonymous Monk
on Jan 06, 2012 at 08:03 UTC
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I want to get all URLs like 'http://site/fixed_string/random_string.html' from one site. Where should I start?

Is there a spider framework that supports proxies, caching, and so on, that would suit my needs?

Or, if I start from scratch using LWP, is there a guide for writing a spider?
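Whichever crawler ends up collecting the pages, the filtering half of the question can be handled with a plain regex over the harvested URLs. A minimal sketch, with a hypothetical host name and paths standing in for the real site:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical example: keep only URLs of the form
#   http://example.com/fixed_string/<anything>.html
my $base    = 'http://example.com';
my $pattern = qr{^\Q$base\E/fixed_string/[^/]+\.html$};

# URLs a crawler might have collected (made up for illustration)
my @found = (
    "$base/fixed_string/abc123.html",     # wanted
    "$base/other_path/abc123.html",       # wrong path
    "$base/fixed_string/nested/deep.html",# too deep
);

my @wanted = grep { $_ =~ $pattern } @found;
print "$_\n" for @wanted;
```

The `\Q...\E` quoting keeps metacharacters in the base URL from being treated as regex syntax, and `[^/]+` rejects deeper paths; drop it for `.+` if nested paths should match too.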


Replies are listed 'Best First'.
Re: Any spider framework?
by tobyink (Abbot) on Jan 06, 2012 at 09:56 UTC

    WWW::Crawler::Lite should probably do the job. Its HTML parsing is somewhat naive, but should work in the majority of cases.

      You're right. I looked at the source and found this abomination:
      s{<a\s+.*?href\=(.*?)>(.*?)</a>}{ ... }isgxe;
      Ouch. There are so many ways that this can go wrong: "a" tags with a "name" and no href attribute, whitespace around the "=", ...

      There are modules written especially to extract links from HTML, for example HTML::LinkExtor and HTML::SimpleLinkExtor. Using one of those would have been a much safer approach.

      But at least this module takes "robots.txt" files into consideration, which is the polite thing to do, and probably one of the first things to go in a more naive approach. So that is good.
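For comparison, here is a small sketch of the HTML::LinkExtor approach suggested above. The markup is made up for illustration, and includes exactly the two cases the quoted regex mishandles: a named anchor with no href, and a link with whitespace around the "=":

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical snippet: a named anchor (no href) and a link
# with whitespace around "=" -- a real parser handles both.
my $html = '<a name="top">anchor</a> <a href = "/docs/x.html">doc</a>';

my $extor = HTML::LinkExtor->new;
$extor->parse($html);
$extor->eof;

my @hrefs;
for my $link ($extor->links) {
    my ($tag, %attr) = @$link;
    push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
}
print "$_\n" for @hrefs;   # only /docs/x.html
```

HTML::LinkExtor reports link attributes only, so the `name`-only anchor never shows up, and the whitespace around "=" is parsed correctly rather than silently skipped.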

        In the case of <a name="foo"> it simply won't match, as the regexp includes href. And you wouldn't want it to match, as it's not a link. Whitespace around the equals sign (which is rare, but valid) is more problematic. There are other edge cases which behave differently to how you might want them to as well - note that the first subcapture allows ">" to occur within it.

        But in practice, it's probably good enough to work for the majority of people.

        The author may well accept a patch to parse the page properly using HTML::Parser given that the module already has a dependency on that module (indirectly, via LWP::UserAgent).

        Or if you can't wait for a new fixed version to be released, just subclass it - it's only really that one method that's in major need of fixing.
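To make the whitespace edge case concrete, here is a self-contained demonstration of the quoted pattern silently dropping a valid link (the link markup is made up for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The pattern from the module, minus the /e substitution machinery:
my $naive = qr{<a\s+.*?href\=(.*?)>(.*?)</a>}is;

# A perfectly valid link with whitespace around "=" ...
my $spaced = '<a href = "/docs/x.html">doc</a>';

# ... is skipped, because the pattern requires a literal "href=":
if ($spaced =~ $naive) {
    print "matched: $1\n";
} else {
    print "missed a valid link\n";   # this branch runs
}
```

The same one-liner does match the more common `href="..."` form, which is presumably why the bug survives in practice.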

Re: Any spider framework?
by Anonymous Monk on Jan 07, 2012 at 04:51 UTC
Re: Any spider framework?
by spx2 (Deacon) on Jan 09, 2012 at 11:14 UTC

Node Type: perlquestion [id://946548]
Approved by Corion
Front-paged by toolic