http://www.perlmonks.org?node_id=1012005

david2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I want to write an application that crawls a given page and all of its child pages (just one level deep).

I have the following requirements:

Do you know such a cpan module which can provide this functionality?
I saw on Google that some of these questions have been asked in the past, but I want to know whether there is a module which has all these features combined.

Thanks, David

Re: web crawler infrastructure
by marto (Cardinal) on Jan 07, 2013 at 09:57 UTC

    "Do you know such a cpan module which can provide this functionality?"

    There's no module on CPAN which matches all three criteria. There are also other considerations: for example, PDF files may simply be scanned images, meaning you'd have to OCR them to get the text. See WWW::Mechanize::Firefox, PDF::OCR2, and Super Search.

Re: web crawler infrastructure
by Anonymous Monk on Jan 07, 2013 at 09:55 UTC
Re: web crawler infrastructure
by space_monk (Chaplain) on Jan 07, 2013 at 11:32 UTC

    As marto pointed out, there is no module which does all of what you ask, but one of the best ways to regard Perl is as Lego: if you use the right modules (bricks), you can build anything you want. Actually, Lego is perhaps not quite the right analogy; with Perl you get prefabricated components that, when used properly, help you get your house built faster. :-)
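
    To make the "bricks" idea concrete, here is a rough sketch of a one-level crawl assembled from LWP::UserAgent, HTML::LinkExtor and URI (all from CPAN, not core Perl). It is only an illustration and not something posted in this thread: the names child_links and crawl_one_level are made up, and it assumes plain HTML pages, so it does nothing about PDFs, OCR, robots.txt or links generated by JavaScript.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

# Hypothetical helper: return the unique absolute http(s) URLs of all
# <a href="..."> links in $html, resolved against $base.
sub child_links {
    my ($html, $base) = @_;
    my $extor = HTML::LinkExtor->new;    # no callback: links are collected
    $extor->parse($html);
    $extor->eof;

    my (%seen, @urls);
    for my $link ($extor->links) {
        my ($tag, %attr) = @$link;       # [ tag, attr => value, ... ]
        next unless $tag eq 'a' && defined $attr{href};
        my $url = URI->new_abs($attr{href}, $base)->canonical;
        $url->fragment(undef);           # page#top and page are the same page
        next unless ($url->scheme // '') =~ /^https?$/;
        push @urls, "$url" unless $seen{"$url"}++;
    }
    return @urls;
}

# Hypothetical helper: fetch $start and every page it links to (one level).
# Returns a hash ref mapping URL => decoded page content.
sub crawl_one_level {
    my ($start) = @_;
    my $ua  = LWP::UserAgent->new(timeout => 10);
    my $res = $ua->get($start);
    die "Cannot fetch $start: " . $res->status_line . "\n"
        unless $res->is_success;

    my $html  = $res->decoded_content // '';
    my %pages = ($start => $html);
    for my $url (child_links($html, $res->base)) {
        next if exists $pages{$url};
        my $child = $ua->get($url);
        $pages{$url} = $child->decoded_content if $child->is_success;
    }
    return \%pages;
}

# Only crawl when run as a script with a start URL on the command line.
if (!caller && @ARGV) {
    my $pages = crawl_one_level($ARGV[0]);
    printf "%8d bytes  %s\n", length($pages->{$_} // ''), $_
        for sort keys %$pages;
}
```

    Run it as "perl crawl.pl http://example.com/" (the script name is arbitrary). Keeping the link extraction in its own function means the part that handles each fetched page can later be swapped for something else, such as PDF text extraction, without touching the crawl loop.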

    A Monk aims to give answers to those who have none, and to learn from those who know more.