web crawler infrastructure

david2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I want to write an application that crawls a given page and all of its children (just one level deep).

I have the following requirements:

JavaScript handling. For example, some links run JavaScript code that opens a new window, and I want to parse the page that opens.
PDF, Word and PPT parsing.
Authentication by cookies. Some pages require you to log in first, after which the cookie authenticates you on all subsequent clicks.

Do you know such a cpan module which can provide this functionality?
I saw on Google that some of these questions have been asked before, but I want to know whether there is a module that has all of these features combined.

Thanks, David

Re: web crawler infrastructure by marto (Cardinal) on Jan 07, 2013 at 09:57 UTC
"Do you know such a cpan module which can provide this functionality?" There's no module on CPAN which matches all three criteria. There are also other considerations: for example, PDF files may simply be scanned images, meaning you'd have to OCR them to get the text. See WWW::Mechanize::Firefox, PDF::OCR2 and Super Search.
Re: web crawler infrastructure by Anonymous Monk on Jan 07, 2013 at 09:55 UTC
Why sure, Web::Magic, maybe pdf/word... plugins
Re: web crawler infrastructure by space_monk (Chaplain) on Jan 07, 2013 at 11:32 UTC
As marto pointed out, there is no module which does all of what you ask, but one of the best ways to regard Perl is like Lego: if you use the right modules (bricks) you can build anything you want. Actually, Lego is perhaps not quite the right analogy; with Perl you get prefabricated components that, when used properly, help you get your house built faster. :-)
A Monk aims to give answers to those who have none, and to learn from those who know more.
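
To make the building-block idea a little more concrete, here is a minimal sketch of how the cookie login and the one-level crawl might be put together with WWW::Mechanize. It is not from any of the replies above, and several things in it are assumptions: the sub names (classify, crawl_one_level), the form field names (username, password) and the URLs are all hypothetical and would need to match the real site. Plain WWW::Mechanize does not run JavaScript, so links that only work through script would still need something like the WWW::Mechanize::Firefox module marto mentioned, and the PDF/Word/PPT files are only identified here, not parsed.

```perl
use strict;
use warnings;

# Map a Content-Type to the kind of parser the document would need.
# Only the dispatch is sketched; the parsers themselves are not included.
sub classify {
    my ($type) = @_;
    $type = lc( $type // '' );
    return 'html' if $type =~ m{^(?:text/html|application/xhtml\+xml)};
    return 'pdf'  if $type eq 'application/pdf';
    return 'word' if $type =~ m{msword|wordprocessingml};
    return 'ppt'  if $type =~ m{ms-powerpoint|presentationml};
    return 'other';
}

# Log in once, then fetch the start page and each page it links to.
# Returns a hash ref of { child URL => document kind }.
sub crawl_one_level {
    my (%arg) = @_;

    require WWW::Mechanize;    # loaded here so classify() works without it

    # WWW::Mechanize keeps a cookie jar by default, so the session cookie
    # set at login is sent along with every later request.
    my $mech = WWW::Mechanize->new( autocheck => 1 );

    $mech->get( $arg{login_url} );
    $mech->submit_form( with_fields => $arg{credentials} );

    $mech->get( $arg{start_url} );

    # Collect absolute http(s) links; javascript: and mailto: links are skipped.
    my %seen;
    my @children = grep { !$seen{$_}++ }
                   map  { $_->as_string }
                   grep { ( $_->scheme // '' ) =~ /^https?$/ }
                   map  { $_->url_abs } $mech->links;

    my %found;
    for my $url (@children) {
        my $ok = eval { $mech->get($url); 1 };    # skip children that fail
        next unless $ok;
        $found{$url} = classify( $mech->ct );
    }
    return \%found;
}

# Usage: perl crawl.pl LOGIN_URL START_URL
if ( @ARGV >= 2 ) {
    my $found = crawl_one_level(
        login_url   => $ARGV[0],
        start_url   => $ARGV[1],
        # Hypothetical form field names and values.
        credentials => { username => 'user', password => 'secret' },
    );
    print "$found->{$_}\t$_\n" for sort keys %$found;
}
```

Keeping the content-type dispatch in its own sub means each document kind can later be handed to whichever parsing module you settle on without touching the crawl loop.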
