web crawler infrastructure

by david2008 (Scribe)
on Jan 07, 2013 at 09:53 UTC
david2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I want to write an application that crawls a given page and all of its child pages (just one level deep).

I have the following requirements:

  • JavaScript handling. For example, some links run JavaScript code that opens a new window, and I want to parse the page that opens.
  • PDF, Word, and PowerPoint (ppt) parsing.
  • Authentication by cookies. Some pages require you to log in first; after that, the cookie authenticates you on every subsequent click.

Do you know such a cpan module which can provide this functionality?
A Google search shows that some of these questions have been asked before, but I want to know whether there is a single module that combines all of these features.

Thanks, David

Replies are listed 'Best First'.
Re: web crawler infrastructure
by marto (Archbishop) on Jan 07, 2013 at 09:57 UTC

    "Do you know such a cpan module which can provide this functionality?"

    There's no module on CPAN which matches all three criteria. There are also other considerations: for example, PDF files may simply be scanned images, meaning you'd have to OCR them to get the text. See WWW::Mechanize::Firefox, PDF::OCR2, and Super Search.

Re: web crawler infrastructure
by Anonymous Monk on Jan 07, 2013 at 09:55 UTC
Re: web crawler infrastructure
by space_monk (Chaplain) on Jan 07, 2013 at 11:32 UTC

    As marto pointed out, there is no module which does all of what you ask, but one of the best ways to regard Perl is like Lego: if you use the right modules (bricks), you can build anything you want. Actually, Lego is perhaps not quite the right analogy; with Perl you get prefabricated components that, when used properly, help you get your house built faster. :-)

    A Monk aims to give answers to those who have none, and to learn from those who know more.
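
To make the "bricks" idea a little more concrete, here is a minimal sketch of the glue code that could sit between them. It is only an illustration under stated assumptions, not a recommendation from anyone in this thread: the names `classify_link` and `plan_crawl` are hypothetical, the example URLs are made up, and the actual fetching (for instance with one of the modules marto mentioned) and the document parsing are deliberately left out. It uses core Perl only and shows one way to take the child links of a start page, visit each one once (one level deep), and decide which kind of parser each link would need.

```perl
use strict;
use warnings;

# Hypothetical helper: guess which kind of parser a link needs,
# based only on the file extension in the URL.
sub classify_link {
    my ($url) = @_;

    # Drop any query string or fragment before checking the extension.
    (my $path = $url) =~ s/[?#].*//s;

    return 'pdf'  if $path =~ /\.pdf\z/i;
    return 'word' if $path =~ /\.docx?\z/i;
    return 'ppt'  if $path =~ /\.pptx?\z/i;
    return 'html';    # Assumption: everything else is an ordinary page.
}

# Hypothetical helper: build a one-level crawl plan in which each
# distinct child link appears once, tagged with its type.
sub plan_crawl {
    my (@links) = @_;
    my (%seen, @plan);
    for my $url (@links) {
        next if $seen{$url}++;    # Skip links we have already queued.
        push @plan, { url => $url, type => classify_link($url) };
    }
    return @plan;
}

# Made-up child links, as they might be collected from the start page.
my @children = (
    'http://example.com/report.pdf',
    'http://example.com/slides.PPTX?download=1',
    'http://example.com/notes.doc',
    'http://example.com/about',
    'http://example.com/report.pdf',    # Duplicate, will be skipped.
);

printf "%-5s %s\n", $_->{type}, $_->{url} for plan_crawl(@children);
```

Guessing the type from the extension is a simplification; a real crawler would more likely check the Content-Type header of the response, and, as marto notes, a PDF that is only scanned images would still need OCR after this step.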
