
web crawler infrastructure

by david2008 (Scribe)
on Jan 07, 2013 at 09:53 UTC (#1012005=perlquestion)
david2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I want to write an application that crawls a given page and all of its child pages (just one level deep).

I have the following requirements:

  • JavaScript handling. For example, some links run JavaScript code that opens a new window, and I want to parse the page that opens.
  • PDF, Word and PowerPoint parsing.
  • Authentication by cookies. On some sites you first have to log in, and every later request is then authenticated by the cookie.

Do you know of a CPAN module that provides this functionality?
I saw on Google that parts of this question have been asked before, but I want to know whether there is a single module that combines all of these features.

Thanks, David

Replies are listed 'Best First'.
Re: web crawler infrastructure
by marto (Bishop) on Jan 07, 2013 at 09:57 UTC

    "Do you know such a cpan module which can provide this functionality?"

    There's no module on CPAN which matches all three criteria. There are also other considerations: for example, PDF files may simply be scanned images, meaning you'd have to OCR them to get the text. See WWW::Mechanize::Firefox, PDF::OCR2, and Super Search.

Re: web crawler infrastructure
by Anonymous Monk on Jan 07, 2013 at 09:55 UTC
Re: web crawler infrastructure
by space_monk (Chaplain) on Jan 07, 2013 at 11:32 UTC

    As marto pointed out, there is no module which does all of what you ask, but one of the best ways to regard Perl is like Lego: if you use the right modules (bricks) you can build anything you want. Actually, Lego is perhaps not quite the right analogy; with Perl you get prefabricated components that, when used properly, help you get your house built faster. :-)
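To make the "bricks" idea concrete, here is a minimal sketch of how two of the three requirements might be combined. It assumes WWW::Mechanize from CPAN for fetching and cookie handling (it keeps an in-memory cookie jar by default). It does not run JavaScript; for that, something like the WWW::Mechanize::Firefox module marto mentioned would be needed. The names `handler_for` and `crawl_one_level`, the shape of the `$login` hash, and the handler labels are all made up for illustration, and choosing a parser by file extension is a simplification rather than a recommendation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map a URL to the kind of handler it needs, judged by file extension alone.
# The returned labels are hypothetical, not CPAN module names.
sub handler_for {
    my ($url) = @_;
    (my $path = $url) =~ s/[?#].*//s;    # drop query string and fragment
    return 'pdf'    if $path =~ /\.pdf$/i;
    return 'office' if $path =~ /\.(?:docx?|pptx?)$/i;
    return 'html';
}

# Fetch $start_url and every page it links to (one level only).
# $login is optional: { url => ..., fields => { ... } } for a form-based login.
sub crawl_one_level {
    my ($start_url, $login) = @_;

    require WWW::Mechanize;              # CPAN module, loaded only when crawling
    my $mech = WWW::Mechanize->new( autocheck => 0 );

    if ($login) {
        # The session cookie set here is sent with every later request.
        $mech->get( $login->{url} );
        $mech->submit_form( with_fields => $login->{fields} );
    }

    $mech->get($start_url);
    return {} unless $mech->success;

    # Collect absolute http(s) links from the start page, without duplicates.
    my %seen;
    my @children = grep { /^https?:/ && !$seen{$_}++ }
                   map  { $_->url_abs->as_string } $mech->links;

    my %result;
    for my $url (@children) {
        my $response = $mech->get($url);
        next unless $response->is_success;
        $result{$url} = {
            handler => handler_for($url),
            bytes   => length $response->content,
        };
    }
    return \%result;
}

# Run only when a start URL is given on the command line.
if (@ARGV) {
    my $pages = crawl_one_level( $ARGV[0] );
    printf "%-8s %s\n", $pages->{$_}{handler}, $_ for sort keys %$pages;
}
```

Each label returned by `handler_for` marks the point where a further brick would plug in: a PDF text extractor (plus OCR for scanned documents, as marto noted), an Office document parser, or an HTML parser.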

    A Monk aims to give answers to those who have none, and to learn from those who know more.
