
Parsing content found in onClick and Javascript calls

by hacker (Priest)
on Sep 02, 2007 at 14:51 UTC ( #636607=perlquestion )
hacker has asked for the wisdom of the Perl Monks concerning the following question:

As many of you know, I do a lot of screen-scraping as part of my projects.

The best way to test a spider/screen-scraper written in Perl (or any language) before pointing it at production content is to run it against... pr0n sites. No, seriously!

  1. They love the traffic
  2. There are tons of links to images, popups, broken HTML, and so on.
  3. A well-behaved web spider would barely be a blip on their radar.

But back on track. On some of the non-pr0n content I'm trying to scrape (a big news site), there are links to sub-pages whose content I need, and those links are hidden inside onClick() handlers and JavaScript calls. You click a news article title, a window pops up, and the content itself is in that secondary window.

I tried HTML::SimpleLinkExtor and friends to extract the links that point to those popup windows, but that module doesn't treat a remote URL buried inside a tag's onClick attribute as an href.

Here's a simplified example of what I'm trying to parse:

<td align="center" valign="middle"><a href = javascript:void(0) onmouseover="window.status='This is my news article'; return true;" onmouseout="window.status=''; return true;" onClick=" ('http://ne', 'News','alwaysRaised=1, toolbar=0, scrollbars=0, location=0, statusbar=0, menubar=0, resizable=0, width=620, height=400');" >New Link 0234</a></td>

In this code, clicking on "News Link 0234" on the main page will pop up a window that points to '', and that popup window contains the content I need to scrape.

Has anyone tried to do this? I can do it with some really ugly regexes and grep(), but I'd prefer a cleaner option.

Replies are listed 'Best First'.
Re: Parsing content found in onClick and Javascript calls
by moritz (Cardinal) on Sep 02, 2007 at 15:06 UTC
Re: Parsing content found in onClick and Javascript calls
by Your Mother (Chancellor) on Sep 03, 2007 at 03:23 UTC

    Nice tips. :)

    Treat it as plain text and use URI::Find?
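
A minimal sketch of that plain-text idea, assuming URI::Find is installed from CPAN. The markup, the example.com URL, and the window.open() call are hypothetical stand-ins (the real page's URL isn't shown in the question), and the http/https filter is my own addition to skip things like javascript:void(0):

```perl
use strict;
use warnings;
use URI::Find;

# Hypothetical markup: the URL and the window.open() call are stand-ins
# for illustration, not the asker's real page.
my $html = <<'HTML';
<a href="javascript:void(0)"
   onClick="window.open('http://news.example.com/article/0234', 'News', 'width=620, height=400');">News Link 0234</a>
HTML

my @urls;
my $finder = URI::Find->new(sub {
    my ($uri, $original_text) = @_;
    # Keep only web links; this skips pseudo-URLs such as javascript:void(0).
    push @urls, "$uri" if ( $uri->scheme // '' ) =~ /^https?$/;
    return $original_text;    # leave the scanned text unchanged
});
$finder->find(\$html);

print "$_\n" for @urls;
```

Because the whole page is scanned as text, this doesn't care whether the URL sits in an href, an onClick, or a script block. The trade-off is that it will also pick up URLs you may not want, so some filtering on host or path is probably needed.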

    Don't forget that one of the tricks they use is to make the JS hard to see so that filters/blockers won't catch them trying to put your browser into a circle-jerk. So you might, if you're being *thorough*, have to do something crazy like this (very unrefined):

    my $esc = qr/[\\'," ]/;
    m,w(?:$esc)*i(?:$esc)*n(?:$esc)*d(?:$esc)*o(?:$esc)*w(?:$esc)*\.(?:$esc)*o(?:$esc)*p(?:$esc)*e(?:$esc)*n(?:$esc)*\. ET CETERA,;
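
Rather than typing that pattern out by hand, it could be generated. Here is a rough sketch using only core Perl; tolerant_pattern() is a hypothetical helper name, and the sample strings are made up to show one obfuscated and one unrelated input:

```perl
use strict;
use warnings;

# Filler characters that obfuscated JS might interleave between letters
# (same character class as in the snippet above).
my $esc = qr/[\\'," ]/;

# Hypothetical helper: build a pattern that matches $word with any number
# of filler characters between consecutive characters, case-insensitively.
sub tolerant_pattern {
    my ($word) = @_;
    my $body = join "(?:$esc)*", map { quotemeta($_) } split //, $word;
    return qr/$body/i;
}

my $re = tolerant_pattern('window.open');

# Made-up inputs: plain call, a call split across quoted pieces, unrelated JS.
for my $js ( q{window.open('x')}, q{win','dow.','open}, q{document.write} ) {
    printf "%-20s %s\n", $js, ( $js =~ $re ? 'match' : 'no match' );
}
```

Note that the character class has no `+` in it, so string concatenation like 'win'+'dow' would slip past; widen $esc if the pages you scrape do that.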
