http://www.perlmonks.org?node_id=636607

hacker has asked for the wisdom of the Perl Monks concerning the following question:

As many of you know, I do a lot of screen-scraping as part of my projects.

The best way to test a spider/screen-scraper written in Perl (or any language) before pointing it at production content is to run it against... pr0n sites. No, seriously!

  1. They love the traffic
  2. There are tons of links to images, popups, broken HTML, and so on.
  3. A well-behaved web spider would barely be a blip on their radar.

But back on track. In some of the non-pr0n content (a big news site) I'm trying to scrape, there are links to sub-pages that I need content from, and those links are hidden inside onClick() and window.open calls in JavaScript. You click a news article title, a window pops up, and the content itself is in that secondary window.

I tried HTML::SimpleLinkExtor and friends to extract the links that point to those popup windows, but that module doesn't treat a remote URL inside an event-handler attribute such as onClick as an href.

Here's a simplified example of what I'm trying to parse:

<td align="center" valign="middle"><a href = javascript:void(0) onmouseover="window.status='This is my news article'; return true;" onmouseout="window.status=''; return true;" onClick="window.open ('http://news.example.com/article0234/', 'News','alwaysRaised=1, toolbar=0, scrollbars=0, location=0, statusbar=0, menubar=0, resizable=0, width=620, height=400');" >News Link 0234</a></td>

In this code, clicking on "News Link 0234" on the main page will pop up a window that points to 'http://news.example.com/article0234/', and that popup window contains the content I need to scrape.

Has anyone tried to do this? I can do it with some really ugly regexes and grep(), but I'd prefer a cleaner option.

Re: Parsing content found in onClick and window.open Javascript calls
by moritz (Cardinal) on Sep 02, 2007 at 15:06 UTC
Re: Parsing content found in onClick and window.open Javascript calls
by Your Mother (Archbishop) on Sep 03, 2007 at 03:23 UTC

    Nice tips. :)

    Treat it as plain text and use URI::Find?
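
A minimal sketch of that idea, assuming the CPAN module URI::Find is installed (it is not core Perl). The markup is a shortened copy of the sample in the question; the `@found` array and the http/https filter are illustrative choices, and how URI::Find treats text inside angle brackets may vary between versions:

```perl
use strict;
use warnings;
use URI::Find;    # CPAN module, not part of core Perl

# Shortened copy of the sample link from the question.
my $html = <<'HTML';
<a href = javascript:void(0) onClick="window.open ('http://news.example.com/article0234/', 'News', 'toolbar=0, width=620, height=400');">News Link 0234</a>
HTML

my @found;    # illustrative: collects the URLs we care about

my $finder = URI::Find->new(sub {
    my ($uri, $orig) = @_;    # URI object, original matched text

    # Keep only web links; this skips things like javascript:void(0).
    push @found, "$uri" if ($uri->scheme // '') =~ /^https?$/;

    return $orig;             # leave the scanned text unchanged
});

$finder->find(\$html);        # scans the text, calling the callback per URI

print "$_\n" for @found;
```

Because the page is scanned as plain text, this does not care which attribute or script block the URL lives in, which is also why some filtering by scheme (or host) is likely to be needed.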

    Don't forget that one of the tricks they use is to make the JS hard to see so that filters/blockers won't catch them trying to put your browser into a circle-jerk. So you might, if you're being *thorough*, have to do something crazy like this (very unrefined):

    my $esc = qr/[\\'," ]/;
    m,w(?:$esc)*i(?:$esc)*n(?:$esc)*d(?:$esc)*o(?:$esc)*w(?:$esc)*\.(?:$esc)*o(?:$esc)*p(?:$esc)*e(?:$esc)*n(?:$esc)*\. ET CETERA,;
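
Typing that pattern out by hand gets old quickly, so here is a rough sketch of generating it instead. It uses the same filler class as above; `tolerant_re` is a hypothetical helper name, and the sample strings are made up for illustration. It only tolerates the characters in `$esc`, so concatenation with `+` or other obfuscation tricks would still slip past it:

```perl
use strict;
use warnings;

# Filler characters allowed between letters (same class as in the reply).
my $esc = qr/[\\'," ]/;

# Hypothetical helper: build a regex that matches $word even when any
# run of filler characters is wedged between its characters.
sub tolerant_re {
    my ($word) = @_;
    my $pattern = join "(?:$esc)*", map { quotemeta } split //, $word;
    return qr/$pattern/;
}

my $re = tolerant_re('window.open');

# Made-up inputs: a plain call, a chopped-up one, and an unrelated call.
for my $js ( q{window.open('x')}, q{win","dow.o"p"en}, q{document.write} ) {
    printf "%-20s %s\n", $js, ( $js =~ $re ? 'match' : 'no match' );
}
```

Using quotemeta on each character keeps the literal dot in window.open from turning into a match-anything wildcard.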