hacker has asked for the wisdom of the Perl Monks concerning the following question:
As many of you know, I do a lot of screen-scraping as part of my projects.
The best way to test a spider/screen-scraper written in Perl (or any language) before pointing it at production content is to run it against... pr0n sites. No, seriously!
- They love the traffic.
- There are tons of links to images, popups, broken HTML, and so on.
- A well-behaved web spider would barely be a blip on their radar.
But back on track... On one of the non-pr0n sites I'm trying to scrape (a big news site), the links to the sub-pages I need content from are hidden inside onClick handlers that call window.open() in JavaScript. You click a news article title, a window pops up, and the content itself is in that secondary window.
I tried HTML::SimpleLinkExtor and friends to extract the links that point to those popup windows, but that module doesn't treat a URL buried inside an event-handler attribute as a link the way it does an href.
Here's a simplified example of what I'm trying to parse:
<td align="center" valign="middle"><a href = javascript:void(0) onmouseover="window.status='This is my news article'; return true;" onmouseout="window.status=''; return true;" onClick="window.open ('http://news.example.com/article0234/', 'News','alwaysRaised=1, toolbar=0, scrollbars=0, location=0, statusbar=0, menubar=0, resizable=0, width=620, height=400');" >New Link 0234</a></td>
In this code, clicking on "New Link 0234" on the main page pops up a window that points to 'http://news.example.com/article0234/', and that popup window contains the content I need to scrape.
Has anyone tried to do this? I can do it with some really ugly regexes and grep(), but I'd prefer a cleaner option.
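One possible direction, sketched here as an assumption rather than anything taken from the replies: let an HTML parser find the <a> tags and their attributes, and keep the regex confined to the small JavaScript string inside onClick. The sketch below uses HTML::TokeParser (part of the HTML-Parser distribution); the popup_links() helper name is hypothetical, and it only handles the simple case where the URL is a quoted string literal passed as the first argument to window.open().

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;    # ships with the HTML-Parser distribution

# Hypothetical helper: return every URL given as the first quoted
# argument of a window.open() call inside an <a> tag's onclick attribute.
sub popup_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot set up parser";
    my @urls;

    # get_tag('a') skips ahead to the next <a ...> start tag and returns
    # [ $tagname, \%attr, \@attrseq, $text ]; attribute names are lowercased.
    while (my $tag = $parser->get_tag('a')) {
        my $onclick = $tag->[1]{onclick} or next;

        # Grab the first quoted argument of each window.open(...) call.
        while ($onclick =~ /window\.open\s*\(\s*(['"])(.*?)\1/g) {
            push @urls, $2;
        }
    }
    return @urls;
}

# The sample markup from the question, used as test input.
my $html = <<'HTML';
<td align="center" valign="middle"><a href = javascript:void(0) onmouseover="window.status='This is my news article'; return true;" onmouseout="window.status=''; return true;" onClick="window.open ('http://news.example.com/article0234/', 'News','alwaysRaised=1, toolbar=0, scrollbars=0, location=0, statusbar=0, menubar=0, resizable=0, width=620, height=400');" >New Link 0234</a></td>
HTML

print "$_\n" for popup_links($html);
```

This still relies on one regex, but it only has to understand a single function call rather than the surrounding (possibly broken) HTML. URLs that the page builds at runtime, for example by string concatenation or through a wrapper function, would need a different approach.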
Replies are listed 'Best First'.
Re: Parsing content found in onClick and window.open Javascript calls
by moritz (Cardinal) on Sep 02, 2007 at 15:06 UTC
Re: Parsing content found in onClick and window.open Javascript calls
by Your Mother (Archbishop) on Sep 03, 2007 at 03:23 UTC