on Sep 02, 2007 at 14:51 UTC
hacker has asked for the wisdom of the Perl Monks concerning the following question:
As many of you know, I do a lot of screen-scraping as part of my projects.
The best way to test a spider/screen-scraper written in Perl (or any language) before pointing it at production content is to run it against... pr0n sites. No, seriously!
I tried HTML::SimpleLinkExtor and friends to extract the links that point to those popup windows, but that module doesn't treat a remote URL inside a tag as an href.
Here's a simplified example of what I'm trying to parse:
In this code, clicking on "News Link 0234" on the main page will pop up a window that points to 'http://news.example.com/article0234/', and that popup window contains the content I need to scrape.
Has anyone tried to do this? I can do it with some really ugly regexes and grep(), but I'd prefer a cleaner option.