PerlMonks  


As many of you know, I do a lot of screen-scraping as part of my projects.

The best way to test a spider or screen-scraper written in Perl (or any language) before pointing it at production content is to run it against... pr0n sites. No, seriously!

  1. They love the traffic
  2. There are tons of links to images, popups, broken HTML, and so on.
  3. A well-behaved web spider would barely be a blip on their radar.

But back on track... In some of the non-pr0n content I'm trying to scrape (a big news site), there are links to sub-pages whose content I need, and those links are hidden inside onClick handlers that call window.open via Javascript. You click a news article title, a window pops up, and the content itself is in that secondary window.

I tried HTML::SimpleLinkExtor and friends to extract the links that point to those popup windows, but that module doesn't treat a remote URL buried inside another attribute of the tag (here, onClick) as an href.

Here's a simplified example of what I'm trying to parse:

<td align="center" valign="middle"><a href = javascript:void(0) onmouseover="window.status='This is my news article'; return true;" onmouseout="window.status=''; return true;" onClick="window.open ('http://news.example.com/article0234/', 'News','alwaysRaised=1, toolbar=0, scrollbars=0, location=0, statusbar=0, menubar=0, resizable=0, width=620, height=400');" >New Link 0234</a></td>

In this code, clicking on "New Link 0234" on the main page pops up a window that points to 'http://news.example.com/article0234/', and that popup window contains the content I need to scrape.

Has anyone tried to do this? I can do it with some really ugly regexes and grep(), but I'd prefer a cleaner option.
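
To make the question concrete, here is a rough sketch of the direction I have in mind, not a finished solution. It assumes HTML::TokeParser (from the HTML-Parser distribution) is installed, lets the parser find each anchor and its onClick attribute, and then uses one small regex only on the attribute value. The popup_links() name is just something I made up for illustration, and the regex assumes the URL is the first quoted argument to window.open.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return [url, link text] pairs for every <a>
# whose onclick attribute calls window.open().
sub popup_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @links;

    while (my $tag = $parser->get_tag('a')) {
        # $tag->[1] is a hash ref of attributes; the parser
        # lowercases attribute names, so onClick becomes onclick.
        my $onclick = $tag->[1]{onclick} or next;

        # Assumption: the URL is the first quoted argument.
        if ($onclick =~ /window\.open\s*\(\s*(["'])(.*?)\1/) {
            my $url  = $2;
            my $text = $parser->get_trimmed_text('/a');
            push @links, [$url, $text];
        }
    }
    return @links;
}

# Demo against a cut-down version of the sample markup.
my $sample = <<'HTML';
<td><a href = javascript:void(0)
  onClick="window.open ('http://news.example.com/article0234/', 'News', 'width=620, height=400');"
  >New Link 0234</a></td>
HTML

for my $link (popup_links($sample)) {
    print "$link->[0]\t$link->[1]\n";
}
```

This still leans on a regex for the Javascript itself, so it would miss URLs that are built up from variables or string concatenation. It only keeps the HTML parsing out of regex territory.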


In reply to Parsing content found in onClick and window.open Javascript calls by hacker
