PerlMonks  

Parsing AJAX-based website

by Anonymous Monk
on Feb 09, 2008 at 15:23 UTC (#667175)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

For the first time I am trying to parse an AJAX-based website. My problem is that I don't know the general way of doing this.

Let's take the login form on that site, for example. It is a simple form with two fields (username, password) that calls a JavaScript function "j_security" as its action. Filling in the values with a POST or GET is pretty trivial with LWP, but I don't know how to do it in this case.

Most of the information on the site only gets fetched when a user clicks on links or certain elements. How do I simulate this to get to the information I want to extract actually?

Re: Parsing AJAX-based website
by grinder (Bishop) on Feb 09, 2008 at 16:25 UTC
    How do I simulate this to get to the information I want to extract actually?

    With great patience and attention to detail.

    It's just about impossible in the general case. Have a look at the WWW::Mechanize::FAQ, on the section "why don't you support JavaScript?". In a nutshell, you need to reimplement a JavaScript engine, and that's a non-trivial undertaking.

    The usual way of going about automating a website that makes heavy use of JavaScript is to insert an HTTP::Proxy-based proxy between your client and the server, to record precisely what is being passed back and forth. At the lowest level, it's just GET, POST and following redirects, and doing the same thing in your program.

    But depending on the site, this can be very difficult to do.
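    A minimal sketch of such a recording proxy, assuming HTTP::Proxy and its "simple" filter classes are installed. The port number and the log_line helper are made up for illustration, and only plain HTTP traffic is visible this way; HTTPS passes through the proxy as an opaque tunnel.

```perl
use strict;
use warnings;

# Format one log line per request (hypothetical helper, not part of HTTP::Proxy).
sub log_line {
    my ($method, $uri) = @_;
    return "$method $uri";
}

sub run_proxy {
    my ($port) = @_;

    # Loaded here so the helper above can be used without the proxy modules.
    require HTTP::Proxy;
    require HTTP::Proxy::HeaderFilter::simple;
    require HTTP::Proxy::BodyFilter::simple;

    my $proxy = HTTP::Proxy->new( port => $port );

    # Log the method and URL of every request the browser sends.
    $proxy->push_filter(
        mime    => undef,    # do not restrict the filter by MIME type
        request => HTTP::Proxy::HeaderFilter::simple->new(
            sub {
                my ($self, $headers, $message) = @_;
                print log_line( $message->method, $message->uri ), "\n";
            }
        ),
    );

    # Log request bodies (POST data); a body may arrive in several chunks.
    $proxy->push_filter(
        mime    => undef,
        request => HTTP::Proxy::BodyFilter::simple->new(
            sub {
                my ($self, $dataref, $message) = @_;
                print "    body: $$dataref\n" if length $$dataref;
            }
        ),
    );

    $proxy->start;    # blocks and serves requests until interrupted
}

# run_proxy(8080);   # then set the browser's HTTP proxy to localhost:8080
```

    Click through the site in the browser while the proxy runs, and the printed lines are the requests your own program has to reproduce.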

    • another intruder with the mooring in the heart of the Perl

      The JavaScript engine isn't even that big of a problem. There are at least two modules on CPAN that interface with the SpiderMonkey JS interpreter (though last time I checked, neither was complete, and using them wasn't exactly trivial).

      The real issue is that you'd have to implement pretty much the whole DOM too, including most of the non-standardized stuff. Which is really, really tedious work.

Re: Parsing AJAX-based website
by bradcathey (Prior) on Feb 09, 2008 at 16:37 UTC

    Take a look at AHAH, a subset of Ajax that is much easier and does much of the same. Also, no XML—I just don't need the power of XML for basic Web forms, so I don't have to deal with the overhead of parsing XML (although there are some CPAN modules to help). I had my first AHAH/Perl example built in about an hour (created a select tag dynamically based on user choices).

    —Brad
    "The important work of moving the world forward does not wait to be done by perfect men." George Eliot
      How is AHAH supposed to help someone trying to scrape an existing AJAX site?

      -sam

Re: Parsing AJAX-based website
by samtregar (Abbot) on Feb 10, 2008 at 20:00 UTC
    My advice is to ignore the AJAX layer and target the lower-level HTTP requests going from the browser to the server. You can capture these using a proxy like HTTP::Recorder or something like the Firebug plugin.

    If you're lucky you'll find that underneath the glitzy AJAX there's a relatively simple protocol - the browser POSTs some data (user/pass) and gets back some JSON or XML indicating the result (login succeeded or failed). You can then use WWW::Mechanize to imitate that protocol and extract the info you need. If the site authors did a good job this can actually be easier than scraping an HTML page.
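    As a rough sketch of imitating such a protocol, the code below POSTs a username and password and checks a JSON reply. The URL, the form field names and the "success" key are all assumptions; substitute whatever the captured traffic for your site shows.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical endpoint: take the real one from the captured traffic.
my $LOGIN_URL = 'http://www.example.com/login';

# Interpret the server's JSON reply. The "success" key is an assumption.
sub login_ok {
    my ($json_text) = @_;
    my $reply = eval { decode_json($json_text) };    # undef if not valid JSON
    return 0 unless ref($reply) eq 'HASH';
    return $reply->{success} ? 1 : 0;
}

# Send the same POST the browser's JavaScript would send.
sub login {
    my ($user, $pass) = @_;

    require WWW::Mechanize;    # loaded only when we really hit the network
    my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on HTTP errors

    # Field names "username" and "password" are guesses.
    $mech->post( $LOGIN_URL, { username => $user, password => $pass } );

    # decode_json wants raw UTF-8 bytes, so skip the charset decoding.
    my $body = $mech->response->decoded_content( charset => 'none' );

    # Return the logged-in agent (it holds the session cookies) or undef.
    return login_ok($body) ? $mech : undef;
}
```

    Keeping the reply parsing in its own sub means it can be checked against a saved response without touching the site.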

    Good luck!

    -sam

      I am trying to parse a website that loads a basic page, which I can fetch with $mech->get($url). But the tags and page data that I want are not in the returned content. I think they are loaded by an AJAX call, and I do not know how to make my script fetch them. Element inspection in Google Chrome, the F12 tools in IE, and Firebug in Firefox all show the tags I am looking for. How do I get to those tags? Help will be appreciated.

        Look at the Net tab in Firebug or the Network tab in Chrome and find all the requests going out.

        Identify the request you are interested in - find out how that part is constructed ( based on the AJAX code that is present ) and make that call.

        As a note, some of the modules mentioned above do this for you.
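        Once the request has been identified, making the call yourself might look like the sketch below. The URL and the {"tags": [...]} reply layout are invented for illustration; copy the real URL, parameters and headers from the Network tab.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical URL as it might appear in the browser's Network tab.
my $AJAX_URL = 'http://www.example.com/ajax/tags?page=1';

# Pull the tag names out of the reply. The {"tags": [...]} layout is assumed.
sub extract_tags {
    my ($json_text) = @_;
    my $data = decode_json($json_text);
    return @{ $data->{tags} || [] };
}

sub fetch_tags {
    require WWW::Mechanize;    # loaded only when we really hit the network
    my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on HTTP errors

    # Some servers only answer requests that look like an XMLHttpRequest.
    $mech->add_header( 'X-Requested-With' => 'XMLHttpRequest' );
    $mech->get($AJAX_URL);

    # decode_json wants raw UTF-8 bytes, so skip the charset decoding.
    return extract_tags( $mech->response->decoded_content( charset => 'none' ) );
}

# print "$_\n" for fetch_tags();
```

        If the reply turns out to be an HTML fragment instead of JSON, the same request still applies; only the parsing step changes.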
