PerlMonks  

Scraping AJAX?

by Dave_COS (Initiate)
on Jan 26, 2012 at 16:22 UTC
Dave_COS has asked for the wisdom of the Perl Monks concerning the following question:

Ok, I'm stumped.

I'm trying to extract data from a field on a dynamic web page. For an example, please go to

Dojo Demo

In this example, I'd like to extract 'Alexander' from the second row.

How the heck do I do this?

I've tried:

  • WWW::Mechanize -- no joy, it doesn't get the dynamic data.
  • WWW::Selenium -- can't figure out how to extract data (or even if I can)
  • Selenium::Remote::Driver -- can't figure out how to extract data (or even if I can)
  • WWW::HtmlUnit -- can't figure out how to extract data (or even if I can)

ARRGGGH! Has anyone done this, or can anyone point me in the right direction? I will be eternally grateful...

Dave

Re: Scraping AJAX?
by Anonymous Monk on Jan 26, 2012 at 16:31 UTC
Re: Scraping AJAX?
by kelchris (Novice) on Jan 26, 2012 at 16:40 UTC

    Use LWP::UserAgent or Mojo::UserAgent

    The datagrid is passed in json format from: http://dojotoolkit.org/documentation/tutorials/1.6/datagrid/demo/hof-batting.json

    Parse that JSON data using JSON::XS
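A minimal sketch of that suggestion follows. It uses LWP::UserAgent for the request and the core module JSON::PP for decoding (JSON::XS exports the same `decode_json` function, so it can be swapped in). The helper names `fetch_body` and `grid_rows` are made up for illustration, and the top-level "items" key is an assumption about how the grid data is laid out, not something confirmed in this thread.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Fetch a URL and return the raw response body; dies on an HTTP failure.
sub fetch_body {
    my ($url) = @_;
    require LWP::UserAgent;    # loaded lazily so grid_rows() works without it
    my $ua  = LWP::UserAgent->new(timeout => 30);
    my $res = $ua->get($url);
    die "GET $url failed: " . $res->status_line . "\n" unless $res->is_success;
    return $res->content;      # raw bytes, which is what decode_json expects
}

# Decode JSON text and return the list of row hashes.
# ASSUMPTION: the rows sit under a top-level "items" key; a bare top-level
# array is accepted as well.
sub grid_rows {
    my ($json_text) = @_;
    my $data = decode_json($json_text);
    my $rows = ref $data eq 'HASH' ? $data->{items} : $data;
    die "no row array found\n" unless ref $rows eq 'ARRAY';
    return @$rows;
}

# Usage (needs network access; "last" is an assumed field name):
# my @rows = grid_rows(fetch_body(
#     'http://dojotoolkit.org/documentation/tutorials/1.6/datagrid/demo/hof-batting.json'));
# print $rows[1]{last}, "\n";
```

Dumping one decoded row with Data::Dumper first is a quick way to learn the real field names before relying on any of them.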

Re: Scraping AJAX?
by Anonymous Monk on Jan 26, 2012 at 16:46 UTC
Re: Scraping AJAX?
by Anonymous Monk on Jan 26, 2012 at 16:57 UTC
    can't figure out how to extract data
    Learn XPath!
    use WWW::Mechanize::Firefox qw();
    use HTML::TreeBuilder::LibXML qw();

    my $mech = WWW::Mechanize::Firefox->new;
    $mech->get('http://dojotoolkit.org/documentation/tutorials/1.6/datagrid/demo/datagrid-simple.html');

    my $tree = HTML::TreeBuilder::LibXML->new;
    $tree->parse($mech->content);
    $tree->eof;

    my $name = $tree->findvalue('/html/body/div/div[2]/div/div/div/div/div[2]/table/tbody/tr/td[2]');

      I do know XPath.

      More details: I'm trying to scrape a Blackberry Administration Service webpage, which is behind a firewall; otherwise I would have given that link.

      I'm trying to get the element

      //div[@id='dojox_grid__View_5']/div/div/div/div/table/tbody/tr/td[20]/span

      Which Chrome "Inspect element" sees, and the Selenium IDE sets every time -- but when I try to run against WWW::Mechanize it shows NO content, and the Selenium packages state that the element does not exist.

      Forgive me if the solution is obvious, but I *have* read the other posts -- and tried the code -- and have had no success.

        That's because you need to get() the JSON data, which contains the actual table you want to scrape. The content you are getting is only the page that holds the JS functions. Somewhere in that page there should also be the AJAX link you can use to get the actual data.

        Once you get that link, get() it then parse the data from there.
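One rough way to hunt for that link is to scan the page source for quoted strings ending in `.json`. The sketch below assumes the data URL appears as a quoted literal in the HTML or inline script, which will not hold for every page (the URL may be built at runtime or live in an external .js file). The function name `find_json_urls` is hypothetical.

```perl
use strict;
use warnings;

# Return the unique quoted strings in $html that end in ".json",
# optionally followed by a query string, in order of first appearance.
sub find_json_urls {
    my ($html) = @_;
    my %seen;
    return grep { !$seen{$_}++ }
        $html =~ /["']([^"'\s]+\.json(?:\?[^"'\s]*)?)["']/g;
}

# Usage with content already fetched, e.g. via $mech->content:
# print "$_\n" for find_json_urls($page_source);
```

Anything found this way is often a relative path, so it has to be resolved against the page URL before calling get() on it; `URI->new_abs($relative, $page_url)` from the URI module does that. If nothing turns up, the browser's network inspector will show the request the page actually makes.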
Re: Scraping AJAX?
by ajinkyagadewar (Novice) on Apr 16, 2012 at 07:25 UTC

    When you request the URL for the Dojo Demo, what comes back is only the HTML page.

    Once that page has loaded, a second call goes to the URL 'http://dojotoolkit.org/documentation/tutorials/1.6/datagrid/demo/hof-batting.json' to pull the JSON data.

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://dojotoolkit.org/documentation/tutorials/1.6/datagrid/demo/hof-batting.json');


    To extract the data from the content, there are two alternatives:
    1. Use a simple regular expression to pull the required data.

    2. Convert the JSON to an array of hashes and pull the required data.
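The two alternatives can be compared side by side on a small inline sample. Everything about the sample is an assumption made for illustration: the "items" wrapper, the "first" and "last" field names, and the row values are stand-ins, so check the real response before reusing the keys. JSON::PP is used because it ships with Perl.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical sample; the real file may use different keys and values.
my $json = '{"items":[{"first":"Hank","last":"Aaron"},'
         . '{"first":"Pete","last":"Alexander"}]}';

# Alternative 2: decode into an array of hashes and index into it.
sub last_name_via_decode {
    my ($text, $row) = @_;
    my $items = decode_json($text)->{items};
    return $items->[$row]{last};
}

# Alternative 1: collect every "last" value with a regular expression.
# Brittle: it ignores nesting and escaped quotes inside values.
sub last_name_via_regex {
    my ($text, $row) = @_;
    my @names = $text =~ /"last"\s*:\s*"([^"]*)"/g;
    return $names[$row];
}

# Rows are zero-based, so index 1 is the second row.
print last_name_via_decode($json, 1), "\n";
print last_name_via_regex($json, 1), "\n";
```

Decoding is usually the safer choice, since it keeps working when the server reorders keys or changes whitespace. The regex version is only reasonable for a quick one-off.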

    Let me know if it helps.
