Re^9: Scraping Ajax / JS pop-up

by Corion (Patriarch)
on Feb 16, 2012 at 09:40 UTC


in reply to Re^8: Scraping Ajax / JS pop-up
in thread Scraping Ajax / JS pop-up

In my experience, you will have to look at the HTTP requests that go over the wire. The only "hands-off" solution that works well for my case is WWW::Mechanize::Firefox, but that should be no surprise as I wrote it. But even with WWW::Mechanize::Firefox, if you care about efficiency or speed, you will have to look at what HTTP requests are made and which requests can be skipped. Also, when automating a Javascript-heavy site, you will have to look at the Javascript to find out what functions to call instead of clicking elements on the page, so that you get the results in a more structured form.
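
A rough, untested sketch of that approach (the URL and the showDetails() function are made-up placeholders, and it assumes a running Firefox with the mozrepl extension that WWW::Mechanize::Firefox talks to):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize::Firefox;

    # Drives a running Firefox instance via the mozrepl extension
    my $mech = WWW::Mechanize::Firefox->new();
    $mech->get('http://example.com/listing');    # placeholder URL

    # Instead of clicking the element that opens the pop-up, call the
    # page's own Javascript function directly and take its return value
    my ($value, $type) = $mech->eval_in_page('showDetails(42)');   # placeholder function
    print "Pop-up data: $value\n";

    # Or just read the rendered HTML once the Ajax request has completed
    print $mech->content;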

My reason for automating Firefox is that Firefox is a supported and interactive platform. If a website does not work with Firefox, it's the website's fault, not the fault of my program. And I can watch Firefox as it navigates through the website, which is a plus while developing the automation.

Of course, the module needs Firefox, and Firefox needs a display. There is PhantomJS, but so far I have found its model of interaction between the controlling Javascript and the Javascript within the page (or rather the lack of one) lacking.

Re^10: Scraping Ajax / JS pop-up
by Monk-E (Initiate) on Feb 25, 2012 at 20:33 UTC
    So a quick update, to help anyone looking for a similar solution.

    I have a working scraper bot now, which handles the info in the AJAX/JS pop-up. I've had to resort to sniffing the HTTP with tools/browser plug-ins. I then mimic the HTTP POSTs that went over the wire using HTTP::Request::Common. This was the solution I was trying to avoid (as discussed above in this thread), primarily because if a bot needs to be more autonomous than mine, such as crawling, a more programmatic / self-contained solution is preferred. This is what I was trying to explain to Anonymous Monk. I tried several modules and ways to do that, without success. But I should note, for those who want to try, that I did not exhaust all the routes that had potential, so more work with something like WWW::Mechanize::Firefox could still be fruitful.

    If your scraper is specific to a stable site or does not need to be an autonomous crawler, I would recommend just cheating around the complexity and sniffing / mimicking the HTTP as described in this thread; a sketch of that approach follows below.
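
    A minimal, untested sketch of that sniff-and-mimic approach (the URL, form fields, and the X-Requested-With header are placeholders for whatever your sniffer actually showed going over the wire):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTTP::Request::Common qw(POST);

        my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );

        # Replay the POST that the browser made when the pop-up opened;
        # URL and form fields are placeholders for the sniffed request
        my $req = POST 'http://example.com/ajax/popup',
            [ item_id => 12345, action => 'details' ],
            'X-Requested-With' => 'XMLHttpRequest';

        my $res = $ua->request($req);
        die 'POST failed: ' . $res->status_line unless $res->is_success;

        # The response is typically JSON or an HTML fragment; parse as needed
        print $res->decoded_content;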

      primarily because if a bot needs to be more autonomous than mine, such as crawling, a more programmatic / self-contained solution is preferred. This is what I was trying to explain to Anonymous Monk.

      That was easily understood. Your insistence that it needs to be pure-perl is the problem.

        I don't recall "insisting". I do, however, recall seeking, with due diligence (as would, say... a Monk?), a Perl solution on a Perl forum. The reason is explained above. There is nothing intrinsically limiting about Perl (or most languages with its degree of flexibility) that prevents it from being able to do this, so the expectation is reasonable. As it turns out, this topic continues to be an area of active improvement effort in the Perl community.
Re^10: Scraping Ajax / JS pop-up
by Monk-E (Initiate) on Feb 22, 2012 at 01:12 UTC
    Thanks. :)
