PerlMonks  

How do I perform web automation with sites that use Javascript?

(#761527: categorized question)
Contributed by whakka on May 02, 2009 at 18:58 UTC


Description:

I need to write a script that crawls web pages, but these pages use JavaScript/AJAX for form processing and the like. LWP and WWW::Mechanize don't handle this case well, since neither executes JavaScript. What can I do?

Answer: How do I perform web automation with sites that use Javascript?
contributed by jettero

JavaScript::SpiderMonkey seems to be in use by scripts built as recently as Net::Plurk::Dumper. Something to add to the list in any case.

Answer: How do I perform web automation with sites that use Javascript?
contributed by jdporter

Here are some modules which give you a way around the issue:

Other things to try:

  • Disable JavaScript in your browser and see if the site still functions as you want. If so, then you don't actually have a problem :)
  • Figure out what the scripts are doing on the wire, and re-implement those transactions in your own program. The Firefox add-on Live HTTP Headers is well suited for this.
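
To make the second suggestion concrete, here is a minimal sketch of replaying a form submission that a page's script would normally send. The URL, the field names, and the X-Requested-With header are hypothetical placeholders, not taken from any real site; substitute whatever you actually observe on the wire with a tool such as Live HTTP Headers. The helper name build_ajax_request is also just for illustration.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build the POST request that the page's JavaScript would have sent.
# Everything passed in here is an assumption: copy the real URL, fields,
# and headers from the traffic you captured.
sub build_ajax_request {
    my ($url, $fields) = @_;
    return POST(
        $url,
        'X-Requested-With' => 'XMLHttpRequest',  # some endpoints check for this header
        Content            => $fields,           # array ref => form-encoded body
    );
}

my $req = build_ajax_request(
    'http://www.example.com/ajax/search',        # hypothetical endpoint
    [ q => 'perl', page => 1 ],                  # hypothetical form fields
);

# Inspect the request before sending it anywhere.
print $req->as_string;

# To actually send it (requires network access):
#   use LWP::UserAgent;
#   my $ua  = LWP::UserAgent->new;
#   my $res = $ua->request($req);
#   die $res->status_line unless $res->is_success;
#   print $res->decoded_content;
```

Printing the request first lets you compare it line by line against the captured traffic. If the site also relies on cookies or hidden form tokens, those have to be fetched and replayed as well, which this sketch does not attempt.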

This info provided by the OP.

Answer: How do I perform web automation with sites that use Javascript?
contributed by planetscape

Don't forget Limbic~Region's excellent Tutorial, Using WWW::Selenium To Test Or Automate An Ajax Website.

Answer: How do I perform web automation with sites that use Javascript?
contributed by ninuzzo

A recent addition of mine is WWW::HtmlUnit::Spidey. This module uses the Java library HtmlUnit, which is a headless browser with pretty good JavaScript support. Do not worry, you won't have to write any Java code :D

It suits large-scale web scraping, where screen scraping does not scale and may be unstable.

There is a tutorial here that scrapes data from a form that does not work without JavaScript support.

Btw I am just a Perl beginner. Any Perl guru interested in co-developing Spidey?
