
how to access HTML within a javascript

by Special_K (Beadle)
on Mar 20, 2013 at 03:00 UTC ( #1024419=perlquestion )
Special_K has asked for the wisdom of the Perl Monks concerning the following question:

Using Firefox and Firebug, I have determined that a page I would like to automate my interactions with using Perl has the majority of its HTML apparently generated by a single JavaScript file. When I first load the page with Firebug on, I see:

body < html.js

in the status bar. All of the links, forms, etc. I need to interact with on the page can be found in the HTML code window of Firebug. The strange thing is that if I right-click on the page itself and click "view source", none of that HTML is visible, and neither is any reference to this so-called html.js script. Is there any Perl module that can get at the HTML that's apparently hidden inside html.js?

Replies are listed 'Best First'.
Re: how to access HTML within a javascript
by davido (Archbishop) on Mar 20, 2013 at 04:59 UTC

    JavaScript can create content for the browser dynamically. A page that is heavily dependent on JavaScript can be difficult to scrape or automate, because often first you've got to execute the JavaScript to see what content it produces.

    While you're not going to find a Perl module with an embedded JavaScript interpreter, you can find tools that will help bail you out of a difficult situation. One is Corion's WWW::Mechanize::Firefox. Another is Selenium (teamed up with CPAN modules that drive Selenium). These are two totally different approaches, and both require a bit of work on your part as a programmer, but they are reasonable answers to the JavaScript problem.
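Before reaching for either tool, it can help to confirm that the content really is produced by JavaScript. The following is a minimal sketch, not part of the original replies: it fetches the raw page source with the core HTTP::Tiny module and checks whether a piece of text you expect (a link, a form tag) is already there. The helper name `source_contains` and the command-line arguments are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;    # core module since Perl 5.14

# Return 1 if the raw, unexecuted page source already contains the marker text.
sub source_contains {
    my ($html, $marker) = @_;
    return index($html, $marker) >= 0 ? 1 : 0;
}

# Hypothetical usage: perl check_static.pl <url> <marker text>
if (@ARGV == 2) {
    my ($url, $marker) = @ARGV;

    # Plain HTTP fetch: no JavaScript is executed here.
    my $res = HTTP::Tiny->new->get($url);
    die "Fetch failed: $res->{status} $res->{reason}\n" unless $res->{success};

    my $found   = source_contains($res->{content}, $marker);
    my $verdict = $found
        ? "Marker is in the static source; a plain HTTP client may be enough.\n"
        : "Marker is missing; the content is probably generated by JavaScript.\n";
    print $verdict;
}
```

If the marker is missing from the static source but visible in Firebug, that matches the situation described in the question, and a browser-driving tool is likely needed.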


        Indeed. WWW::Scripter is powered by JE, a very good pure-Perl JavaScript implementation. Other JavaScript implementations for Perl include JavaScript::SpiderMonkey and JavaScript::V8, which are generally faster but offer poorer integration between the JavaScript code and the Perl code.

        package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
Re: how to access HTML within a javascript
by sundialsvc4 (Abbot) on Mar 20, 2013 at 17:32 UTC

    If this is the file that I remember, it’s a library for more-easily making modifications to the DOM (Document Object Model) ... which is the data-structure that is initially built by the browser during the course of parsing the HTML.

    The key realization here is that the initial state of the DOM is only the initial state.   Most JavaScript programs work by altering the DOM.   They can create, remove, alter any of the nodes in the DOM-tree ... all sorts of wonderful and marvelous things ... and the browser’s display will follow suit.   The actual DOM structure that you see ... has no “source” to be viewed.   It is an output of a (JavaScript) computer program.

      I'm updating this page so anyone who reads it later will know the resolution. It turns out that WWW::Mechanize::Firefox appears to solve my problem. Here is the script I used:

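      #!/usr/bin/perl -w
      use strict;
      use WWW::Mechanize::Firefox;

      # Output file for the rendered document
      my $doc_filename = "/home/user1/doc.txt";
      open(my $doc_fh, '>', $doc_filename) || die "$!";

      # Drive a running Firefox instance and load the page there,
      # so the page's JavaScript actually executes
      my $mech = WWW::Mechanize::Firefox->new(activate => 1);
      $mech->get("<your_URL_here>");

      printf("title: %s\n", $mech->title());
      # document() is the call used in the original script; content() is the
      # documented way to get the page HTML as a string
      printf {$doc_fh} "%s\n", $mech->document();
      close($doc_fh);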

      After running the above script, the generated doc.txt file contains all html inserted by the javascript. I obviously can't guarantee this will work on every page, but it could at least be a starting point for anyone who finds this thread while searching for a way to scrape a page containing javascript.
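Since the original goal was to interact with links and forms, a natural follow-up is to list the links in the rendered page instead of dumping the whole document. This is only a sketch under some assumptions: it needs Firefox running with the MozRepl add-on, the helper names `describe_link` and `list_rendered_links` are made up for illustration, and on some pages the JavaScript may still be adding content when `get` returns, so a wait may be needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Format one link for display; kept separate so it can be checked without a browser.
sub describe_link {
    my ($text, $url) = @_;
    $text = '(no text)' unless defined $text && length $text;
    return sprintf("%-30s %s", $text, $url);
}

# Load the page in Firefox and print every link found in the rendered DOM.
sub list_rendered_links {
    my ($url) = @_;

    # Loaded lazily: requires Firefox with the MozRepl add-on running.
    require WWW::Mechanize::Firefox;
    my $mech = WWW::Mechanize::Firefox->new();
    $mech->get($url);

    # links() returns WWW::Mechanize::Link objects for the current document.
    for my $link ($mech->links) {
        my $line = describe_link($link->text, $link->url);
        print "$line\n";
    }
    return;
}

# Hypothetical usage: perl list_links.pl <url>
list_rendered_links($ARGV[0]) if @ARGV;
```

From there, the WWW::Mechanize-style methods (following a link, filling in a form) operate on the same rendered page.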

Node Type: perlquestion [id://1024419]
Approved by davido
Front-paged by Corion