PerlMonks  

Running JavaScript from within Perl

by anautismobserver (Acolyte)
on Sep 13, 2019 at 01:06 UTC (#11106108, perlquestion)

anautismobserver has asked for the wisdom of the Perl Monks concerning the following question:

My goal is to develop an automated means to obtain the number of followers for each of a list of WordPress blog feeds (eg https://wordpress.com/read/feeds/93815501), which display the number of followers in a way that is captured by a copy-and-paste but is not in the HTML page source code. In order for these pages to display properly I need to be logged into a WordPress account and have JavaScript enabled in the browser.

So far I have the following code:

use strict;
use warnings;
use LWP::UserAgent;
use LWP::Simple;
use HTML::TreeBuilder;

print HTML::TreeBuilder->new_from_url('https://wordpress.com/read/feeds/94271045')->as_text;

This code produces the following output:

> WordPress.comPlease enable JavaScript in your browser to enjoy WordPress.com.

Naively it seems to me that since my browser can interpret a web page using JavaScript without any a priori information, Perl should be able to as well. Is this possible? If not, why not?

I'm a Perl novice who wants to get code running without learning Perl "from the ground up". My strategy has been to find working code samples that do pieces of what I want, then change them incrementally until they do all I want. I'm using Strawberry Perl on Windows.

Can you offer guidance or link to somewhere that explains it for novices like me? Thank you.

Replies are listed 'Best First'.
Re: Running JavaScript from within Perl (or just use the API)
by hippo (Chancellor) on Sep 13, 2019 at 08:17 UTC
    Can you offer guidance

    Did you know that (or even consider whether) WordPress has an API? Not only that but there is already a whole range of modules on CPAN which use it. Perhaps the ability to retrieve the follower count is available via that API and will save you all this scraping and javascripting and whatnot.

      I tried following the tutorial A Beginner's Guide to the WordPress REST API.

      It didn't work for my own (free) WordPress account, but when I used "the-art-of-autism.com" (a premium account on which I have admin privileges) in place of "yourdomain.com" I was able to follow the tutorial successfully.

      However, none of the Routes or Endpoints seem to give me what I want, which is the number of followers for an arbitrary WordPress account on which I don't have admin privileges. I'm encouraged by the REST API Handbook Reference page stating "The REST API provides public data accessible to any client anonymously, as well as private data only available after authentication."

      I can't find any way to determine the number of followers, or what public data is accessible anonymously. Can you help with either of those? Thanks.

        It seems that the URL to use is

        https://developer.wordpress.com/docs/api/1.1/get/sites/$site/stats/followers/

        ... but you need to be authenticated:

        curl "https://public-api.wordpress.com/rest/v1.1/sites/the-art-of-autism.com/stats/followers"
        {"error":"unauthorized","message":"user cannot view stats"}

        So, you will either have to get permission from the respective sites or you will have to continue scraping the websites.

      (Updated and clarified) The following endpoint:

      https://public-api.wordpress.com/rest/v1/read/feed/?url=the-art-of-autism.com

      contains a "feed" URL:

      https://public-api.wordpress.com/rest/v1/read/feed/34259929

      that I want to read.

      The following code (based on this JSON Tutorial) gives an error "Use of uninitialized value $feedurl in print".

      use strict;
      use warnings;
      use Mojo::UserAgent;

      my $url = 'https://public-api.wordpress.com/rest/v1/read/feed/?url=the-art-of-autism.com';
      my $ua = Mojo::UserAgent->new;
      my $feedurl = $ua->get( $url )->result->json->{'feeds.meta.links.feed'};
      print $feedurl;

      Please tell me what I'm doing wrong. Thanks.

        What happened when you tried to adapt one of the previous examples you've been given?

        The following code works to assign $subscribers to subscribers_count, but gives an error "Use of uninitialized value $feedurl in print" for the assignment of $feedurl.

        use strict;
        use warnings;
        use Mojo::UserAgent;

        my $url = 'https://public-api.wordpress.com/rest/v1/read/feed/34259929';
        my $ua = Mojo::UserAgent->new;
        my $subscribers = $ua->get($url)->result->json->{subscribers_count};
        print "Number of subscribers: $subscribers\n";
        my $feedurl = $ua->get( $url )->result->json->{'meta.links.self'};
        print $feedurl;

        Please tell me what I'm doing wrong. Thanks.
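
A possible explanation, sketched under assumptions: decoded JSON becomes nested Perl hashes, and a dotted string such as 'meta.links.self' is treated as one literal key rather than as a path. The sample document below is hypothetical (including the count of 123) and only mimics the nesting implied above. It uses the core JSON::PP module so that it runs without network access.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical sample shaped like the response described above;
# the real API response may contain more (or different) fields.
my $sample = '{"subscribers_count":123,"meta":{"links":{"self":"https://public-api.wordpress.com/rest/v1/read/feed/34259929"}}}';

my $data = decode_json($sample);

# A dotted string is looked up as ONE literal key, which does not exist here.
my $wrong = $data->{'meta.links.self'};

# Each level of nesting needs its own subscript.
my $self_link = $data->{meta}{links}{self};

print defined $wrong ? "dotted key: found\n" : "dotted key: undef\n";
print "self link: $self_link\n";
```

If the real response nests the field this way, the Mojo::UserAgent call would end in ->json->{meta}{links}{self}. For the first endpoint the lookup might be ->json->{feeds}[0]{meta}{links}{feed}, assuming "feeds" is an array; that has not been checked against the live API.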

Re: Running JavaScript from within Perl
by haukex (Chancellor) on Sep 13, 2019 at 05:34 UTC
    Naively it seems to me that since my browser can interpret a web page using JavaScript without any a priori information, Perl should be able to as well. Is this possible? If not, why not?

    JavaScript has access to a ton of things implemented in the browser, like the HTML document's DOM, various JavaScript APIs, and so on. To run JS code correctly, Perl would need to provide all of those, essentially re-implementing a browser, which is of course incredibly complex. See also the "JavaScript" section in WWW::Mechanize::FAQ.

    (For the general case of running JS from Perl, there was a talk in Riga: Embedding JavaScript in Perl.)

Re: Running JavaScript from within Perl
by Marshall (Abbot) on Sep 13, 2019 at 02:43 UTC
    Perl can't run JavaScript itself. One solution is to use WWW::Mechanize::Chrome. Previously it was possible to automate Firefox and I played with that, but unfortunately Firefox took out the interface that allowed the automation to happen. I haven't used the Chrome version yet. Anyway, the idea is to have Perl control Chrome, which will run the JavaScript code. Then you read what Chrome figured out.
Re: Running JavaScript from within Perl
by harangzsolt33 (Pilgrim) on Sep 14, 2019 at 05:10 UTC
    The JavaScript program on a web page can dynamically modify the page, so what you see has very little or no resemblance to the HTML source code! So, if you can scrape your web page using JavaScript, you get a peek at what's actually on the screen.

    Here is an example. When you click on the "View HTML" button on this page, you'll see one thing. Then you click on the "Change" button which modifies the code, and then you click on View HTML again, and you'll see the code with some slight changes. The source code hasn't changed, but what's in the memory has changed, and when you get to harvest that, you get the real picture.

    Here is the JavaScript program that harvests the HTML code:

    var DATA = document.all[0].innerHTML;

    If the block of HTML code you're trying to harvest is marked with an ID tag like this:

    <DIV ID="Part3"> ... OR <P ID="MyText"> ... OR <TABLE ID="Table2"> ...
    then you don't need to harvest the entire HTML page. All you have to do is harvest whatever is tagged. So, you would just do this:

    var DATA = document.getElementById("Part3").innerHTML;

    Instead of using "innerHTML," you could also use "innerText" which gives you only the plain text without all the HTML tags and whatnot:

    var DATA = document.getElementById("Part3").innerText;

    Once you have the code in the DATA variable, you can run a regex or something to get the actual number you're looking for. JavaScript regex works like Perl's.
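
For the Perl side of that comparison, here is a minimal sketch of pulling a number out of harvested text with a regex. The sample text, its "N followers" wording, and the helper name follower_count are assumptions made for illustration; the real page text may look different.

```perl
use strict;
use warnings;

# Hypothetical text, standing in for what innerText might return;
# the real wording on the page may differ.
my $text = "The Art of Autism 1,234 followers";

# Hypothetical helper: return the number that precedes "followers",
# or undef when the text contains no such number.
sub follower_count {
    my ($page_text) = @_;
    # Capture digits (with optional thousands separators) before "followers".
    if ( $page_text =~ /(\d[\d,]*)\s+followers\b/i ) {
        ( my $n = $1 ) =~ tr/,//d;    # strip the commas
        return $n;
    }
    return;    # no match
}

my $count = follower_count($text);
print defined $count ? "$count\n" : "no follower count found\n";
```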

    <HTML>
    <BODY>
    <NOSCRIPT>
    <DIV STYLE="BACKGROUND-COLOR:RED; COLOR:WHITE; FONT-FAMILY:ARIAL;"><CENTER>This page requires JavaScript.</CENTER>
    </DIV>
    </NOSCRIPT>
    <H3 ID="HEADING">Welcome</H3>
    <DIV ID="CONTENT">
    <P>This is a very simple HTML page.
    <P><INPUT TYPE=BUTTON VALUE=" View HTML " onClick="ViewHTML();">
    <INPUT TYPE=BUTTON VALUE=" Change " onClick="DoSomething();">
    </DIV>
    <SCRIPT>
    function ViewHTML()
    {
      var DATA = document.all[0].innerHTML;
      alert("This is the page content as seen from JavaScript:\n\n" + DATA);
    }
    function DoSomething()
    {
      document.getElementById("HEADING").innerHTML = "<FONT COLOR=BLUE>DEAR VISITOR</FONT>";
      var MyCONTENT = document.getElementById("CONTENT");
      MyCONTENT.innerHTML = "<FONT COLOR=RED>" + MyCONTENT.innerHTML;
    }
    </SCRIPT>

    I tested the above code, and it works in Firefox 52, KMeleon 7.5, QupZilla 1.8.6, Safari 5.1.7, Google Chrome 75, Internet Explorer 6, Opera 7.5, and Vivaldi 1.0. I have also tested it with an iPhone 7, Nokia Lumia 930 Windows Phone and an old Android 6 tablet. I haven't used any "ultra modern technology" that will break your phones. Everything in this example script is pretty standard.

    Once you get the number you want to send back to your Perl script, you could send it back by loading a picture:

    <HTML>
    <BODY>
    <IMG NAME=PIX6 BORDER=0 WIDTH=1 HEIGHT=1 STYLE="POSITION:ABSOLUTE; TOP:0; LEFT:0;">
    <SCRIPT>
    NUMBER = 90;
    document.images.PIX6.src = "http://www.yourwebsite.com/yourscript.pl?" + NUMBER;
    </SCRIPT>

    Here you're sending the number 90 back to your perl script.

    You could also signal to your perl script when somebody loads your web page with JavaScript turned off by putting a picture within the NOSCRIPT tags. Whatever you put between the NOSCRIPT tags will only appear when JavaScript is disabled on the page:

    <NOSCRIPT>
    <IMG SRC="http://www.yourwebsite.com/yourscript.pl?N" BORDER=0 WIDTH=1 HEIGHT=1 STYLE="POSITION:ABSOLUTE; TOP:0; LEFT:0;">
    </NOSCRIPT>

      None of this addresses what OP is trying to achieve.

        I'm having trouble understanding the WordPress.com REST API documentation. The example given for GET /sites/$site/posts/ is

        curl 'https://public-api.wordpress.com/rest/v1.1/sites/en.blog.wordpress.com/posts/?number=2'

        which I couldn't figure out how to make work.

        By contrast, the example provided in A Beginner's Guide to the WordPress REST API is

        curl -X GET -i http://the-art-of-autism.com/wp-json/wp/v2/posts

        which does work.

        Can you help me reconcile the two (which will hopefully help me interpret the rest of the WordPress REST API documentation)?

        Also, do REST API Resources only work on premium WordPress sites? I was able to execute GET /sites/$site/posts/ on the-art-of-autism.com (a premium site) but not on anautismobserver.wordpress.com (a free site). Do you know the reason for this?

        I really appreciate your help. You've already saved me a great deal of time and effort (and greatly increased my success chances). Thank you ever so much.

        Using GET /read/feed/$feed_url_or_id I can generate a web page containing the number of followers shown as "subscribers_count".

        How do I read this page into a perl script? I tried HTML::TreeBuilder and got the error message:

        https://public-api.wordpress.com/rest/v1/read/feed/http%3A%2F%2Fthe-art-of-autism.com%2Ffeed returned application/json not HTML

        Should I use WWW::Mechanize::Chrome, JSON, JavaScript, or something else? How do I provide them input from a URL?
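
Since the error says the endpoint returns application/json, one option is to decode the body with a JSON module instead of an HTML parser. The sketch below assumes the body is a JSON object with a "subscribers_count" field, as described above. The sample body, its "name" field, the count of 123, and the helper name subscribers_from_json are hypothetical. JSON::PP ships with Perl, so nothing extra needs installing.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical response body; a real one would come from an HTTP client.
my $body = '{"name":"Example feed","subscribers_count":123}';

# Hypothetical helper: return subscribers_count from a JSON object,
# or undef when the text is not valid JSON or is not an object.
sub subscribers_from_json {
    my ($json_text) = @_;
    my $data = eval { decode_json($json_text) };    # undef if decoding dies
    return unless ref $data eq 'HASH';
    return $data->{subscribers_count};
}

my $subscribers = subscribers_from_json($body);
print defined $subscribers
    ? "Number of subscribers: $subscribers\n"
    : "No subscribers_count in response\n";
```

The body itself could be fetched with any HTTP client, for example LWP::UserAgent's get followed by decoded_content on the response, before handing the text to the helper.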

        In response to your short example using Mojo::UserAgent: (which I couldn't figure out how to respond to directly):

        I modified your code as follows to read URLs from a file:

        use strict;
        use warnings;
        use Mojo::UserAgent;

        my $filename = 'urls_Mojo.txt';
        open(my $fh, '<:encoding(UTF-8)', $filename)
          or die "Could not open file '$filename' $!";
        my $y = 0;    # input row count
        while (my $row = <$fh>) {
            $y++;
            print $y;
            print " $row";
            my $url = $row;
            # create a Mojo::UserAgent
            my $ua = Mojo::UserAgent->new;
            # use $ua to get the url and assign the value of 'subscribers_count' in the json
            # to a variable, $subscribers
            my $subscribers = $ua->get( $url )->result->json->{subscribers_count};
            # print the variable to screen
            print "Number of subscribers: $subscribers\n";
        }

        it worked when the file 'urls_Mojo.txt' contained

        https://public-api.wordpress.com/rest/v1/read/feed/http%3A%2F%2Fthe-art-of-autism.com%2Ffeed

        but gave a "Can't use an undefined value as a HASH reference" error when I added a second line to 'urls_Mojo.txt' as follows:

        https://public-api.wordpress.com/rest/v1/read/feed/http%3A%2F%2Fthe-art-of-autism.com%2Ffeed
        https://public-api.wordpress.com/rest/v1/read/feed/http%3A%2F%2Fanautismobserver.wordpress.com%2Ffeed

        Can you help me figure out how to apply this script to a list of URLs in a file? Thanks.
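
One likely cause, though not confirmed in this thread, is that each line read from the file still carries its line ending, so the URL handed to get is not the intended one and json returns undef when the response is not JSON. The sketch below shows only the file-reading part. The helper name read_urls and the example.com URLs are hypothetical, and an in-memory file stands in for 'urls_Mojo.txt'.

```perl
use strict;
use warnings;

# Read one URL per line, dropping line endings and blank lines.
sub read_urls {
    my ($fh) = @_;
    my @urls;
    while ( my $row = <$fh> ) {
        $row =~ s/\s+\z//;          # strips "\n" and also a Windows "\r\n"
        next unless length $row;    # skip empty lines
        push @urls, $row;
    }
    return @urls;
}

# Demonstration with an in-memory file (hypothetical URLs).
my $list = "https://example.com/feed-one\r\n\nhttps://example.com/feed-two\n";
open( my $fh, '<', \$list ) or die "Could not open in-memory file: $!";
my @urls = read_urls($fh);
close $fh;

print scalar(@urls), " URLs read\n";
print "[$_]\n" for @urls;
```

Inside the fetch loop it may also help to store the decoded JSON in a variable first and skip the URL with a warning unless that variable is a hash reference, so that one bad response does not stop the whole run.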

        When I type

        curl --help

        I get a list of options. Does that mean I have curl installed? (I don't remember installing it.)

        If not, please tell me how to install it from the zip file. Thanks.

        Do you know of a way to indent a script automatically? I currently use Notepad++ and also have Komodo IDE installed (but would happily use another editor that can indent automatically).

      I don't mind responses that go beyond the narrow bounds of what I asked. It's like learning a new language: sometimes it's best to immerse myself in the new culture and see what I can absorb.

      I like learning new software by following tutorials (though this runs the risk of learning outdated information). I'm starting to work through The Ultimate Guide To The WordPress REST API (written in September 2015 by Josh Pollock). He recommends using Vagrant, VirtualBox, and Git, which I've downloaded and installed on my computer.

      Is The Ultimate Guide To The WordPress REST API a good resource (obtained from here)?

      Do you know of any better (perhaps newer) tutorials for the WordPress REST API?

        "I don't mind responses that go beyond the narrow bounds of what I asked. It's like learning a new language: sometimes it's best to immerse myself in the new culture and see what I can absorb."

        The method described in the post you're replying to won't help you achieve what you asked to do. If you're interested in learning about JavaScript and HTML/DOM manipulation there are better resources (from the Mojolicious docs):

        "All web development starts with HTML, CSS and JavaScript, to learn the basics we recommend the Mozilla Developer Network. And if you want to know more about how browsers and web servers actually communicate, there's also a very nice introduction to HTTP."

        "I've downloaded and installed Vagrant, VirtualBox, and Git for Windows on my computer"

        What part of the problem does this solve?

        "Is this a good resource? (https://wpengine.com/resources/the-ultimate-guide-to-the-wordpress-rest-api/)"

        I've no idea, you need to register to download an ebook.

        "Do you know of any better (perhaps newer) tutorials for the WordPress REST API?"

        What is missing from the official WordPress documentation?

        Update: see Re^3: Running JavaScript from within Perl (or just use the API) and https://developer.wordpress.com/docs/api/1.1/get/sites/%24site/stats/followers/.

Re: Running JavaScript from within Perl
by Anonymous Monk on Sep 13, 2019 at 02:40 UTC
Re: Running JavaScript from within Perl
by FreeBeerReekingMonk (Deacon) on Sep 17, 2019 at 20:20 UTC
    Maybe also using PhantomJS works for you? (it has JavaScript interpretation). Note that it is almost abandoned.

Node Type: perlquestion [id://11106108]
Approved by haukex