PerlMonks  

Re^2: getting content of an https website

by Aldebaran (Curate)
on Sep 01, 2015 at 02:21 UTC ( [id://1140590]=note )


in reply to Re: getting content of an https website
in thread getting content of an https website

Thanks, tangent, that's got it. With a little help from HTML::Tree, this suffices:

use strict;
use warnings;
use feature 'say';
use LWP::UserAgent;
use HTML::Tree;

my $url = 'https://berniesanders.com/issues/racial-justice/';
my $ua  = LWP::UserAgent->new();
# Identify as a desktop Firefox; the string is only a request header
$ua->agent('Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0');
my $response = $ua->get($url);
my $content  = $response->content;

if ( $content =~ m/enemy/i ) {
    say "enemy found";
}
else {
    my $tree = HTML::Tree->new();
    $tree->parse($content);
    $tree->eof;    # tell the parser the document is complete
    print $tree->as_text;
}

I've seen code like this before, and I thought I actually needed to have the browser in question installed, but apparently not. Am I correct to think that the string need not have anything to do with the actual machine it runs on? Is the string you used a good overall choice for such queries?

I'd like to consider a related question, given that we're barely warmed up here. I've always wanted the functionality of having mechanized events happen and then having an actual browser opened. I don't know if one browser works better than another for this, but I use Chrome for most of my day-to-day browsing. Clearly, I would have to define a path to the executable, which I believe is here:

 Directory of C:\Program Files (x86)\Google\Chrome\Application

08/22/2015  03:42 AM    <DIR>          .
08/22/2015  03:42 AM    <DIR>          ..
08/14/2015  12:43 PM    <DIR>          44.0.2403.155
08/22/2015  03:42 AM    <DIR>          44.0.2403.157
08/17/2015  10:23 PM           813,896 chrome.exe
06/03/2013  04:26 PM            18,546 master_preferences
06/19/2014  02:37 AM    <DIR>          Plugins
08/22/2015  03:42 AM               399 VisualElementsManifest.xml

How might I open the url from the original post in this browser?

Re^3: getting content of an https website
by Anonymous Monk on Sep 01, 2015 at 03:40 UTC

      Thanks AM, I got pretty far with this:

      use strict;
      use warnings;
      use feature 'say';
      use HTML::Display;
      use LWP::UserAgent;

      my $url = 'https://berniesanders.com/issues/racial-justice/';
      my $ua  = LWP::UserAgent->new();
      $ua->agent('Windows Mozilla');
      my $response = $ua->get($url);
      my $content  = $response->content;

      $ENV{'PERL_HTML_DISPLAY_COMMAND'} =
          'run "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" %s';
      my $browser = HTML::Display->new();
      if ( defined($browser) ) {
          $browser->display( html => $content );
      }
      else {
          print("Unable to open browser: $@\n");
      }

      Almost everything gets displayed except the big banner on top and some stylized words at the bottom. The links with absolute urls work, but there seems to be some clunkiness in the forward and back arrows on the browser, when it comes back to the original. And what is the original? In the url it looks like this:

      file:///C:/cygwin64/tmp/9EQdRdu_5w.html

      I have trouble deciding how "real" this is at all. Tomorrow, I'll try a different site and see what happens. Thank you.
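
      One possible explanation for the missing banner and styled text: the page is being shown from a temporary local file, so any relative links to images or stylesheets resolve against file:///C:/cygwin64/tmp/ rather than the original site. A minimal sketch of a workaround, assuming that is the cause, is to insert a <base href> element into the fetched HTML before displaying it. The helper name add_base_href is hypothetical and not part of HTML::Display or LWP, and the regex-based insertion is a shortcut rather than real HTML parsing.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): insert <base href="..."> so that
# relative links in a locally saved copy resolve against the original site.
# Assumes $base contains no double quotes.
sub add_base_href {
    my ($html, $base) = @_;

    # Leave the document alone if it already declares a base URL
    return $html if $html =~ /<base\b/i;

    my $tag = qq{<base href="$base">};

    # Put the tag right after the opening <head ...> tag if there is one;
    # otherwise fall back to prepending it to the document
    $html =~ s/(<head\b[^>]*>)/$1$tag/i
        or $html = $tag . $html;

    return $html;
}

my $page = '<html><head><title>demo</title></head><body><img src="/logo.png"></body></html>';
print add_base_href($page, 'https://berniesanders.com/issues/racial-justice/'), "\n";
```

      In the script above, this would mean passing add_base_href($content, $url) to display instead of $content. It does not change the fact that the browser is showing a local copy, which is probably also why the back and forward buttons behave oddly.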

Re^3: getting content of an https website
by Anonymous Monk on Sep 07, 2015 at 17:46 UTC
    system($url); will usually do it. Depending on how paranoid you are, you might want to ensure that only properly encoded strings are executed.
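
    A somewhat more defensive variant of that idea, as a rough sketch: check the URL against a whitelist of characters first, then call the browser with the list form of system() so that the URL is passed as a single argument rather than as part of one command string. The helper name browser_command and the --open switch are made up for illustration; the Chrome path is the one quoted earlier in the thread and may need adjusting (for example, to a /cygdrive/c/... path under Cygwin).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the argument list for
# system(LIST), or dies if the URL does not look like a plain http(s) URL.
sub browser_command {
    my ($browser, $url) = @_;

    # Accept only http/https URLs built from characters that RFC 3986 allows
    # in a URI; this rejects spaces, double quotes, pipes, backticks and so on.
    die "refusing suspicious URL: $url\n"
        unless $url =~ m{\Ahttps?://[A-Za-z0-9\-._~:/?#\[\]\@!\$&'()*+,;=%]+\z};

    return ($browser, $url);
}

# Path taken from the directory listing earlier in the thread (an assumption
# about the local install)
my $chrome = 'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe';
my $url    = 'https://berniesanders.com/issues/racial-justice/';
my @cmd    = browser_command($chrome, $url);

# Only launch when asked to, so the script can be checked without a browser
if ( @ARGV && $ARGV[0] eq '--open' ) {
    system(@cmd) == 0
        or warn "could not launch browser (status $?)\n";
}
else {
    print "would run: @cmd\n";
}
```

    Run it with --open to start the browser; without the switch it only prints the command it would run.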
