LWP::Simple on HTTPS sites

by rbhyland (Acolyte)
on Feb 09, 2017 at 19:29 UTC

rbhyland has asked for the wisdom of the Perl Monks concerning the following question:

I have been using the following code for years and it recently stopped working. As near as I can tell it's because the site now has an https prefix. Can you show me how to tweak this program to get it working again? Thanks in advance!

#!"C:\xampp\perl\bin\perl.exe" use strict; use LWP::Simple; use CGI qw(:standard :cgi-lib); use CGI::Carp qw(fatalsToBrowser warningsToBrowser); my $current; my $currentUrl; my $title; my $alt; my $cgi = new CGI; print $cgi->header(); print start_html(-title =>'Save XKCD'); # Set Specifics my $sitePrefix = "https://xkcd.com/"; #my $sitePrefix = "http://www.google.com/"; ## Path to main XKCD directory ## my $path = "c:/Comics"; mkdir "$path/xkcd", 0755 or print "$path/xkcd Directory Exists\n",br; chomp($path = "$path/xkcd"); my $d = get("$sitePrefix"); if (!is_success($d)) { print "$d is not defined",br; } else { print "[ $d ]",br; } my $status; my $content; print "status = $status",br,"Content = $content",br; if ($d =~ /https:\/\/xkcd.com\/(\d+)\//) { $current = $1; print "Current = $current",br,"SitePrefix = $sitePrefix",br; } else { print "Permanent link not found",br; print "sitePrefix - ",$sitePrefix,br; print "\$d - [",$d,"]",br; } # Obtains all individual comic data sub getComicData { my $siteData = get("$sitePrefix$current/"); my @data = split /\n/, $siteData; foreach (@data) { if (/http:\/\/xkcd.com\/(\d+)\//) { $current = $1; } if ((/src="(http:\/\/imgs.xkcd.com\/comics\/.+\.\w{3})"/) || (/src="(\/\/imgs.xkcd.com\/comics\/.+\.\w{3})"/) ) { $currentUrl = $1; print "CurrentUrl = $currentUrl",br; if (/alt="(.+?)"/) { $title = $1; $title = "House of Pancakes" if $current == 472; # Co +lor title on comic 472 with weird syntax print "Title = $title",br; } if (/title="(.+?)"/) { #title commonly know as 'alt' te +xt $alt = $1; print "Alt = $alt",br; } } } } chdir "$path" or die "Cannot change directory: $!"; &getComicData(); while ( get("$sitePrefix$current/")){ print "Writing Files $current: $title\n",br,"CurrentUrl = $current +Url",br,br; # Create directories for individual comics mkdir "$current $title", 0755 or die "Previously Downloaded"; chdir "$path/$current $title" or die "Cannot change directory: $!" +; # Save image file if (index($currentUrl,"http") != 0) { $currentUrl = "http:".$currentUrl; } my $image = get($currentUrl); open my $IMAGE, '>>', "$title.png" or die "Cannot create image file!"; binmode($IMAGE); print $IMAGE $image; close $IMAGE; # Save alt text open my $TXT, '>>', "$title ALT.txt" or die "Cannot create text file!"; print $TXT $alt; close $TXT; chdir "$path" or die "Cannot change directory: $!"; $current--; # Check for non existent 404 comic $current-- if $current == 404; &getComicData(); } # End Gracefully print "Download Complete\n"; print end_html;

I have tried switching to LWP::UserAgent, but I still get the error.
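
A stripped-down test along these lines should show why the fetch fails (a minimal sketch assuming only plain LWP is installed; the status line carries the reason):

# Fetch the same URL with LWP::UserAgent and print the status line.
# A typical failure is "501 Protocol scheme 'https' is not supported",
# which means LWP::Protocol::https (and its SSL dependencies) is missing.
use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('https://xkcd.com/');
print $res->is_success ? "OK\n" : "Failed: " . $res->status_line . "\n";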

Replies are listed 'Best First'.
Re: LWP::Simple on HTTPS sites ( WWW::Mechanize )
by Anonymous Monk on Feb 10, 2017 at 03:28 UTC

    Hi,

    Step one to fixing this is: forget the program exists, and define your goals.

    For example, mirror the title/alt/image of all xkcd, so that would be:

    - get xkcd page
    - extract info ( id title alt text image next )
    - save files 
    - repeat with next
    

    Next up, tweak the goals a bit; be nice:

    - get page if not already exist
    - extract info ( id title alt text image next ) and de-html-textify
    - save files with safe filenames 
    - repeat with next
    - wait and/or quit: when done, quit; when the limit is reached, wait or quit until next time; keep track of progress
    

    Next, write (code) the program of goals:

    save_xkcd( 'outdir', 'startingid' );

    sub save_xkcd {
        $starting_id ||= id_from_progress();
        my @ids = $starting_id;
        while( @ids ) {
            my $cid  = shift @ids;
            my $page = sprintf '...%s', $cid;
            $mech->get( $page );
            save_stuff( $mech, $cid );
            next_page( $mech, \@ids );
            maybe_sleep();
        }
    }

    Now all you do is fill in the blanks

    No need for CGI in this equation; CGI doesn't like near-infinite loops anyway.

    $mech->title gets you de-htmld text like   xkcd: House of Pancakes

    HTML::TreeBuilder::XPath gets you the alt/title text with an XPath query of '//img/@title', and the next link with a query of '//a[@rel="next"]'

    Or  $mech->find_link( text_regex => qr/next/i );

    Yes, you could fix up your program by replacing LWP::Simple with mech ... but that's not exactly fun now, is it? :)

Re: LWP::Simple on HTTPS sites
by poj (Abbot) on Feb 09, 2017 at 20:06 UTC

    Why the extra bracket ?

    my $d = get("($sitePrefix");

    poj

      Thanks for pointing that out. It's an error, but fixing it doesn't fix my problem.

        Here is my current output:

        c:/Comics/xkcd Directory Exists
         is not defined
        status =
        Content =
        Permanent link not found
        sitePrefix - https://xkcd.com/
        $d - []
        Download Complete
Re: LWP::Simple on HTTPS sites
by nysus (Parson) on Feb 10, 2017 at 05:33 UTC

    I recently had a similar issue that caused any module (like LWP::UserAgent) relying on outdated versions of lower-level modules like IO::Socket::SSL and Net::SSLeay to break when fetching https sites. Once I upgraded my Debian install, the problem was fixed.

    So check your version of Perl and see if there is an upgrade available for these low-level modules. As you are on Windows, I can't help much there.
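
    Something like these one-liners will show what you currently have installed (quoting shown for cmd.exe since you are on Windows; the module list is just the usual suspects):

    perl -MLWP -e "print $LWP::VERSION"
    perl -MIO::Socket::SSL -e "print $IO::Socket::SSL::VERSION"
    perl -MNet::SSLeay -e "print $Net::SSLeay::VERSION"
    perl -MLWP::Protocol::https -e "print $LWP::Protocol::https::VERSION"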

    $PM = "Perl Monk's";
    $MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate";
    $nysus = $PM . ' ' . $MCF;

      Thanks for the clue. After some more research along that line, I fixed it by changing my GET code to look like this:
      $ENV{'PERL_LWP_SSL_VERIFY_HOSTNAME'} = 0;
      my $ua = LWP::UserAgent->new(
          ssl_opts => {
              verify_hostname => 0,
              SSL_verify_mode => IO::Socket::SSL::SSL_VERIFY_NONE,
          },
      );
      my $req = HTTP::Request->new( GET => 'https://xkcd.com/' );
      my $response = $ua->request($req);
      my $content = $response->content;
      my $sitePrefix = 'https://xkcd.com/';
      Now it works again. Thanks everyone!
        So the result of your research is that you disable any kind of certificate validation? Since you only access comics this is probably not a big deal, but for anything serious it would be a bad idea, since you effectively disable the protection HTTPS offers.
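
        The usual fix is to install LWP::Protocol::https (which pulls in IO::Socket::SSL and a CA bundle via Mozilla::CA) and keep verification on. A minimal sketch, assuming those modules are installed:

        # verify_hostname is on by default, so the certificate is
        # actually checked against the Mozilla::CA bundle
        use strict;
        use warnings;
        use LWP::UserAgent;

        my $ua = LWP::UserAgent->new;
        my $response = $ua->get('https://xkcd.com/');
        die "Fetch failed: " . $response->status_line
            unless $response->is_success;
        my $content = $response->decoded_content;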
