LWP::Simple on HTTPS sites

by rbhyland (Acolyte)
on Feb 09, 2017 at 19:29 UTC

rbhyland has asked for the wisdom of the Perl Monks concerning the following question:

I have been using the following code for years and it recently stopped working. As near as I can tell it's because the site now has an https prefix. Can you show me how to tweak this program to get it working again? Thanks in advance!

#!"C:\xampp\perl\bin\perl.exe" use strict; use LWP::Simple; use CGI qw(:standard :cgi-lib); use CGI::Carp qw(fatalsToBrowser warningsToBrowser); my $current; my $currentUrl; my $title; my $alt; my $cgi = new CGI; print $cgi->header(); print start_html(-title =>'Save XKCD'); # Set Specifics my $sitePrefix = "https://xkcd.com/"; #my $sitePrefix = "http://www.google.com/"; ## Path to main XKCD directory ## my $path = "c:/Comics"; mkdir "$path/xkcd", 0755 or print "$path/xkcd Directory Exists\n",br; chomp($path = "$path/xkcd"); my $d = get("$sitePrefix"); if (!is_success($d)) { print "$d is not defined",br; } else { print "[ $d ]",br; } my $status; my $content; print "status = $status",br,"Content = $content",br; if ($d =~ /https:\/\/xkcd.com\/(\d+)\//) { $current = $1; print "Current = $current",br,"SitePrefix = $sitePrefix",br; } else { print "Permanent link not found",br; print "sitePrefix - ",$sitePrefix,br; print "\$d - [",$d,"]",br; } # Obtains all individual comic data sub getComicData { my $siteData = get("$sitePrefix$current/"); my @data = split /\n/, $siteData; foreach (@data) { if (/http:\/\/xkcd.com\/(\d+)\//) { $current = $1; } if ((/src="(http:\/\/imgs.xkcd.com\/comics\/.+\.\w{3})"/) || (/src="(\/\/imgs.xkcd.com\/comics\/.+\.\w{3})"/) ) { $currentUrl = $1; print "CurrentUrl = $currentUrl",br; if (/alt="(.+?)"/) { $title = $1; $title = "House of Pancakes" if $current == 472; # Co +lor title on comic 472 with weird syntax print "Title = $title",br; } if (/title="(.+?)"/) { #title commonly know as 'alt' te +xt $alt = $1; print "Alt = $alt",br; } } } } chdir "$path" or die "Cannot change directory: $!"; &getComicData(); while ( get("$sitePrefix$current/")){ print "Writing Files $current: $title\n",br,"CurrentUrl = $current +Url",br,br; # Create directories for individual comics mkdir "$current $title", 0755 or die "Previously Downloaded"; chdir "$path/$current $title" or die "Cannot change directory: $!" +; # Save image file if (index($currentUrl,"http") != 0) { $currentUrl = "http:".$currentUrl; } my $image = get($currentUrl); open my $IMAGE, '>>', "$title.png" or die "Cannot create image file!"; binmode($IMAGE); print $IMAGE $image; close $IMAGE; # Save alt text open my $TXT, '>>', "$title ALT.txt" or die "Cannot create text file!"; print $TXT $alt; close $TXT; chdir "$path" or die "Cannot change directory: $!"; $current--; # Check for non existent 404 comic $current-- if $current == 404; &getComicData(); } # End Gracefully print "Download Complete\n"; print end_html;

I have tried switching to LWP::UserAgent, but I still get the error.
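
A stripped-down test along these lines should show why the fetch fails (a minimal sketch assuming only plain LWP is installed; the status line carries the reason):

# Fetch the same URL with LWP::UserAgent and print the status line.
# A typical failure is "501 Protocol scheme 'https' is not supported",
# which means LWP::Protocol::https (and its SSL dependencies) is missing.
use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('https://xkcd.com/');
print $res->is_success ? "OK\n" : "Failed: " . $res->status_line . "\n";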

Replies are listed 'Best First'.
Re: LWP::Simple on HTTPS sites ( WWW::Mechanize )
by Anonymous Monk on Feb 10, 2017 at 03:28 UTC

    Hi,

    Step one to fixing this is: forget the program exists, and define your goals.

    For example, mirror the title/alt/image of all xkcd, so that would be:

    - get xkcd page
    - extract info ( id title alt text image next )
    - save files 
    - repeat with next
    

    Next up, tweak the goals a bit; be nice:

    - get page if not already exist
    - extract info ( id title alt text image next ) and de-html-textify
    - save files with safe filenames 
    - repeat with next
    - wait and/or quit: when done, quit; when the limit is reached, wait or quit until next time; keep track of progress
    

    Next, write (code) the program of goals:

    save_xkcd( 'outdir', 'startingid' );

    sub save_xkcd {
        $starting_id ||= id_from_progress();
        my @ids = $starting_id;
        while( @ids ) {
            my $cid  = shift @ids;
            my $page = sprintf '...%s', $cid;
            $mech->get( $page );
            save_stuff( $mech, $cid );
            next_page( $mech, \@ids );
            maybe_sleep();
        }
    }

    Now all you do is fill in the blanks

    No need for CGI in this equation; CGI doesn't like near-infinite loops anyway.

    $mech->title gets you de-htmld text like   xkcd: House of Pancakes

    HTML::TreeBuilder::XPath gets you the alt/title text with an XPath query of '//img/@title', and the next link with a query of '//a[@rel="next"]'

    Or  $mech->find_link( text_regex => qr/next/i );

    Yes, you could fix up your program by replacing LWP::Simple with mech ... but that's not exactly fun now, is it? :)

Re: LWP::Simple on HTTPS sites
by poj (Abbot) on Feb 09, 2017 at 20:06 UTC

    Why the extra bracket ?

    my $d = get("($sitePrefix");

    poj

      Thanks for pointing that out. It's an error, but fixing it doesn't fix my problem.

        Here is my current output:

        c:/Comics/xkcd Directory Exists
         is not defined
        status =
        Content =
        Permanent link not found
        sitePrefix - https://xkcd.com/
        $d - []
        Download Complete
Re: LWP::Simple on HTTPS sites
by nysus (Parson) on Feb 10, 2017 at 05:33 UTC

    I recently had a similar issue that caused any module (like LWP::UserAgent) relying on outdated versions of lower-level modules like IO::Socket::SSL and Net::SSLeay to break when fetching https sites. Once I upgraded my Debian install, the problem was fixed.

    So check your version of Perl and see if there is an upgrade available for these low-level modules. As you are on Windows, I can't help much there.
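
    Something like these one-liners will show what you currently have installed (quoting shown for cmd.exe since you are on Windows; the module list is just the usual suspects):

    perl -MLWP -e "print $LWP::VERSION"
    perl -MIO::Socket::SSL -e "print $IO::Socket::SSL::VERSION"
    perl -MNet::SSLeay -e "print $Net::SSLeay::VERSION"
    perl -MLWP::Protocol::https -e "print $LWP::Protocol::https::VERSION"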

    $PM = "Perl Monk's";
    $MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate";
    $nysus = $PM . ' ' . $MCF;

      Thanks for the clue. After some more research along that line, I fixed it by changing my GET code to look like this:
      $ENV{'PERL_LWP_SSL_VERIFY_HOSTNAME'} = 0;
      my $ua = LWP::UserAgent->new(
          ssl_opts => {
              verify_hostname => 0,
              SSL_verify_mode => IO::Socket::SSL::SSL_VERIFY_NONE,
          },
      );
      my $req = HTTP::Request->new( GET => 'https://xkcd.com/' );
      my $response = $ua->request($req);
      my $content = $response->content;
      my $sitePrefix = 'https://xkcd.com/';
      Now it works again. Thanks everyone!
        So the result of your research is that you disable any kind of certificate validation? Since you only access comics this is probably not a big deal, but for anything serious it would be a bad idea, since you effectively disable the protection HTTPS offers.
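
        The usual fix is to install LWP::Protocol::https (which pulls in IO::Socket::SSL and a CA bundle via Mozilla::CA) and keep verification on. A minimal sketch, assuming those modules are installed:

        # verify_hostname is on by default, so the certificate is
        # actually checked against the Mozilla::CA bundle
        use strict;
        use warnings;
        use LWP::UserAgent;

        my $ua = LWP::UserAgent->new;
        my $response = $ua->get('https://xkcd.com/');
        die "Fetch failed: " . $response->status_line
            unless $response->is_success;
        my $content = $response->decoded_content;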
