PerlMonks  

LWP and Mechanize

by perlmike (Initiate)
on May 21, 2022 at 19:34 UTC ( #11144051=perlquestion )

perlmike has asked for the wisdom of the Perl Monks concerning the following question:

I used to use LWP and Mechanize to download files from the internet, but they stopped working today. Does anyone know what's going on? Thanks!

use strict;
use warnings;
use LWP::UserAgent();

my $ua = LWP::UserAgent->new(timeout => 10);
$ua->env_proxy;

my $response = $ua->get('https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.idx');

open(OUT, ">" . "master") or die "Cannot open master";
if ($response->is_success) {
    print OUT $response->decoded_content;
}
else {
    die $response->status_line;
}
close OUT;

Replies are listed 'Best First'.
Re: LWP and Mechanize
by pryrt (Monsignor) on May 21, 2022 at 20:44 UTC
    Have you checked the TOS for that site?

    When I tried the script you showed, it gave me a 403 Forbidden error. When I checked with Chrome, it downloaded fine. When I tried a curl -v https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.idx, it was a bit more specific:

    < HTTP/1.1 403 Forbidden
    < Server: AkamaiGHost
    < Mime-Version: 1.0
    < Content-Length: 4793
    < Cache-Control: no-cache, no-store, must-revalidate
    < Pragma: no-cache
    < Expires: 0
    < Content-Type: text/html
    < Date: Sat, 21 May 2022 20:40:48 GMT
    < Connection: keep-alive
    < Strict-Transport-Security: max-age=31536000 ; includeSubDomains ; preload
    ...
    <title>SEC.gov | Request Rate Threshold Exceeded</title>
    ...
    <h1>Your Request Originates from an Undeclared Automated Tool</h1>
    <p>To allow for equitable access to all users, SEC reserves the right to limit requests originating from undeclared automated tools. Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic.</p>
    <p>Please declare your traffic by updating your user agent to include company specific information.</p>
    ...
    <p>For best practices on efficiently downloading information from SEC.gov, including the latest EDGAR filings, visit <a href="https://www.sec.gov/developer" target="_blank">sec.gov/developer</a>. You can also <a href="https://public.govdelivery.com/accounts/USSEC/subscriber/new?topic_id=USSEC_260" target="_blank">sign up for email updates</a> on the SEC open data program, including best practices that make it more efficient to download data, and SEC.gov enhancements that may impact scripted downloading processes. For more information, contact <a href="mailto:opendata@sec.gov">opendata@sec.gov</a>.</p>
    <p>For more information, please see the SEC's <a href="#internet">Web Site Privacy and Security Policy</a>. Thank you for your interest in the U.S. Securities and Exchange Commission.
    <p>Reference ID: 0.9db31bb8.1653165648.37b3e960</p>

    Basically, you need to make sure you are following their TOS in terms of load limits, and define a user-agent string that meets their rules. (Or if you want to risk violating the SEC's rules, use a user-agent string that mimics a browser's string without looking up what their rules are ↗). Both LWP::UserAgent and WWW::Mechanize allow setting the user agent, and document how to do so.


    ↗: Looks like LanX determined that wouldn't work in id://11144056, which wasn't there when I started writing my post.
    edit 2: you could have seen the full error message yourself if you had checked for content as well as status during the else condition, like else {die $response->status_line . ($response->content||'');}
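
As a rough illustration of the advice above, here is a minimal sketch of declaring a user agent with LWP::UserAgent. The helper names build_declared_ua and fetch_to_file are hypothetical, the company name and email address are placeholders, and the "Name contact@domain" layout is an assumption: check sec.gov/developer for the format the SEC actually requires. Since WWW::Mechanize is a subclass of LWP::UserAgent, the same agent method should apply there too.

```perl
use strict;
use warnings;
use LWP::UserAgent ();

# Hypothetical helper: build a user agent whose User-Agent header declares
# who is making the requests. The "Name contact@domain" layout is an
# assumption; confirm the required format at sec.gov/developer.
sub build_declared_ua {
    my ($company, $contact_email) = @_;
    my $ua = LWP::UserAgent->new(timeout => 10);
    $ua->env_proxy;
    $ua->agent("$company $contact_email");    # replaces the default libwww-perl string
    return $ua;
}

# Hypothetical helper: download $url to $path, dying with the status line
# and the response body if the server refuses the request.
sub fetch_to_file {
    my ($ua, $url, $path) = @_;
    my $response = $ua->get($url);
    die $response->status_line . "\n" . ($response->content // '') . "\n"
        unless $response->is_success;

    # Open the file only after a successful response, so a failed request
    # does not leave an empty file behind.
    open(my $out, '>:raw', $path) or die "Cannot open $path: $!";
    # charset => 'none' undoes any Content-Encoding but keeps the raw bytes.
    print {$out} $response->decoded_content(charset => 'none') // $response->content;
    close($out) or die "Cannot close $path: $!";
    return $path;
}
```

Calling fetch_to_file(build_declared_ua('Sample Company Name', 'admin@example.com'), $url, 'master') with the URL from the original script would then attempt the download. Whether the request is accepted still depends on the site's rules and rate limits.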

      This is very helpful! Thank you very much. How to print out content during the else condition?

        How to print out content during the else condition?

        That was in my edit2 section: else {die $response->status_line . ($response->content||'');}

        Or, putting it into your whole script:

        use strict;
        use warnings;
        use LWP::UserAgent();

        my $ua = LWP::UserAgent->new(timeout => 10);
        $ua->env_proxy;

        my $response = $ua->get('https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.idx');

        open(OUT, ">" . "master") or die "Cannot open master";
        if ($response->is_success) {
            print OUT $response->decoded_content;
        }
        else {
            die $response->status_line . ($response->content||'');
        }
        close OUT;
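
To see what that die message looks like without hitting the server, here is a small offline sketch that builds a fake 403 response with HTTP::Response. The helper failure_message and its length cap are hypothetical, and the body is just the heading quoted from the curl output earlier in the thread.

```perl
use strict;
use warnings;
use HTTP::Response ();

# Hypothetical helper: combine the status line with a prefix of the body so
# the server's explanation shows up in the error message.
sub failure_message {
    my ($response, $max_chars) = @_;
    $max_chars //= 2000;    # arbitrary cap, chosen only for this sketch
    my $body = $response->content // '';
    return $response->status_line . "\n" . substr($body, 0, $max_chars);
}

# Build a fake 403 response locally; no network access is involved.
my $fake = HTTP::Response->new(
    403, 'Forbidden',
    [ 'Content-Type' => 'text/html' ],
    '<h1>Your Request Originates from an Undeclared Automated Tool</h1>',
);

print failure_message($fake), "\n" unless $fake->is_success;
```

Putting a newline between the status line and the body keeps the two from running together, which the plain concatenation in the script above does not do.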
Re: LWP and Mechanize
by LanX (Sage) on May 21, 2022 at 20:19 UTC
    Why don't you try to open that URL in a browser?

    update

    Sorry, never mind!

    Selecting the URL and opening it with right-click in FF results in a permission denied, but that's because perlmonks inserted invisible word-break code <wbr> for wrapping.

    No problem opening the URL when I take the raw code.

    Cheers Rolf
    (addicted to the Perl Programming Language :)

Re: LWP and Mechanize
by Anonymous Monk on May 21, 2022 at 19:45 UTC
    "stopped working"??
      OK let's do the OP's homework:

      Running that script leads to a 403 Forbidden while the url can be opened via FF.

      FWIW: I changed the UserAgent to the one from FF, but to no avail.

      • Maybe missing HTTPS certificate?
      • Maybe some more elaborate bot detection?
      • Maybe ...
      Anyway not a Perl problem, changing the URL to "https://perlmonks.org/" works.

      Cheers Rolf
      (addicted to the Perl Programming Language :)

      Yes, I could use the code to download files last year, but I couldn't download any files today. Not sure what's wrong.
