PerlMonks  

Using LWP (or some other module) to Download HTML with Cookie Session ID

by Leudwinus (Scribe)
on Nov 28, 2022 at 02:58 UTC ([id://11148411])

Leudwinus has asked for the wisdom of the Perl Monks concerning the following question:

Hello Fellow Monks,

I am trying to download the HTML of a URL which requires authentication (username and password). Using curl, I am able to successfully do this via:

curl https://example.com --cookie "session=536...035a" -o "HTML_output.txt"

where the session ID is the one I obtained via the browser's developer tools.

I am trying to replicate this with a more "native" Perl approach. Using the LWP module, I have the following code, which gets the HTML of a given URL that does not require any authentication:

use LWP;

my $url     = "https://example.com";
my $browser = LWP::UserAgent->new( );
my $resp    = $browser->get($url);
print $resp->content, "\n";

Is there a way using LWP (or some other module) to also include the session ID from the browser cookie?

Thank you in advance.

As a quick update, I have modified my code to use HTTP::Cookies::Netscape as follows:

use strict;
use warnings;
use LWP;
use HTTP::Cookies::Netscape;

my $url = "https://example.com";

my $cookie_jar = HTTP::Cookies::Netscape->new(
    file => "/path/to/cookies.txt",
);

my $browser = LWP::UserAgent->new( );
$browser->cookie_jar( $cookie_jar );

my $resp = $browser->get($url);
print $resp->content, "\n";

Unfortunately, when I run that, the response that comes back tells me that I still need to supply a proper username and password, even though my session is still valid in the browser. I used the Firefox add-on "cookies.txt" to download a cookies.txt file, which looks like:

# Netscape HTTP Cookie File
.example.com TRUE / TRUE 1984932267 session 5361...1329
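One way to narrow this down without touching the network is to ask a cookie jar which Cookie header it would attach to a request. The following is only a minimal sketch and not code from the thread: it fills a plain HTTP::Cookies jar by hand, 'PLACEHOLDER' stands in for the real session ID, the helper name cookie_header_for is made up, and the host www.example.com is an assumption added for comparison. Whether a ".example.com" entry also matches the bare host example.com is worth checking on the installed version rather than assuming.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;

# In-memory jar with one hand-made cookie.
# 'PLACEHOLDER' is an assumption standing in for the real session ID.
my $jar = HTTP::Cookies->new;

# Arguments: version, key, value, path, domain, port,
#            path_spec, secure, maxage (seconds), discard
$jar->set_cookie( 0, 'session', 'PLACEHOLDER', '/', '.example.com',
                  undef, 1, 1, 3600, 0 );

# Hypothetical helper: return the Cookie header the jar would add
# for a URL, or undef if no cookie in the jar matches.
sub cookie_header_for {
    my ( $cookie_jar, $url ) = @_;
    my $req = HTTP::Request->new( GET => $url );
    $cookie_jar->add_cookie_header($req);
    return $req->header('Cookie');
}

# Compare a subdomain host with the bare host; no request is sent.
for my $url ( 'https://www.example.com/', 'https://example.com/' ) {
    my $header = cookie_header_for( $jar, $url );
    printf "%-28s %s\n", $url,
        defined $header ? "Cookie: $header" : '(no cookie sent)';
}
```

To test the real file, the hand-made jar can be swapped for the HTTP::Cookies::Netscape object created earlier. If the helper returns undef for the URL being fetched, the session cookie is never sent, which would explain a login prompt.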

Re: Using LWP (or some other module) to Download HTML with Cookie Session ID
by Corion (Patriarch) on Nov 28, 2022 at 07:36 UTC

    An easy way to get a start is by converting your Curl command to a Perl program using (my) Curl to Perl converter:

    #!perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new( 'send_te' => '0' );
    my $r  = HTTP::Request->new(
        'GET' => 'https://example.com/',
        [
            'Accept'     => '*/*',
            'User-Agent' => 'curl/7.55.1',
            'Cookie'     => 'session=536...035a'
        ],
    );
    my $res = $ua->request( $r, ':content_file' => 'HTML_output.txt' );

    __END__

    Created from curl command line
    curl https://example.com --cookie "session=536...035a" -o "HTML_output.txt"

    I'm not sure if your Firefox stores its cookies in a cookies.txt file still or if it stores them in an SQLite database nowadays.

      Thank you, Corion for that example and link to your page which I have bookmarked!

      It looks like the content I need is then stored in $res->{_content}. I had to first install LWP::UserAgent and LWP::Protocol::https to get this to work though.

      I guess I need to get more familiar with the LWP, LWP::UserAgent and HTTP::Request modules.

      What is the difference between LWP::UserAgent and Mojo::UserAgent, the latter of which was also suggested to me as a way to solve this?

      Thanks again!

        No - please use $res->decoded_content instead of reaching into the HTTP::Response hash.
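The difference between the two accessors can be seen without a network connection. The sketch below is an illustration only: it builds an HTTP::Response by hand (the HTML snippet and the UTF-8 charset are made-up assumptions) and compares the raw bytes from content with the character string from decoded_content.

```perl
use strict;
use warnings;
use HTTP::Response;
use Encode qw(encode);

# Hand-built response standing in for what a user agent would return.
# The body is made up: "<p>caf\x{e9}</p>" encoded as UTF-8 bytes.
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    encode( 'UTF-8', "<p>caf\x{e9}</p>" ),
);

my $raw  = $res->content;          # undecoded bytes as received
my $text = $res->decoded_content;  # characters, decoded per the charset

# The accented letter takes two bytes in UTF-8 but is one character.
printf "content:         %d bytes\n",      length $raw;
printf "decoded_content: %d characters\n", length $text;
```

Both calls are public methods, so neither requires looking inside the response hash. On a response returned by $ua->request(...) or $ua->get(...), the same decoded_content call applies.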

        The difference between LWP::UserAgent and Mojo::UserAgent is mostly that LWP::UserAgent doesn't allow for parallel requests. Mojo::UserAgent is a bit more complex due to allowing for parallelism, but it also offers some more convenience stuff like handling JSON replies directly.
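As a rough sketch of what this looks like on the Mojo::UserAgent side, assuming a reasonably recent Mojolicious is installed: the code below builds (but does not send) a request carrying the session cookie, and defines a helper that would fetch several URLs concurrently using promises. The helper name fetch_all and the 'PLACEHOLDER' cookie value are assumptions for illustration and do not come from the thread.

```perl
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::Promise;

my $ua     = Mojo::UserAgent->new;
my $cookie = 'session=PLACEHOLDER';   # assumption: stands in for the real session ID

# Build a transaction without sending it, to inspect the outgoing headers.
my $tx = $ua->build_tx( GET => 'https://example.com/' => { Cookie => $cookie } );
print "Cookie header: ", $tx->req->headers->cookie, "\n";

# To actually send it (blocking) and read the decoded body:
#   my $html = $ua->start($tx)->result->text;

# Hypothetical helper: fetch several URLs concurrently and return
# their decoded bodies in the same order as the URLs.
sub fetch_all {
    my @urls = @_;
    my @promises = map { $ua->get_p( $_ => { Cookie => $cookie } ) } @urls;
    my @bodies;
    Mojo::Promise->all(@promises)
        ->then( sub { @bodies = map { $_->[0]->result->text } @_ } )
        ->catch( sub { warn "request failed: @_\n" } )
        ->wait;
    return @bodies;
}
```

For a single page, the blocking form in the comment is about as short as the LWP version. The promise-based helper only pays off when several pages are fetched at once.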

Re: Using LWP (or some other module) to Download HTML with Cookie Session ID
by Discipulus (Canon) on Nov 29, 2022 at 11:43 UTC

      I saw that thread earlier today! I am following it as best I can. Apologies that I don't know enough yet to help you but I will be interested in what you find regarding interrogating Firefox cookies directly via SQLite.
