bobf has asked for the wisdom of the Perl Monks concerning the following question:

I'm having trouble accessing a MediaWiki site using Perl, but I can use the same URL in a browser and it appears to work.

Minimal test case:

use strict;
use warnings;
use Data::Dumper;
use MediaWiki::API;

my $mw = MediaWiki::API->new();
$mw->{config}->{api_url} = 'https://cabig-kc.nci.nih.gov/Vocab/KC/api.php';

# modify the LWP::UserAgent object so it looks like a browser
$mw->{ua}->agent( 'Mozilla/5.0' );

# https://cabig-kc.nci.nih.gov/Vocab/KC/api.php?action=query&list=allpages&aplimit=max
my $titles = $mw->api( { action => 'query', list => 'allpages', aplimit => 'max' } )
    || die $mw->{error}->{code} . ': ' . $mw->{error}->{details};

The output I get is:

2: 403 Forbidden : error occurred when accessing https://cabig-kc.nci.nih.gov/Vocab/KC/api.php

I used a browser to access the URL above (see comment in example code) and the results of the query were displayed.

I also tried the equivalent procedure using LWP::UserAgent, but received the same 403 forbidden error as above.

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
$ua->agent( 'Mozilla/5.0' );

my $url = 'https://cabig-kc.nci.nih.gov/Vocab/KC/api.php?action=query&list=allpages&aplimit=max';
my $response = $ua->get( $url );

if( $response->is_success ) {
    print $response->content;
}
else {
    print "Error: " . $response->status_line . "\n";
}

Any suggestions?

TIA

Re: Error accessing MediaWiki API
by holli (Abbot) on Mar 27, 2009 at 16:29 UTC
    You need to install Crypt::SSLeay.


    holli

    When you're up to your ass in alligators, it's difficult to remember that your original purpose was to drain the swamp.

      Ah, yes. I didn't realize when the site was rebuilt that they changed it to https.

      I've already got Crypt::SSLeay version 0.57 installed. No joy.

        There is a Perl module that provides access to this thing. Have you tried it?
        http://search.cpan.org/~exobuzz/MediaWiki-API/

        Update: Oh, I see you already have that one. I also tried with LWP, enabling cookies and a few other things. I can log in to other secure sites, but something is weird about this one. I don't know why.

        Update: I still haven't figured this out, but there appear to be a couple of flavors of SSL: SSL2 and SSL3. SSLeay-0.81 has SSL3. Crypt-SSLeay 0.061 requires SSLeay 0.81. I have Crypt::SSLeay 0.57. I am clueless as to how to get these later versions for Perl, or whether they are even necessary, but I thought I'd mention the possibility as it may help others "in the hunt".

Re: Error accessing MediaWiki API
by bobf (Monsignor) on Mar 28, 2009 at 20:08 UTC

    OK, here are more pieces to the puzzle. The issue is not yet resolved so I would still appreciate input on this.

    I did more searches through MediaWiki docs and ultimately ended up on the MediaWiki IRC channel (http://www.mediawiki.org/wiki/MediaWiki_on_IRC). With the help from those guys, I found the following:

    • The query shown in the test code in the parent node (my $titles = $mw->api(...)) works. It was verified against two test sites:
      # $mw->{config}->{api_url} = 'http://test.wikipedia.org/w/api.php';
      # $mw->{config}->{api_url} = 'https://secure.wikimedia.org/wikipedia/test/w/api.php';
      Therefore, I don't think the issue is with either the module or the API call.
    • As mentioned in the parent node, the query URL works fine when accessed from a browser (specifically, Firefox 3.0.5).
    • I compared the headers from the browser (captured via Live HTTP Headers) with those in the $mw object (via Data::Dumper). Apart from the extra detail about the SSL certificate in the Data::Dumper output, they looked equivalent. The only difference that stood out to me was that the Perl code used the POST method while the browser used GET.
    • After examining the output of Dumper( $mw ) I noticed that while the HTTP::Response object contained only the 403 error shown in the parent node (and the stack trace contained no new information), the content of the returned page was not null and may be significant:
      '_content' => '<?xml version="1.0" encoding="ISO-8859-1"?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
      <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
      <head>
      <title>Access forbidden!</title>
      <link rev="made" href="mailto:you@example.com" />
      <style type="text/css"><!--/*--><![CDATA[/*><!--*/
          body { color: #000000; background-color: #FFFFFF; }
          a:link { color: #0000CC; }
          p, address {margin-left: 3em;}
          span {font-size: smaller;}
      /*]]>*/--></style>
      </head>
      <body>
      <h1>Access forbidden!</h1>
      <p>
          You don\'t have permission to access the requested object.
          It is either read-protected or not readable by the server.
      </p>
      <p>
      If you think this is a server error, please contact
      the <a href="mailto:you@example.com">webmaster</a>.
      </p>
      <h2>Error 403</h2>
      <address>
        <a href="/">cabig-kc.nci.nih.gov</a><br />
        <span>Sat Mar 28 02:18:28 2009<br />
        Apache</span>
      </address>
      </body>
      </html>
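
    For anyone who wants to see that error page without digging through Dumper output, here is a minimal sketch that prints the status line together with the start of the body whenever a request fails. The helper name describe_failure and the 500-character cut-off are my own choices for illustration, and the response is built by hand so the sketch runs without touching the network.

    ```perl
    use strict;
    use warnings;
    use HTTP::Response;

    # Hypothetical helper: summarise a failed response as its status line
    # followed by the first 500 characters of the body.
    sub describe_failure {
        my ($response) = @_;
        return '' if $response->is_success;

        my $body = $response->decoded_content;
        $body = $response->content unless defined $body;

        return $response->status_line . "\n" . substr( $body, 0, 500 ) . "\n";
    }

    # Stand-in for the real 403 response, built by hand (assumption: the
    # real object would come from $ua->get or from $mw->{response}).
    my $response = HTTP::Response->new(
        403, 'Forbidden',
        [ 'Content-Type' => 'text/html' ],
        '<h1>Access forbidden!</h1>',
    );

    print describe_failure($response);
    ```

    In the LWP::UserAgent script from the parent node, the same helper could replace the print in the else branch.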

    My conclusion is that despite setting the agent to 'Mozilla/5.0' the program is still not acting enough like a browser. My naive assessment is that the server is rejecting the request because it looks too much like a bot, but the functionality is available because the same request from a browser works.

    So my question becomes: How do I make the program look more like a browser? Did I miss something in the headers? I can post more information if requested, but I don't know what to look for.
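
    One way to check the headers mechanically rather than by eye is sketched below. It asks LWP::UserAgent to prepare the request without sending it, then lists which header names from a browser capture are absent. The list in @browser_headers is an assumption for illustration only; paste in the names from your own Live HTTP Headers capture.

    ```perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;

    my $ua = LWP::UserAgent->new();
    $ua->agent('Mozilla/5.0');

    my $url = 'https://cabig-kc.nci.nih.gov/Vocab/KC/api.php?action=query&list=allpages&aplimit=max';

    # prepare_request applies the user agent's default headers but sends nothing.
    # The protocol layer adds a few more (such as Host) when a request is
    # actually sent, so those are left out of the comparison.
    my $request = $ua->prepare_request( HTTP::Request->new( GET => $url ) );

    # Assumed header names from a browser capture; replace with your own.
    my @browser_headers = qw( User-Agent Accept Accept-Language Accept-Encoding );

    my @missing = grep { !defined $request->header($_) } @browser_headers;
    print "Sent by the browser but not by LWP: @missing\n";
    ```

    Any name this prints is a candidate to add with $ua->default_header and retest, one at a time.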

    My dear monks, what am I missing?

    Thanks

      The headers may have been equivalent, but they weren't identical. The missing piece is the Accept header.

      RFC 2616 says: If no Accept header field is present, then it is assumed that the client accepts all media types.

      It will work if you add a header that says so explicitly:

      $mw->{ua}->default_header('Accept' => "*/*");
      This is clearly a bug in that web server: it doesn't implement HTTP/1.x as it claims.
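
      The same fix applied to the plain LWP::UserAgent script from the parent node might look like the sketch below. It only prepares the request and prints the headers, so you can confirm that Accept is present before sending anything. I am assuming nothing else about the server needs changing.

      ```perl
      use strict;
      use warnings;
      use LWP::UserAgent;
      use HTTP::Request;

      my $ua = LWP::UserAgent->new();
      $ua->agent('Mozilla/5.0');

      # State explicitly what RFC 2616 says is implied when Accept is absent.
      $ua->default_header( 'Accept' => '*/*' );

      my $url = 'https://cabig-kc.nci.nih.gov/Vocab/KC/api.php?action=query&list=allpages&aplimit=max';

      # Apply the default headers without sending the request.
      my $request = $ua->prepare_request( HTTP::Request->new( GET => $url ) );
      print $request->headers->as_string;

      # $ua->get($url) applies the same default headers when it sends.
      ```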

        Bingo! That did the trick. Thank you very much for your insight. You made my day. :-)

Re: Error accessing MediaWiki API
by Anonymous Monk on Mar 27, 2009 at 20:58 UTC
    Include
    $mw->{error}->{stacktrace}
    in the die message. It may reveal more than FORBIDDEN.
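
    A minimal sketch of that suggestion follows. The helper name format_mw_error is hypothetical, and a plain hash stands in for the MediaWiki::API object so the sketch runs without the module or a network connection; it reads only the code, details and stacktrace keys mentioned in this thread.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: build a die message from the error hash kept on
    # the object, appending the stack trace when one is present.
    sub format_mw_error {
        my ($mw) = @_;
        my $error = $mw->{error} || {};

        my $message = join ': ',
            map { defined $_ ? $_ : '' } @{$error}{qw(code details)};
        $message .= "\n" . $error->{stacktrace} if defined $error->{stacktrace};

        return $message;
    }

    # Plain hash standing in for a MediaWiki::API object (assumption).
    my $fake_mw = {
        error => {
            code       => 2,
            details    => '403 Forbidden',
            stacktrace => '(stack trace text)',
        },
    };

    print format_mw_error($fake_mw), "\n";
    ```

    With the real object, the call in the parent node would end in || die format_mw_error($mw);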