PerlMonks  

Re: Download references list in pdf format with script

by kcott (Abbot)
on Oct 26, 2012 at 03:04 UTC (#1000982)


in reply to Download references list in pdf format with script

Working on the assumption that each reference will turn up exactly one PDF (which I'm not entirely convinced of), the following code should give you a starting point.

#!/usr/bin/env perl

use strict;
use warnings;

use LWP::UserAgent;
use URI::Escape;
use File::Basename;

our $VERSION = '0.001';

my $agent_name = join '/' => basename($0), $VERSION;
my $query_base = 'https://duckduckgo.com/html/?q=';
my $pdf_re = qr{href="([^"]+\.pdf)"};

my $ua = LWP::UserAgent->new(agent => $agent_name);

while (<DATA>) {
    chomp;
    my $req = HTTP::Request->new(GET => $query_base . uri_escape($_));
    $req->content_type('text/html');

    my $res = $ua->request($req);

    if ($res->is_success) {
        print "Search successful.\n";

        if ($res->content =~ $pdf_re) {
            my $pdf_url = $1;
            print "PDF found: $pdf_url\n";
            process_pdf_url($pdf_url);
        }
        else {
            print "PDF not found!\n";
        }
    }
    else {
        print $res->status_line, "\n";
    }
}

sub process_pdf_url {
    my $pdf_url = shift;

    print "Stub - download $pdf_url,\n\trename, upload to database, etc.\n";

    return;
}

__DATA__
1. Abilez O, Benharash P, Mehrotra M, Miyamoto E, Gale A, Picquet J, Xu C, Zarins C (2006) A novel culture system shows that stem cells can be grown in 3D and under physiologic pulsatile conditions for tissue engineering of vascular grafts. J Surg Res 132:170-178.

Output:

$ pm_web_search_pdf.pl
Search successful.
PDF found: http://med.stanford.edu/arts/arts_students/CVs/CV_abilez_092007.pdf
Stub - download http://med.stanford.edu/arts/arts_students/CVs/CV_abilez_092007.pdf,
	rename, upload to database, etc.
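The process_pdf_url stub leaves the download step open. What follows is only a rough sketch of one way it might be filled in, not a tested solution: local_name_for and the $dir argument are hypothetical names introduced here, and it assumes that saving under the URL's last path segment is acceptable. The fetch uses LWP::UserAgent's get with the ':content_file' option, which writes the response body straight to a file.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use File::Basename qw(basename);

# Hypothetical helper: derive a local file name from a URL.
# Takes the last path segment, drops any query string or fragment,
# and replaces characters outside [A-Za-z0-9._-] with "_".
sub local_name_for {
    my ($pdf_url) = @_;

    (my $path = $pdf_url) =~ s/[?#].*\z//s;
    my $name = basename($path);
    $name =~ s/[^A-Za-z0-9._-]/_/g;

    return $name;
}

# Sketch of the download step; returns the saved file name, or nothing on failure.
sub process_pdf_url {
    my ($pdf_url, $dir) = @_;

    # Loaded on demand so local_name_for() can be used without LWP installed.
    require LWP::UserAgent;

    my $file = "$dir/" . local_name_for($pdf_url);
    my $res  = LWP::UserAgent->new->get($pdf_url, ':content_file' => $file);

    if (!$res->is_success) {
        warn "Download of $pdf_url failed: ", $res->status_line, "\n";
        return;
    }

    return $file;
}

# No network access here: just show the name a download would be saved under.
print local_name_for(
    'http://med.stanford.edu/arts/arts_students/CVs/CV_abilez_092007.pdf'
), "\n";
```

Calling process_pdf_url($pdf_url, '.') would then save into the current directory. Renaming and the database upload are still left open, and a real version would probably also want to check that the response really is a PDF.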

-- Ken


Re^2: Download references list in pdf format with script
by bitingduck (Friar) on Oct 26, 2012 at 03:31 UTC

    I suspect your code already runs into one of the big problems the OP will have: if the OP wants the referenced paper itself, rather than documents that merely contain the reference, it's likely to be behind a paywall. A simple "grab the first PDF" approach is likely to return a mix of papers that cite the one the OP is looking for (and which may themselves be behind paywalls) and authors' CVs (which is what you snagged).
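To see how mixed the hits are for a given reference, it may help to list every PDF link in the result page rather than stopping at the first. A minimal sketch, assuming the same href pattern as $pdf_re in the script being discussed; find_pdf_urls is a hypothetical name, and the HTML fragment is invented purely for illustration:

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Return every distinct PDF link found in $html, in order of first appearance.
# Same pattern as $pdf_re, but matched with /g in list context so that all
# captures are returned instead of only the first.
sub find_pdf_urls {
    my ($html) = @_;

    my %seen;
    return grep { !$seen{$_}++ } $html =~ m{href="([^"]+\.pdf)"}g;
}

# Hypothetical search-result fragment, for illustration only.
my $html = <<'HTML';
<a href="http://example.org/cv.pdf">CV</a>
<a href="http://example.org/paper.pdf">Paper</a>
<a href="http://example.org/cv.pdf">CV again</a>
<a href="http://example.org/index.html">Not a PDF</a>
HTML

my @urls = find_pdf_urls($html);

print scalar(@urls), " candidate(s):\n";
print "  $_\n" for @urls;
```

This does nothing about paywalls; it only makes it visible when the "first PDF" is a CV or a citing paper, so that a person (or some later filter) can pick the right candidate.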

      The OP seemed to think that his references would find a direct match; I said I wasn't convinced of this assumption. It would probably be more useful to convey your knowledge of paywalls, etc. to the OP rather than to me.

      I just wrote some code based on the information provided. :-)

      -- Ken

        And a pretty decent start for him indeed. Unfortunately, all of my experience with papers behind paywalls comes from searching by hand and having to go in to work to download them; some of them don't work even when you're VPN'd into the network that has a license. There's enough information in each of the refs to find them, and with any luck the OP is running inside a university or some other place that has a license, and can constrain the search to PubMed or a similar archive.
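Constraining the search could be as simple as appending operators to the query before it is escaped. A rough sketch, assuming the search engine honours site: and filetype: operators (DuckDuckGo documents both, but treat that as something to verify); build_query and its option names are hypothetical, and the host used for PubMed is an assumption:

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Build a search query from one reference line.
# Strips a leading list number ("1. ") and appends optional
# site: and filetype: operators.
sub build_query {
    my ($reference, %opt) = @_;

    (my $query = $reference) =~ s/^\s*\d+\.\s+//;

    $query .= " site:$opt{site}"         if defined $opt{site};
    $query .= " filetype:$opt{filetype}" if defined $opt{filetype};

    return $query;
}

# Title and journal portion of the reference from the thread, shortened for display.
my $ref = '1. A novel culture system shows that stem cells can be grown in 3D '
        . 'and under physiologic pulsatile conditions for tissue engineering '
        . 'of vascular grafts. J Surg Res 132:170-178.';

# The host name below is an assumption, not something stated in the thread.
print build_query($ref, site => 'ncbi.nlm.nih.gov'), "\n";
```

The returned string would then go through uri_escape and be appended to $query_base, exactly as the original loop does with $_.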
