http://www.perlmonks.org?node_id=1000982


in reply to Download references list in pdf format with script

Working on the assumption that each reference will turn up only one PDF (which I'm not entirely convinced of), the following code should give you a starting point. It runs a web search for each reference and takes the first PDF link it finds in the results.

#!/usr/bin/env perl

use strict;
use warnings;

use LWP::UserAgent;
use URI::Escape;
use File::Basename;

our $VERSION = '0.001';

my $agent_name = join '/' => basename($0), $VERSION;
my $query_base = 'https://duckduckgo.com/html/?q=';
my $pdf_re = qr{href="([^"]+\.pdf)"};

my $ua = LWP::UserAgent->new(agent => $agent_name);

while (<DATA>) {
    chomp;
    my $req = HTTP::Request->new(GET => $query_base . uri_escape($_));
    $req->content_type('text/html');

    my $res = $ua->request($req);

    if ($res->is_success) {
        print "Search successful.\n";

        if ($res->content =~ $pdf_re) {
            my $pdf_url = $1;
            print "PDF found: $pdf_url\n";
            process_pdf_url($pdf_url);
        }
        else {
            print "PDF not found!\n";
        }
    }
    else {
        print $res->status_line, "\n";
    }
}

sub process_pdf_url {
    my $pdf_url = shift;
    print "Stub - download $pdf_url,\n\trename, upload to database, etc.\n";
    return;
}

__DATA__
1. Abilez O, Benharash P, Mehrotra M, Miyamoto E, Gale A, Picquet J, Xu C, Zarins C (2006) A novel culture system shows that stem cells can be grown in 3D and under physiologic pulsatile conditions for tissue engineering of vascular grafts. J Surg Res 132:170-178.

Output:

$ pm_web_search_pdf.pl
Search successful.
PDF found: http://med.stanford.edu/arts/arts_students/CVs/CV_abilez_092007.pdf
Stub - download http://med.stanford.edu/arts/arts_students/CVs/CV_abilez_092007.pdf,
	rename, upload to database, etc.
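
If a search does turn up more than one PDF, the same regex can be applied with /g in list context to collect every link rather than just the first. The following is only a rough sketch of that idea: the find_pdf_urls() helper and the sample HTML are made up for illustration (the HTML stands in for $res->content), and real search results may well need a proper HTML parser.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Same pattern as in the script above.
my $pdf_re = qr{href="([^"]+\.pdf)"};

# Hypothetical helper: return every distinct PDF link found in $html,
# in order of first appearance.
sub find_pdf_urls {
    my ($html) = @_;
    my %seen;
    # In list context, a /g match returns all captures of the one group.
    return grep { !$seen{$_}++ } $html =~ /$pdf_re/g;
}

# Made-up HTML standing in for $res->content.
my $html = <<'HTML';
<a href="http://example.com/a.pdf">A</a>
<a href="http://example.com/page.html">B</a>
<a href="http://example.com/b.pdf">C</a>
<a href="http://example.com/a.pdf">A again</a>
HTML

print "PDF found: $_\n" for find_pdf_urls($html);
```

Each URL returned could then be passed to process_pdf_url() in turn, or shown to the user to pick the right one.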

-- Ken