PerlMonks  

Download references list in pdf format with script

by Anonymous Monk
on Oct 26, 2012 at 01:43 UTC ( #1000979=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks. I have been assigned a task that involves taking a list of references, downloading the corresponding PDF for each one, naming the PDF, and uploading it to a private database. The list is huge, and I know there must be a good way to write a script to do it for me. There is no specified way to get the PDFs; I can use Google or any other tool to find them, and each reference should have enough information for a direct match. Does anyone have any ideas on how I might go about writing a script to do this? Thanks so much! Below is an example reference; the list has a few thousand entries just like it.

1.    Abilez O, Benharash P, Mehrotra M, Miyamoto E, Gale A, Picquet J, Xu C, Zarins C (2006) A novel culture system shows that stem cells can be grown in 3D and under physiologic pulsatile conditions for tissue engineering of vascular grafts. J Surg Res 132:170-178.
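Since every entry follows the same "Authors (Year) Title. Journal Volume:Pages." shape, one way to handle the "naming the PDF" step is to split each reference into fields first. The following is only a minimal sketch: the regex assumes all entries look like the example above, and the function names and the file-naming scheme (first author, year, journal, volume, pages) are made up for illustration, not anything the task specifies.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Split one reference of the assumed form
#   "N. Authors (YEAR) Title. Journal VOL:PAGES."
# into named fields. Returns undef if the line does not fit that shape.
sub parse_reference {
    my ($line) = @_;
    return unless $line =~ m{
        ^\s*(?:\d+\.\s+)?          # optional list number
        (?<authors>.+?)\s+         # author list, up to the year
        \((?<year>\d{4})\)\s+      # four-digit year in parentheses
        (?<title>.+?)[.?!]\s+      # title, up to its closing punctuation
        (?<journal>[^.]+?)\s+      # abbreviated journal name
        (?<volume>\d+)\s*:\s*      # volume number
        (?<pages>[\d-]+)\.?\s*$    # page range
    }x;
    my %fields = %+;               # copy the named captures
    return \%fields;
}

# Hypothetical naming scheme: FirstAuthor_Year_Journal_Volume_Pages.pdf
sub pdf_filename {
    my ($ref) = @_;
    my ($first_author) = $ref->{authors} =~ /^([^\s,]+)/;
    (my $journal = $ref->{journal}) =~ s/\s+/_/g;
    return join('_', $first_author, $ref->{year}, $journal,
                $ref->{volume}, $ref->{pages}) . '.pdf';
}

my $example = '1.    Abilez O, Benharash P, Mehrotra M, Miyamoto E, Gale A, '
            . 'Picquet J, Xu C, Zarins C (2006) A novel culture system shows '
            . 'that stem cells can be grown in 3D and under physiologic '
            . 'pulsatile conditions for tissue engineering of vascular grafts. '
            . 'J Surg Res 132:170-178.';

my $parsed = parse_reference($example) or die "Could not parse reference\n";
print pdf_filename($parsed), "\n";
```

For the example reference this prints Abilez_2006_J_Surg_Res_132_170-178.pdf. Entries that do not match the pattern return undef, so they can be logged and handled by hand rather than silently misnamed.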

Re: Download references list in pdf format with script
by kcott (Abbot) on Oct 26, 2012 at 03:04 UTC

    Working on the assumption that each reference will match exactly one PDF (which I'm not entirely convinced of), the following code should give you a starting point.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use LWP::UserAgent;
    use URI::Escape;
    use File::Basename;

    our $VERSION = '0.001';

    my $agent_name = join '/' => basename($0), $VERSION;
    my $query_base = 'https://duckduckgo.com/html/?q=';
    my $pdf_re = qr{href="([^"]+\.pdf)"};

    my $ua = LWP::UserAgent->new(agent => $agent_name);

    while (<DATA>) {
        chomp;
        my $req = HTTP::Request->new(GET => $query_base . uri_escape($_));
        $req->content_type('text/html');
        my $res = $ua->request($req);

        if ($res->is_success) {
            print "Search successful.\n";
            if ($res->content =~ $pdf_re) {
                my $pdf_url = $1;
                print "PDF found: $pdf_url\n";
                process_pdf_url($pdf_url);
            }
            else {
                print "PDF not found!\n";
            }
        }
        else {
            print $res->status_line, "\n";
        }
    }

    sub process_pdf_url {
        my $pdf_url = shift;
        print "Stub - download $pdf_url,\n\trename, upload to database, etc.\n";
        return;
    }

    __DATA__
    1. Abilez O, Benharash P, Mehrotra M, Miyamoto E, Gale A, Picquet J, Xu C, Zarins C (2006) A novel culture system shows that stem cells can be grown in 3D and under physiologic pulsatile conditions for tissue engineering of vascular grafts. J Surg Res 132:170-178.

    Output:

    $ pm_web_search_pdf.pl
    Search successful.
    PDF found: http://med.stanford.edu/arts/arts_students/CVs/CV_abilez_092007.pdf
    Stub - download http://med.stanford.edu/arts/arts_students/CVs/CV_abilez_092007.pdf,
            rename, upload to database, etc.

    -- Ken

      I suspect your code already runs into one of the big problems the OP will have: if the OP is looking for the referenced paper itself, rather than things that contain the reference, it's likely to be behind a paywall. A simple "grab the first PDF" approach is likely to return some mix of papers that cite the one the OP wants (and which may themselves be behind paywalls) and CVs of the authors (which is what you snagged).

        The OP seemed to think that his references would find a direct match; I said I wasn't convinced of this assumption. It would probably be more useful to convey your knowledge of paywalls, etc. to the OP rather than to me.

        I just wrote some code based on the information provided. :-)

        -- Ken
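One way to soften the "first PDF wins" problem discussed in this subthread is to collect every PDF link on the results page for later review, and to check that whatever gets downloaded really is a PDF rather than an HTML paywall or login page. The sketch below is an illustration only: the function names and the example.org URLs are invented, and the link regex is the same simple pattern as in the code above (with case-insensitive matching added), not a full HTML parser.

```perl
use strict;
use warnings;

# Return every distinct href ending in ".pdf", in page order,
# instead of stopping at the first match.
sub pdf_links {
    my ($html) = @_;
    my %seen;
    return grep { !$seen{$_}++ } $html =~ m{href="([^"]+\.pdf)"}gi;
}

# A PDF file starts with the bytes "%PDF-"; an HTML error or
# login page saved under a .pdf name does not.
sub looks_like_pdf {
    my ($bytes) = @_;
    return defined $bytes && substr($bytes, 0, 5) eq '%PDF-';
}

# Made-up results page with a duplicate link.
my $page = <<'HTML';
<a href="http://example.org/cv/author_cv.pdf">CV</a>
<a href="http://example.org/papers/jsr_132_170.pdf">Paper</a>
<a href="http://example.org/cv/author_cv.pdf">CV again</a>
HTML

print "$_\n" for pdf_links($page);
```

The candidate list still needs a person (or a stricter heuristic) to pick the right entry; the magic-byte check only confirms the file type, not that the file is the cited paper.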

Re: Download references list in pdf format with script
by bitingduck (Friar) on Oct 26, 2012 at 03:22 UTC

    You should probably read up on LWP and DBI, and there's probably a Google API that you can use for searching (I haven't done any spidering using a search engine). I learned how to do this sort of thing from O'Reilly's "Spidering Hacks" book, which probably has enough examples that you can cobble something together. You can probably find enough on the web as well, but it will be more disjointed.

    Ideally there will be a few databases that have the PDFs (various journal article databases), and you can either use the APIs they provide or figure out how to screen-scrape them. It probably won't be trivial, but you'll learn a lot of Perl on the way.

Re: Download references list in pdf format with script
by BrowserUk (Pope) on Oct 26, 2012 at 03:50 UTC

    I'd strongly recommend that you split your task into three separate processes:

    1. Finding the urls to match the references.

      You are probably better off using one of the search engine APIs for this bit.

      And using a human being to review the search results and pick out the appropriate urls.

    2. Downloading the PDFs.

      Once you have your urls, there is no real advantage to using Perl rather than (say) wget for doing the downloading.

      Perl is, though, ideally suited to driving the process: running wget, checking for success, repeating for failures, etc.

    3. Processing the PDFs into your database.

      Once you have the PDFs, whether you use Perl or your DB's bulk uploader to populate the DB will very much depend on exactly what information you are going to store in the DB, and where you will be getting that information from.
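Step 2 of the split above (Perl driving wget, checking success, repeating for failures) might look something like the following. This is a rough sketch under assumptions: it expects wget on the PATH, the function names and option names are invented here, and the fetcher is passed as a code reference purely so the retry logic can be exercised without touching the network.

```perl
use strict;
use warnings;

# Run wget for one URL; returns true when wget exits with status 0.
sub wget_fetch {
    my ($url, $outfile) = @_;
    # The list form of system() bypasses the shell, so characters
    # such as & or ? in the URL need no quoting.
    return system('wget', '-q', '-O', $outfile, $url) == 0;
}

# Try a fetcher up to max_tries times, pausing between attempts.
# Returns the number of the successful attempt, or 0 on failure.
sub fetch_with_retries {
    my ($url, $outfile, %opt) = @_;
    my $fetch     = $opt{fetch}     || \&wget_fetch;
    my $max_tries = $opt{max_tries} || 3;
    my $pause     = $opt{pause}     // 5;    # seconds between attempts

    for my $try (1 .. $max_tries) {
        # Count it as a success only if a non-empty file was written.
        return $try if $fetch->($url, $outfile) && -s $outfile;
        unlink $outfile;                     # discard any partial download
        sleep $pause if $pause && $try < $max_tries;
    }
    return 0;
}

# Usage (hypothetical file name):
#   fetch_with_retries($url, 'paper.pdf')
#       or warn "Giving up on $url\n";
```

Logging the URLs that still fail after the last attempt gives a short list to retry later or fetch by hand.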



Re: Download references list in pdf format with script
by ansh batra (Friar) on Oct 26, 2012 at 12:26 UTC

    Scan the list of references using foreach:
    {
    system("wget http://www.google.com/search?q=reference keywords+filetype:pdf")
    while doing the wget, give a destination file name, say temp
    process temp
    find the first URL using regular expressions and save it in a variable
    delete temp
    now wget that URL and rename the downloaded PDF file
    and upload it to the server
    }

    P.S. Sorry, I don't have a Linux machine and a Perl compiler, or else I would have provided you the code.
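One detail the outline above glosses over is that a reference contains spaces, colons and commas, so it has to be percent-encoded before it goes into the search URL. A minimal sketch of that step follows; the function names are invented here, and the encoder is hand-rolled only to keep the example free of non-core modules (URI::Escape, already used earlier in this thread, is the more usual choice).

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Percent-encode everything except the RFC 3986 "unreserved"
# characters (letters, digits, hyphen, dot, underscore, tilde).
sub percent_encode {
    my ($text) = @_;
    my $bytes = encode_utf8($text);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

# Build a search URL restricted to PDF results, using the
# filetype:pdf operator from the outline above.
sub search_url {
    my ($base, $keywords) = @_;
    return $base . percent_encode("$keywords filetype:pdf");
}

print search_url('http://www.google.com/search?q=',
                 'Abilez 2006 J Surg Res 132:170-178'), "\n";
```

The resulting URL can then be handed to wget (ideally via the list form of system(), so the shell never sees it) or to LWP.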
