Extract file links from Google search results

by Scott7477 (Chaplain)
on Jun 20, 2007 at 14:17 UTC

This code takes an HTML file containing the results of a Google search and extracts the links to files of a specified type into a text file. I have found having such links in a text file useful for automating the download of files of a specified type, PDF files for example.
use strict;
use LWP::Simple;
use HTML::SimpleLinkExtor;

#usage googlestrip file:///C:/googlesearchresult.htm > urllist.txt
my $url         = shift;
my $filetype    = "pdf";
my $filetypelen = length($filetype);
my $offset      = -$filetypelen;

# fetch the search-result page into a temporary file on disk
my $fileget = getstore($url, "tempfile.html");

# pull every <a href="..."> out of the page
my $extor = HTML::SimpleLinkExtor->new();
$extor->parse_file("tempfile.html");
my @a_hrefs = $extor->a;

# keep only links whose trailing characters match the wanted extension
my @pdflist;
for my $element (@a_hrefs) {
    my $suffix = substr($element, $offset, $filetypelen);
    if ($suffix =~ m/$filetype/) {
        push @pdflist, $element;
    }
}

# print the links, skipping any that match this ad hoc pattern
for my $url (@pdflist) {
    next if ($url =~ m/\/s.*pdf/);
    print $url;
    print "\n";
}

unlink "tempfile.html" or die "can't unlink tempfile.html: $!";

Updated to eliminate the unnecessary list per blazar's suggestion; also added code to delete the temp file after the code is done with it.

Replies are listed 'Best First'.
Re: Extract file links from Google search results
by blazar (Canon) on Jun 22, 2007 at 16:07 UTC
    This code takes an HTML file containing the results of a Google search and extracts the links to files of a specified type into a text file. I have found having such links in a text file useful for automating the download of files of a specified type, PDF files for example.

    First of all, let me say that, however typical, this is a nice example for a CUfP, given its instructive value. Also, while I'm the prototypical guy yelling at newbies not to parse *HTML with regexen, with some shame I admit that in the past, when I needed "this sorta thing", I used to do exactly that myself, in one-liners. Of course I was not 100% concerned about reliability in those cases. In any case, well done!

    Then I have some remarks. First of all, without going into specific locations in the code, in some places you factor out the "pdf" extension, seemingly paving the way for letting a user specify one, yet elsewhere you hardcode it again. Also, the choice of a variable name like @pdflist may be slightly misleading in the long run.

    #usage googlestrip file:///C:/googlesearchresult.htm > urllist.txt

    Don't you think you would want to either pass the program an actual file to slurp in with Perl's own tools, or a generic URL to fetch off the web? Also... I see nothing here that's Google-specific, so you could have made the whole thing more agnostic name-wise.
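
    A minimal sketch of that idea (the file-or-URL dispatch below is illustrative, not code from the original post):

    use strict;
    use warnings;
    use LWP::Simple;

    # accept either a local file to slurp or a URL to fetch
    my $arg = shift or die "Usage: $0 FILE-OR-URL\n";
    my $html;
    if (-e $arg) {
        # a plain open-and-slurp with Perl's own tools
        open my $fh, '<', $arg or die "can't open $arg: $!";
        local $/;    # undef the record separator: read it all at once
        $html = <$fh>;
    }
    else {
        defined($html = get $arg) or die "Couldn't get <$arg>\n";
    }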

    my $fileget = getstore($url,"tempfile.html");

    Why store into an actual file on disk? Why not a simple get()? The file won't be huge anyway. But if you really want it on disk, why hardcode the name? (Without unlinking it, that I can see.) Why not use a File::Temp one instead?
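
    For instance, a sketch of the File::Temp variant (the options are standard File::Temp, but this exact snippet is illustrative):

    use File::Temp qw(tempfile);
    use LWP::Simple qw(getstore is_success);

    # a safely created temp file, removed automatically at exit
    my ($fh, $tmpname) = tempfile(SUFFIX => '.html', UNLINK => 1);
    my $rc = getstore($url, $tmpname);
    die "Couldn't fetch <$url>: $rc\n" unless is_success($rc);
    # ... then parse $tmpname just as before ...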

    my $suffix = substr($element, $offset, $filetypelen);
    if ($suffix =~ m/$filetype/) {
        push @pdflist, $element;

    You know, sometimes I feel people tend to abuse regexen where other tools (like substr or index) would do. But in this particular case you're doing just the opposite: it smells slightly moronzillonic. Curiously enough, given your approach, the last test does use a match (with no \Q), whereas a simple eq would have been better suited.
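
    In other words, with the suffix already in hand, a plain string comparison suffices (a sketch of the point, reusing the original variables):

    my $suffix = substr($element, -$filetypelen);
    push @pdflist, $element if $suffix eq $filetype;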

    Also, the whole thing is much like a single grep.
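
    Something like this (a sketch of the grep equivalent, reusing the variables from the original code):

    my @pdflist = grep { substr($_, -$filetypelen) eq $filetype } @a_hrefs;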

    my @list = sort @pdflist;

    And you have created yet another array just to hold some values, when one would have sufficed.

    for my $url (@list) {
        next if ($url =~ m/\/s.*pdf/);
        print $url;
        print "\n";
    }

    Awkward regex (what are you trying to do, anyway? I suppose this is an ad hoc solution to a requirement of yours) and awkward flow control. Why do the check on the sorted list anyway, rather than together with the previous one?

    All in all I'd rewrite your app in the following manner, which also behaves in a slightly different way:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use LWP::Simple;
    use HTML::SimpleLinkExtor;

    die "Usage: $0 URL <extension> [<extensions>]\n" unless @ARGV >= 2;

    my $url    = shift;
    my $wanted = join '|', map quotemeta, @ARGV;
    $wanted    = qr/\.(?:$wanted)$/;

    defined(my $html = get $url) or die "Couldn't get <$url>\n";

    {
        local $, = "\n";
        print sort grep /$wanted/, HTML::SimpleLinkExtor->new->parse($html)->a;
    }

    __END__
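
    A hypothetical invocation of the rewrite (the script name and query are made up for illustration):

    perl linkextract.pl 'http://www.google.com/search?q=perl+tutorial' pdf ps > urllist.txt
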
      Thanks for the comments, blazar. I appreciate your taking the time to look at my code. I must admit that when it comes to regexes I am very much a newbie. With regard to getting generic URLs, I specifically wanted the link URLs that a Google search generates, which tend to be long hairballs that I didn't know how to handle well directly.

      In any case your comments have given me some ideas as for tools to use in the future.

      Ravenor
      See my Standard Code Disclaimer
        I must admit that when it comes to regexes I am very much a newbie.

        /me too; that's why I generally try not to be too smart with them. Of course, sometimes sharpening one's own skills is not a bad thing, and a look at the documentation is well worth it.

        With regard to getting generic URLs, I specifically wanted the link URLs that a Google search generates, which tend to be long hairballs that I didn't know how to handle well directly.

        Well, Google URLs can be as simple as http://www.google.com/search?q=cool+perl+stuff. In fact, notwithstanding FF's cool search box available at a cheap keybinding, I often find myself composing them manually. In that case, two parameters that happen to be useful for me are num and filter, as in num=100&filter=0. Of course, this has nothing to do with Perl...
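
        Put together, such a hand-rolled URL might look like this (the query is just an example):

        http://www.google.com/search?q=cool+perl+stuff&num=100&filter=0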

Re: Extract file links from Google search results
by billisdog (Sexton) on Jun 25, 2007 at 15:07 UTC
    One thing you should also keep in mind is that, according to Google's TOS, they can start blocking any requests that appear to come from a script or from a search-engine aggregator. The limit, to be fair, is quite high, since some of us are perfectly capable of generating five thousand legitimate Google searches in a day; but if you are building this into a web app that might rack up hundreds of searches a minute, you may find Google no longer responds. That's why they have the Search API, which, surprise, limits you to a few thousand requests per day. :(
      Your point is well taken; I get the Google pages I process by downloading them manually, as I am well aware of their TOS. Clearly, using the Search API would be a cleaner way to accomplish my task; I was going for a quick and dirty solution. I happen to have an API key myself; I just haven't had the time to put together an app using it yet. Interesting to know that they won't cut you off too quickly...
Re: Extract file links from Google search results
by Anonymous Monk on Jul 26, 2007 at 21:49 UTC
    Something similar with the general-purpose xml/html extract utility xmlgrep:

    GET -HUser-Agent:Mozilla/5.0 'http://www.google.com/search?hl=hr&q=bbbike&btnG=Google+pretraga&lr=' | xmlgrep -parse-html -as-html '//a[@class="l"]/@href'

    Basically, xmlgrep is just a grep which uses XPath expressions. It can be downloaded here.

Re: Extract file links from Google search results
by Anonymous Monk on Jul 23, 2007 at 15:59 UTC
    Useful to know how to do, but wget has this as a standard feature.
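
    For reference, a sketch of the wget route (the URL here is a placeholder; -r recurses, -l 1 limits recursion to one level, -A pdf keeps only files ending in .pdf):

    wget -r -l 1 -A pdf http://example.com/page-with-links.html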
