PerlMonks  

WWW::Google

by Amoe (Friar)
on Jan 21, 2002 at 03:10 UTC

Category: Web Stuff
Author/Contact Info Amoe. See pod.
Description:

Replacement for the WWW::Search::Google module. I apologise for the scrappiness of the code, but at least it works.

Thanks crazyinsomniac and hacker.

Update 06/03/2002: Surprisingly, this module still works. After all the changes that Google has gone through since I first released it, I would have expected it to break a long time ago, considering that it parses HTML rather than some stable format. There's an interesting story at Slashdot about googling via SOAP - maybe this is the future direction this module could take?

package WWW::Google;
use strict;

# Google.pm - amoe 20/01/2002
# hackish module to search google programmatically

use LWP::UserAgent;
use HTTP::Request;
use HTML::TokeParser;
use URI::Escape;

# /me apologises in advance

sub new {
    my $class = shift;
    my $self = bless {}, $class;
    
    my $agent_name = shift
    || "WWW-Google/0.1 ($^O; http://amoe.perlmonk.org/techno/perl/projects/www_google/)";
    my $agent = LWP::UserAgent->new;
    $agent->agent($agent_name);
    
    $self->{cgiloc} = ['http://www.google.com/',
                       'search'];
    $self->{place}  = 0;
    $self->{agent}  = $agent;
    
    while (my ($key, $value) = splice @_, 0, 2) {
        $self->{$key} = $value;
    }
    
    return $self;
}

sub build {
    my $self = shift;
    
    my @bits = $self->cgiloc;
    
    my $query = join('' => shift @bits, shift @bits,
                            '?', 'q=', $self->query);
    if (@bits) {
        $query .= '&' . join('&', @bits);
    }
    
    my $res = $self->agent->request(HTTP::Request->new(GET => $query));
    
    my $parsee = HTML::TokeParser->new(\$res->content);
    $self->parsee($parsee);
    
    return $res;
}

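sub next_result {
    my $self = shift;
    my $result = {};
    
    # Keep fetching pages until a result turns up. There is no end-of-results
    # check, so callers must stop the loop themselves.
    while (!%$result) {
        # A hit is taken to be a <p> whose next tag is an <a>.
        while (my $tag = $self->parsee->get_tag('p')) {
            my $link = $self->parsee->get_tag or last;
            unless ($link->[0] eq 'a') {
                # Push a <p> back so the next pass can inspect it; get_tag
                # drops the leading "S" type marker that unget_token expects.
                $self->parsee->unget_token(['S', @$link])
                    if $link->[0] eq 'p';
                next;
            }
            $result->{url}   = $link->[1]->{href};
            $result->{title} = $self->parsee->get_trimmed_text('/a');
            
            return $result;
        }
    } continue {
        # Current page exhausted: advance by ten results and fetch the next.
        $self->place($self->place + 10);
        
        $self->cgiloc(($self->cgiloc)[0, 1],
                       'start=' . $self->place);
                      
        $self->build;
    }
}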

sub query {
    my $self = shift;
    if (@_) {
        $self->{query} = uri_escape(shift);
    } else {
        return $self->{query};
    }
}

sub place {
    my $self = shift;
    if (@_) {
        $self->{place} = shift;
    } else {
        return $self->{place};
    }
}

sub cgiloc {
    my $self = shift;
    if (@_) {
        $self->{cgiloc} = [@_];
    } else {
        return @{$self->{cgiloc}};
    }
}

sub parsee {
    my $self = shift;
    if (@_) {
        $self->{parsee} = shift;
    } else {
        return $self->{parsee};
    }
}

sub agent { shift->{agent} }

1;

__END__

=pod

=head1 NAME

WWW::Google - Temporary replacement for WWW::Search::Google

=head1 SYNOPSIS

 use WWW::Google;

 my $search = WWW::Google->new;

 # build up query in $q

 $search->query($q);
 $search->build;

 while (my $res = $search->next_result) {
     print $res->{url}, ': ', $res->{title}, "\n";
 }

 $search->cgiloc('http://www.google.de/', 'search');    # use German Google
 $search->place(50);    # start at result 50

=head1 DESCRIPTION

This module uses the search engine Google to find websites related to a
particular term. The C<WWW::Search> modules are supposed to do this, but it
seems none of them work properly. So I decided to code up a hackish replacement
to use in the meantime. And here it is. And here are its methods:

=over 4

=item new

Returns a C<WWW::Google> object. Takes the name of the search robot as the
first argument, followed by an optional list of name-value pairs to set the
object up. Possible names are cgiloc, place and query, all of which perform
basically the same task as the methods of the same names, with one exception:
query strings are autoescaped by the C<query> method, whereas they're passed in
raw if you use the C<new> interface. Note also that C<cgiloc> must be given as
an array reference here, since C<new> stores the value exactly as passed.

=item build

Gets a query page and sets it up for parsing. It takes no arguments, and must
be called before C<next_result> is.

=item query

Sets the query for the object to use when C<build> gets called. If called
without argument, returns the current query string. Queries are automatically
URI-encoded.

=item place

The result offset at which to start the search. By default, it starts at the
first page of results, i.e. C<0>. Multiples of ten are probably best.

=item cgiloc

Specify a different location for C<build> to get the query result from. Takes
the base URL, including its trailing slash, and the script name; C<build> joins
the two as-is. Can be used to specify national variants of Google, presuming
they use the same HTML format as the google.com one. This is experimental.

=item next_result

Returns a reference to a hash containing two keys, C<url> and C<title>, which
contain the path to the search result and the title of the search result. This
is what you use to get the search results. If you use this in a loop, it will
probably turn infinite because of the sheer number of search results. You'll
have to exit it early with a C<last> or something once you hit your desired
number of results.

=back

=head1 NOTES

THE DADDY OF WHEEL-REINVENTION!

This is almost certainly very buggy - it was written in about an hour, but it
does the job. The code looks horrible and probably runs slower than it should.

People will probably be wanting the excerpt of text Google provides. Well, I
found it was pretty hard to parse this - the problem being that some sites have
categories and some don't, so how can you judge where the text ends? Well, you
can, but I couldn't be bothered at the time. I will get around to it.

=head1 AUTHOR

Amoe. Thanks to crazyinsomniac and hacker.

=head1 CONTACT

Amoe on perlmonks.org.

or email C<subvert underscore you at hotmail dot com>.

The website will be at

 http://amoe.perlmonk.org/techno/perl/projects/www_google/

if I ever get it up.

=head1 COPYRIGHT

Free (substandard) software, daddy.

This program is free software. You may copy or
redistribute it under the same terms as Perl itself.

=cut
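
Since the POD warns that a C<next_result> loop will probably never end on its own, here is a minimal sketch of one way to cap it. The helper name take_results and the fake iterator are assumptions made for illustration and are not part of the module; the stub stands in for $search->next_result so the sketch runs offline.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Google): pull at most $limit
# results from an iterator. $next is any code reference that returns one
# result hash reference per call, or a false value when it has no more.
sub take_results {
    my ($next, $limit) = @_;
    my @results;
    while (@results < $limit) {
        my $res = $next->() or last;
        push @results, $res;
    }
    return @results;
}

# Stand-in for $search->next_result: an endless stream of fake results.
my $n = 0;
my $fake_next = sub {
    $n++;
    return { url => "http://example.com/$n", title => "Result $n" };
};

my @top = take_results($fake_next, 5);
print "$_->{url}: $_->{title}\n" for @top;
```

With the real module, the same helper could presumably be called as take_results(sub { $search->next_result }, 20) after $search->build.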

Re: WWW::Google
by IlyaM (Parson) on Jan 21, 2002 at 03:37 UTC
    IIRC Google can return search results in XML. It could be slightly easier and less error-prone to parse that than to parse HTML, which can change any day.

    --
    Ilya Martynov (http://martynov.org/)

      Definitely, it would be preferable to do that. I put a little research into the topic and couldn't find anything - Mostly only searched their "Services" page though.

      That would be much better, I could solve some parsing problems...*checks*



      --
      my one true love
        This PDF file mentions that you can use HTTP requests like
        GET http://google.com/xml?q=YOUR_QUERY_HERE
        to get search results in XML.

        But I've just checked it again and it seems it doesn't work anymore :(

        --
        Ilya Martynov (http://martynov.org/)

Re: WWW::Google
by Amoe (Friar) on Mar 19, 2002 at 16:33 UTC

    Having seen hossman's enlightening node, I suppose I'd better disclaim this module. It probably goes without saying that I didn't read the Google TOS. I agree with hossman on this issue; the existence of this code isn't a violation of the TOS. If you're paranoid, you may wish to change the user-agent string it sends. Use at your own risk, and stuff.

    And as for the TOS itself, if it wasn't serious it would be funny. For those worried, I think there isn't much chance that Google will sue you for using this module. I think that the TOS itself is overly harsh; as hossman noted, you could call all sorts of things automated searching. For example, I search by typing a phrase into my location bar in Mozilla and clicking a search button. Because I didn't enter the phrase in the textbox, does that make it illegal?

    As ever, IANAL. God, I hate the legal system.


    --
    my one true love
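
    For the user-agent change mentioned above: the module's constructor takes the agent name as its first argument and hands it to LWP::UserAgent. A minimal sketch of that underlying call, assuming LWP is installed; the agent string is a made-up placeholder:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical agent string; substitute whatever identifies your own script.
my $agent_name = 'MyResearchScript/0.1';

my $ua = LWP::UserAgent->new;
$ua->agent($agent_name);    # the same call WWW::Google's constructor makes

print $ua->agent, "\n";
```

    With the module itself, passing the string as the first argument to WWW::Google->new should have the same effect.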
