Category: Web Stuff
Author/Contact Info Amoe. See pod.

Replacement for the WWW::Search::Google module. I apologise for the scrappiness of the code, but at least it works.

Thanks crazyinsomniac and hacker.

Update 06/03/2002: Surprisingly, this module still works. After all the changes that Google has gone through since I first released it, I would have expected it to break long ago, considering it parses HTML rather than some stable format. There's an interesting story at slashdot about googling via SOAP - maybe this is the future direction this module could take?

package WWW::Google;
use strict;

# - amoe 20/01/2002
# hackish module to search google programmatically

use LWP::UserAgent;
use HTTP::Request;
use HTML::TokeParser;
use URI::Escape;

# /me apologises in advance

sub new {
    my $class = shift;
    my $self = bless {}, $class;
    my $agent_name = shift 
    || "WWW-Google/0.1 ($^O;
    my $agent = LWP::UserAgent->new;
    $self->{cgiloc} = ['',
    $self->{place}  = 0;
    $self->{agent}  = $agent;
    # remaining arguments are name-value pairs stored directly on the object
    while (my ($key, $value) = splice @_, 0, 2) {
        $self->{$key} = $value;
    }
    return $self;
}

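sub build {
    my $self = shift;
    my @bits = $self->cgiloc;
    # the first two elements are the base URL and the script name;
    # anything left over is appended as extra CGI parameters
    my $query = join('' => shift @bits, shift @bits,
                            '?', 'q=', $self->query);
    if (@bits) {
        $query .= '&' . join('&', @bits);
    }
    my $res = $self->agent->request(HTTP::Request->new(GET => $query));
    # keep the parser on the object so next_result can read from it
    $self->parsee(HTML::TokeParser->new(\$res->content));
    return $res;
}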

sub next_result {
    my $self = shift;
    my $result = {};
    while (!%$result) {
        while (my $tag = $self->parsee->get_tag('p')) {
            my $a = $self->parsee->get_tag;
            # skip paragraphs that do not open with a link
            unless ($a->[0] eq 'a') {
                next;
            }
            $result->{url}   = $a->[1]->{href};
            $result->{title} = $self->parsee->get_trimmed_text('/a');
            return $result;
        }
    } continue {
        # current page is exhausted: move on ten results and refetch
        $self->place($self->place + 10);
        $self->cgiloc(($self->cgiloc)[0, 1],
                       'start=' . $self->place);
        $self->build;
    }
}

sub query {
    my $self = shift;
    if (@_) {
        $self->{query} = uri_escape(shift);
    } else {
        return $self->{query};
    }
}

sub place {
    my $self = shift;
    if (@_) {
        $self->{place} = shift;
    } else {
        return $self->{place};
    }
}

sub cgiloc {
    my $self = shift;
    if (@_) {
        $self->{cgiloc} = [@_];
    } else {
        return @{$self->{cgiloc}};
    }
}

sub parsee {
    my $self = shift;
    if (@_) {
        $self->{parsee} = shift;
    } else {
        return $self->{parsee};
    }
}

sub agent { shift->{agent} }

1;




=head1 NAME

WWW::Google - Temporary replacement for WWW::Search::Google

=head1 SYNOPSIS


 use WWW::Google;

 my $search = WWW::Google->new;

 # build up query in $q

 $search->query($q);
 $search->build;

 while (my $res = $search->next_result) {
     print $res->{url}, ': ', $res->{title}, "\n";
 }

 $search->cgiloc('', 'search');    # use German Google
 $search->place(50);    # start at result 50
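
The C<cgiloc> and C<place> settings only change the URL that C<build>
requests. A minimal sketch of that assembly, mirroring the C<join> logic in
C<build>, might look like the following. The helper name C<build_url>, the
host C<www.google.example> and the hand-rolled escaping (a stand-in for
C<uri_escape> from C<URI::Escape>, so the snippet needs no extra modules) are
assumptions made for illustration and are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: assemble the request URL the way build() does.
# $base and $script correspond to the first two cgiloc elements;
# @extra holds further "name=value" pairs such as "start=50".
sub build_url {
    my ($base, $script, $query, @extra) = @_;

    # Stand-in for URI::Escape::uri_escape, for byte strings only:
    # percent-encode everything outside the unreserved set.
    (my $escaped = $query) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;

    my $url = join '', $base, $script, '?', 'q=', $escaped;
    $url .= '&' . join('&', @extra) if @extra;
    return $url;
}

# The host below is a placeholder, not a real Google address.
print build_url('http://www.google.example/', 'search', 'perl monks', 'start=50'), "\n";
```

Keeping the base URL and script name as separate elements is what lets
C<next_result> swap in a new C<start=> parameter without touching the rest.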


=head1 DESCRIPTION

This module uses the search engine Google to find websites related to a
particular term. The C<WWW::Search> modules are supposed to do this, but it
seems none of them work properly. So I decided to code up a hackish
replacement to use in the meantime. And here it is. And here are its methods:

=over 4

=item new

Returns a C<WWW::Google> object. Takes the name of the search robot as the
first argument, followed by an optional list of name-value pairs to set the
object up. Possible names are cgiloc, place and query, all of which perform
basically the same task as the methods of the same names, with one exception:
query strings are autoescaped by the C<query> method, whereas they're passed
in raw if you use the C<new> interface. Values are stored as given, so
C<cgiloc> must be an array reference here.

=item build

Gets a query page and sets it up for parsing. It takes no arguments, and must
be called before C<next_result> is.

=item query

Sets the query for the object to use when C<build> gets called. If called
without an argument, returns the current query string. Queries are
automatically escaped.

=item place

The result number to start the search at. By default, it starts at the
first page of results, i.e. C<0>. Multiples of ten are probably best.

=item cgiloc

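Specify a different location for C<build> to get the query result from. The
first two arguments are the base URL and the script name; any further
arguments are appended to the query string as extra CGI parameters. Can be
used to specify national variants of Google, presuming they use the same HTML
format as the default one. This is experimental.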

=item next_result

Returns a hash reference containing two keys, C<url> and C<title>, which
contain the address of the search result and the title of the search result.
This is what you use to get the search results. If you use this in a loop, it
will probably turn infinite because of the sheer number of search results.
You'll have to exit it early with a C<last> or something once you hit your
desired number of results.

=back

=head1 NOTES


This is almost certainly very buggy - it was written in about an hour, but it
does the job. The code looks horrible and probably runs slower than it should.

People will probably be wanting the excerpt of text Google provides. Well, I
found it was pretty hard to parse this - the problem being that some sites have
categories and some don't, so how can you judge where the text ends? Well, you
can, but I couldn't be bothered at the time. I will get around to it.

=head1 AUTHOR

Amoe. Thanks to crazyinsomniac and hacker.

=head1 CONTACT

Amoe on

or email C<subvert underscore you at hotmail dot com>.

The website will be at

if I ever get it up.


Free (substandard) software, daddy.

This program is free software. You may copy or
redistribute it under the same terms as Perl itself.
