Description: |
Replacement for the WWW::Search::Google module. I apologise for the scrappiness of the code, but at least it works.
Thanks crazyinsomniac and hacker.
Update 06/03/2002: Surprisingly, this module still works. After all the changes Google has gone through since I first released it, I would have expected it to break a long time ago, considering that it parses HTML rather than some stable format. There's an interesting story on Slashdot about googling via SOAP; maybe this is the direction this module could take in future? |
package WWW::Google;
use strict;
# Google.pm - amoe 20/01/2002
# hackish module to search google programmatically
use LWP::UserAgent;
use HTTP::Request;
use HTML::TokeParser;
use URI::Escape;
# /me apologises in advance
sub new {
    my $class = shift;
    my $self = bless {}, $class;
    my $agent_name = shift
        || "WWW-Google/0.1 ($^O; http://amoe.perlmonk.org/techno/perl/projects/www_google/)";
    my $agent = LWP::UserAgent->new;
    $agent->agent($agent_name);
    $self->{cgiloc} = ['http://www.google.com/',
                       'search'];
    $self->{place} = 0;
    $self->{agent} = $agent;
    while (my ($key, $value) = splice @_, 0, 2) {
        $self->{$key} = $value;
    }
    return $self;
}
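sub build {
    my $self = shift;
    my ($base, $script, @params) = $self->cgiloc;
    # honour an offset set via place() or new(), unless cgiloc already
    # carries an explicit start= parameter
    push @params, 'start=' . $self->place
        if $self->place && !grep { /^start=/ } @params;
    my $query = join('' => $base, $script,
                     '?', 'q=', $self->query);
    if (@params) {
        $query .= '&' . join('&', @params);
    }
    my $res = $self->agent->request(HTTP::Request->new(GET => $query));
    my $parsee = HTML::TokeParser->new(\$res->content);
    $self->parsee($parsee);
    return $res;
}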
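sub next_result {
    my $self = shift;
    # try the current page, then at most one freshly fetched page, so that
    # an empty results page ends the search instead of looping forever
    for my $attempt (1 .. 2) {
        while ($self->parsee->get_tag('p')) {
            # the tag right after <p>; undef means we ran off the page
            my $a = $self->parsee->get_tag or last;
            # skip paragraphs that do not open with a link
            next unless $a->[0] eq 'a';
            return {
                url   => $a->[1]->{href},
                title => $self->parsee->get_trimmed_text('/a'),
            };
        }
        last if $attempt == 2;
        # current page exhausted: move on to the next ten results
        $self->place($self->place + 10);
        $self->cgiloc(($self->cgiloc)[0, 1],
                      'start=' . $self->place);
        $self->build;
    }
    return;
}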
sub query {
    my $self = shift;
    if (@_) {
        $self->{query} = uri_escape(shift);
    } else {
        return $self->{query};
    }
}
sub place {
    my $self = shift;
    if (@_) {
        $self->{place} = shift;
    } else {
        return $self->{place};
    }
}
sub cgiloc {
    my $self = shift;
    if (@_) {
        $self->{cgiloc} = [@_];
    } else {
        return @{$self->{cgiloc}};
    }
}
sub parsee {
    my $self = shift;
    if (@_) {
        $self->{parsee} = shift;
    } else {
        return $self->{parsee};
    }
}
sub agent { shift->{agent} }
1;
__END__
=pod

=head1 NAME

WWW::Google - Temporary replacement for WWW::Search::Google

=head1 SYNOPSIS

    use WWW::Google;

    my $search = WWW::Google->new;
    # build up query in $q
    $search->query($q);
    $search->build;
    while (my $res = $search->next_result) {
        print $res->{url}, ': ', $res->{title}, "\n";
    }

    # optional settings; call these before build
    $search->cgiloc('http://www.google.de/', 'search'); # use German Google
    $search->place(50); # start at result 50

=head1 DESCRIPTION

This module uses the search engine Google to find websites related to a
particular term. The C<WWW::Search> modules are supposed to do this, but it
seems none of them work properly. So I decided to code up a hackish replacement
to use in the meantime. And here it is. And here are its methods:

=over 4

=item new

Returns a C<WWW::Google> object. Takes the user-agent name of the search robot
as the first argument (pass a false value such as C<undef> to keep the
default), followed by an optional list of name-value pairs to set the object
up. Possible names are C<cgiloc>, C<place> and C<query>, all of which do
basically the same job as the methods of the same names, with two exceptions:
query strings are auto-escaped by the C<query> method but passed in raw through
C<new>, and C<cgiloc> must be given to C<new> as an array reference rather
than a list.

=item build

Gets a query page and sets it up for parsing. It takes no arguments, and must
be called before C<next_result> is.

=item query

Sets the query for the object to use when C<build> gets called. If called
without an argument, returns the current query string. Queries are
automatically URI-encoded.

=item place

The result offset to start the search at. By default, it starts at the
first page of results, i.e. C<0>. Multiples of ten are probably best.

=item cgiloc

Specify a different location for C<build> to get the query result from. Takes
a base URL and a script name, which are simply concatenated, so the base URL
needs its trailing slash. Any further arguments are appended to the request as
extra query parameters. Can be used to specify national variants of Google,
presuming they use the same HTML format as the google.com one. This is
experimental.

=item next_result

Returns a reference to a hash containing two keys, C<url> and C<title>, which
hold the URL of the search result and the title of the search result. This is
what you use to get the search results. If you use this in a loop, it will
probably turn infinite because of the sheer amount of search results. You'll
have to exit it early with a C<last> or something once you hit your desired
amount of results.

=back

=head1 NOTES

THE DADDY OF WHEEL-REINVENTION!

This is almost certainly very buggy - it was written in about an hour, but it
does the job. The code looks horrible and probably runs slower than it should.

People will probably be wanting the excerpt of text Google provides. Well, I
found it was pretty hard to parse this - the problem being that some sites have
categories and some don't, so how can you judge where the text ends? Well, you
can, but I couldn't be bothered at the time. I will get around to it.

=head1 AUTHOR

Amoe. Thanks to crazyinsomniac and hacker.

=head1 CONTACT

Amoe on perlmonks.org, or email C<subvert underscore you at hotmail dot com>.

The website will be at
http://amoe.perlmonk.org/techno/perl/projects/www_google/
if I ever get it up.

=head1 COPYRIGHT

Free (substandard) software, daddy.

This program is free software. You may copy or
redistribute it under the same terms as Perl itself.

=cut
|
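The C<next_result> docs warn that a naive loop can run more or less forever and needs an early exit. Here is a minimal sketch of one way to cap it. It is not part of the module: C<Fake::Search> and C<take_results> are hypothetical names made up for illustration, and the fake class only stands in for a C<WWW::Google> object so that the sketch runs without touching the network.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WWW::Google so the sketch runs offline:
# next_result hands back an endless stream of made-up results.
package Fake::Search;
sub new { return bless { n => 0 }, shift }
sub next_result {
    my $self = shift;
    my $n = ++$self->{n};
    return { url => "http://example.com/$n", title => "Result $n" };
}

package main;

# Collect at most $max results from anything with a next_result method.
sub take_results {
    my ($search, $max) = @_;
    my @found;
    while (@found < $max) {
        my $res = $search->next_result or last;   # source ran dry
        push @found, $res;
    }
    return @found;
}

my @results = take_results(Fake::Search->new, 25);
print "$_->{url}: $_->{title}\n" for @results;
```

With the real module you would presumably pass in a C<WWW::Google> object on which C<query> and C<build> have already been called; the cap works the same way because it only relies on C<next_result>.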