package WWW::Google;
use strict;

# Google.pm - amoe 20/01/2002
# hackish module to search google programmatically

use LWP::UserAgent;
use HTTP::Request;
use HTML::TokeParser;
use URI::Escape;

# /me apologises in advance

sub new {
    my $class = shift;
    my $self = bless {}, $class;
    
    my $agent_name = shift
    || "WWW-Google/0.1 ($^O; http://amoe.perlmonk.org/techno/perl/projects/www_google/)";
    my $agent = LWP::UserAgent->new;
    $agent->agent($agent_name);
    
    $self->{cgiloc} = ['http://www.google.com/',
                       'search'];
    $self->{place}  = 0;
    $self->{agent}  = $agent;
    
    while (my ($key, $value) = splice @_, 0, 2) {
        $self->{$key} = $value;
    }
    
    return $self;
}

sub build {
    my $self = shift;
    
    my @bits = $self->cgiloc;
    
    my $query = join('' => shift @bits, shift @bits,
                            '?', 'q=', $self->query);
    if (@bits) {
        $query .= '&' . join('&', @bits);
    }
    
    my $res = $self->agent->request(HTTP::Request->new(GET => $query));
    
    my $parsee = HTML::TokeParser->new(\$res->content);
    $self->parsee($parsee);
    
    return $res;
}

sub next_result {
    my $self = shift;
    my $result = {};
    
    while (!%$result) {
        while (my $tag = $self->parsee->get_tag('p')) {
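            # Named $anchor rather than $a, which is sort's special variable.
            my $anchor = $self->parsee->get_tag;
            last unless defined $anchor;    # ran out of tokens on this page
            unless ($anchor->[0] eq 'a') {
                $self->parsee->unget_token($anchor);
                next;
            }
            $result->{url}   = $anchor->[1]->{href};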
            $result->{title} = $self->parsee->get_trimmed_text('/a');
            
            return $result;
        }
    } continue {
        $self->place($self->place + 10);
        
        $self->cgiloc(($self->cgiloc)[0, 1],
                       'start=' . $self->place);
                      
        $self->build;
    }
}

sub query {
    my $self = shift;
    if (@_) {
        $self->{query} = uri_escape(shift);
    } else {
        return $self->{query};
    }
}

sub place {
    my $self = shift;
    if (@_) {
        $self->{place} = shift;
    } else {
        return $self->{place};
    }
}

sub cgiloc {
    my $self = shift;
    if (@_) {
        $self->{cgiloc} = [@_];
    } else {
        return @{$self->{cgiloc}};
    }
}

sub parsee {
    my $self = shift;
    if (@_) {
        $self->{parsee} = shift;
    } else {
        return $self->{parsee};
    }
}

sub agent { shift->{agent} }

1;

__END__

=pod

=head1 NAME

WWW::Google - Temporary replacement for WWW::Search::Google

=head1 SYNOPSIS

 use WWW::Google;

 my $search = WWW::Google->new;

 # build up query in $q

 $search->query($q);
 $search->build;

 while (my $res = $search->next_result) {
     print $res->{url}, ': ', $res->{title}, "\n";
 }

 $search->cgiloc('http://www.google.de/', 'search');    # use German Google
 $search->place(50);    # set the result offset (not a page number)

=head1 DESCRIPTION

This module uses the Google search engine to find websites related to a
particular term. The C<WWW::Search> modules are supposed to do this, but it
seems none of them work properly, so I coded up a hackish replacement to use
in the meantime. Here are its methods:

=over 4

=item new

Returns a C<WWW::Google> object. Takes the name of the search robot (the
User-Agent string) as the first argument, followed by an optional list of
name-value pairs to set the object up. Possible names are C<cgiloc>, C<place>
and C<query>, each of which does basically the same job as the method of the
same name, with two differences: C<cgiloc> must be given as an array
reference, and query strings are passed in raw here, whereas the C<query>
method escapes them automatically.

=item build

Fetches a results page and sets it up for parsing. It takes no arguments and
must be called before C<next_result>.

=item query

Sets the query for the object to use when C<build> is called. If called
without an argument, returns the current (already escaped) query string.
Queries are automatically URI-escaped.

=item place

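The result offset at which to start the search. By default it is C<0>, i.e.
the first page of results. Multiples of ten are probably best. As the code
stands, C<build> does not send this offset itself; C<next_result> adds ten to
it and sends it when it fetches the next page.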

=item cgiloc

Specifies a different location for C<build> to fetch results from, as a base
URL (with a trailing slash) and a script name. Can be used to select national
variants of Google, presuming they use the same HTML format as google.com.
This is experimental.

=item next_result

Returns a hash reference with two keys, C<url> and C<title>, holding the URL
and the title of the search result. This is what you use to get the search
results. If you use it in a loop, the loop will probably never end because of
the sheer number of search results, so exit early with C<last> or similar
once you have as many results as you want.

=back

=head1 NOTES

THE DADDY OF WHEEL-REINVENTION!

This is almost certainly very buggy. It was written in about an hour, but it
does the job. The code looks horrible and probably runs slower than it should.

People will probably want the excerpt of text Google provides. I found this
pretty hard to parse: some sites have categories and some don't, so how can
you judge where the text ends? Well, you can, but I couldn't be bothered at
the time. I will get around to it.

=head1 AUTHOR

Amoe. Thanks to crazyinsomniac and hacker.

=head1 CONTACT

Amoe on perlmonks.org.

or email C<subvert underscore you at hotmail dot com>.

The website will be at

 http://amoe.perlmonk.org/techno/perl/projects/www_google/

if I ever get it up.

=head1 COPYRIGHT

Free (substandard) software, daddy.

This program is free software. You may copy or
redistribute it under the same terms as Perl itself.

=cut
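
Since C<next_result> never signals the end of the results, any loop over it needs its own stopping rule. Here is a minimal sketch of one way to cap it. Note that C<take_results> and C<My::FakeSearch> are made-up names for illustration, not part of the module; the fake class only stands in for a real C<WWW::Google> object so the snippet runs without touching the network.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a WWW::Google object: it hands out numbered
# results forever, much as next_result may on a live query.
package My::FakeSearch;

sub new {
    my $class = shift;
    return bless { n => 0 }, $class;
}

sub next_result {
    my $self = shift;
    my $n = ++$self->{n};
    return { url => "http://example.com/$n", title => "Result $n" };
}

package main;

# Hypothetical helper (not part of WWW::Google): pull at most $max
# results from anything that has a next_result method, then stop.
sub take_results {
    my ($search, $max) = @_;
    my @found;
    while (@found < $max) {
        my $res = $search->next_result or last;
        push @found, $res;
    }
    return @found;
}

my @results = take_results(My::FakeSearch->new, 25);
print "$_->{url}: $_->{title}\n" for @results;
```

With the real module you would presumably pass in the object returned by C<new> after calling C<query> and C<build> on it, in place of the fake one.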
