PerlMonks  

WWW::Google

by Amoe (Friar)
on Jan 21, 2002 at 03:10 UTC
Category: Web Stuff
Author/Contact Info Amoe. See pod.
Description:

Replacement for the WWW::Search::Google module. I apologise for the scrappiness of the code, but at least it works.

Thanks crazyinsomniac and hacker.

Update 06/03/2002: Surprisingly, this module still works. After all the changes that Google has gone through since I first released it, I would have expected it to break a long time ago, considering it parses HTML rather than some stable format. There's an interesting story at slashdot about googling via SOAP - maybe this is the future direction this module could take?

package WWW::Google;
use strict;
use warnings;

# Google.pm - amoe 20/01/2002
# hackish module to search google programmatically

use LWP::UserAgent;
use HTTP::Request;
use HTML::TokeParser;
use URI::Escape;

# /me apologises in advance

sub new {
    my $class = shift;
    my $self = bless {}, $class;
    
    my $agent_name = shift
    || "WWW-Google/0.1 ($^O; http://amoe.perlmonk.org/techno/perl/projects/www_google/)";
    my $agent = LWP::UserAgent->new;
    $agent->agent($agent_name);
    
    $self->{cgiloc} = ['http://www.google.com/',
                       'search'];
    $self->{place}  = 0;
    $self->{agent}  = $agent;
    
    while (my ($key, $value) = splice @_, 0, 2) {
        $self->{$key} = $value;
    }
    
    return $self;
}

sub build {
    my $self = shift;
    
    my @bits = $self->cgiloc;
    
    my $query = join('' => shift @bits, shift @bits,
                            '?', 'q=', $self->query);
    if (@bits) {
        $query .= '&' . join('&', @bits);
    }
    
    my $res = $self->agent->request(HTTP::Request->new(GET => $query));
    
    my $parsee = HTML::TokeParser->new(\$res->content);
    $self->parsee($parsee);
    
    return $res;
}

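sub next_result {
    my $self = shift;

    # Scan the current page; when it is exhausted, fetch the next page once.
    # A freshly fetched page with no results means the search has run dry.
    for my $attempt (1, 2) {
        while ($self->parsee->get_tag('p')) {
            my $link = $self->parsee->get_tag or last;
            unless ($link->[0] eq 'a') {
                # get_tag strips the token type, so restore it before ungetting.
                my ($name, @rest) = @$link;
                my $type = $name =~ s{^/}{} ? 'E' : 'S';
                $self->parsee->unget_token([$type, $name, @rest]);
                next;
            }
            return {
                url   => $link->[1]->{href},
                title => $self->parsee->get_trimmed_text('/a'),
            };
        }
        last if $attempt == 2;

        $self->place($self->place + 10);
        $self->cgiloc(($self->cgiloc)[0, 1], 'start=' . $self->place);
        $self->build;
    }

    return;
}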

sub query {
    my $self = shift;
    if (@_) {
        $self->{query} = uri_escape(shift);
    } else {
        return $self->{query};
    }
}

sub place {
    my $self = shift;
    if (@_) {
        $self->{place} = shift;
    } else {
        return $self->{place};
    }
}

sub cgiloc {
    my $self = shift;
    if (@_) {
        $self->{cgiloc} = [@_];
    } else {
        return @{$self->{cgiloc}};
    }
}

sub parsee {
    my $self = shift;
    if (@_) {
        $self->{parsee} = shift;
    } else {
        return $self->{parsee};
    }
}

sub agent { shift->{agent} }

1;

__END__

=pod

=head1 NAME

WWW::Google - Temporary replacement for WWW::Search::Google

=head1 SYNOPSIS

 use WWW::Google;

 my $search = WWW::Google->new;

 # build up query in $q

 $search->query($q);
 $search->build;

 while (my $res = $search->next_result) {
     print $res->{url}, ': ', $res->{title}, "\n";
 }

 # the base URL needs its trailing slash: build joins the pieces as-is
 $search->cgiloc('http://www.google.de/', 'search');    # use German Google
 $search->place(50);    # start at result 50

=head1 DESCRIPTION

This module uses the search engine Google to find websites related to a
particular term. The C<WWW::Search> modules are supposed to do this, but it
seems none of them work properly. So I decided to code up a hackish replacement
to use in the meantime. And here it is. And here are its methods:

=over 4

=item new

Returns a C<WWW::Google> object. Takes the name of the search robot as the
first argument, followed by an optional list of name-value pairs to set the
object up. Possible names are cgiloc, place and query, all of which perform
basically the same task as the methods of the same names, with two exceptions:
query strings are autoescaped by the C<query> method, whereas they're passed in
raw if you use the C<new> interface, and cgiloc must be given to C<new> as an
array reference rather than a list.

=item build

Gets a query page and sets it up for parsing. It takes no arguments, and must
be called before C<next_result> is.

=item query

Sets the query for the object to use when C<build> gets called. If called
without argument, returns the current query string. Queries are automatically
URI-encoded.

=item place

The result offset at which to start the search. By default, it starts at the
first page of results, i.e. C<0>. Multiples of ten are probably best.

=item cgiloc

Specify a different location for C<build> to get the query result from. Can be
used to specify national variants of Google, presuming they use the same HTML
format as the google.com one. The base URL and the script name are joined
as-is, so the base URL must end in a slash. This is experimental.

=item next_result

Returns a reference to a hash containing two keys, C<url> and C<title>, which
contain the address of the search result and the title of the search result.
This is what you use to get the search results. If you use this in a loop, it
will probably run practically forever because of the sheer number of search
results. You'll have to exit it early with a C<last> or something once you hit
your desired number of results.

=back

=head1 NOTES

THE DADDY OF WHEEL-REINVENTION!

This is almost certainly very buggy - it was written in about an hour, but it
does the job. The code looks horrible and probably runs slower than it should.

People will probably be wanting the excerpt of text Google provides. Well, I
found it was pretty hard to parse this - the problem being that some sites have
categories and some don't, so how can you judge where the text ends? Well, you
can, but I couldn't be bothered at the time. I will get around to it.

=head1 AUTHOR

Amoe. Thanks to crazyinsomniac and hacker.

=head1 CONTACT

Amoe on perlmonks.org.

or email C<subvert underscore you at hotmail dot com>.

The website will be at

 http://amoe.perlmonk.org/techno/perl/projects/www_google/

if I ever get it up.

=head1 COPYRIGHT

Free (substandard) software, daddy.

This program is free software. You may copy or
redistribute it under the same terms as Perl itself.

=cut
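
The extraction step in next_result (a <p> tag immediately followed by an <a> tag) and the advice to stop after a fixed number of results can be tried without touching the network. What follows is only a sketch under assumptions: the sample HTML is made up, extract_results is a hypothetical helper that is not part of the module, and real Google markup differs and changes often. Unlike the module, it simply discards a non-link tag instead of ungetting it.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample page, standing in for a fetched result page.
my $html = <<'HTML';
<p><a href="http://www.perl.org/">The Perl Directory</a><br>
<p><font size="-1">not a result</font>
<p><a href="http://www.perlmonks.org/">PerlMonks</a><br>
HTML

# Collect at most $limit results: a <p> immediately followed by an <a>.
sub extract_results {
    my ($page, $limit) = @_;
    my $parser = HTML::TokeParser->new(\$page);
    my @results;

    while (@results < $limit and $parser->get_tag('p')) {
        my $link = $parser->get_tag or last;
        next unless $link->[0] eq 'a';    # simplification: drop non-link tags
        push @results, {
            url   => $link->[1]{href},
            title => $parser->get_trimmed_text('/a'),
        };
    }
    return @results;
}

printf "%s: %s\n", $_->{url}, $_->{title} for extract_results($html, 10);
```

The limit plays the role of the early "last" that the pod recommends for loops over next_result.
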
Replies are listed 'Best First'.
Re: WWW::Google
by IlyaM (Parson) on Jan 21, 2002 at 03:37 UTC
    IIRC google can return search results in XML. It could be slightly easier and less error-prone to parse than HTML, which can change any day.

    --
    Ilya Martynov (http://martynov.org/)

      Definitely, it would be preferable to do that. I put a little research into the topic and couldn't find anything - Mostly only searched their "Services" page though.

      That would be much better, I could solve some parsing problems...*checks*



      --
      my one true love
        This PDF file mentions that you can use HTTP requests like
        GET http://google.com/xml?q=YOUR_QUERY_HERE
        to get search results in XML.

        But I've just checked it again and it seems it doesn't work anymore :(

        --
        Ilya Martynov (http://martynov.org/)
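
Whichever endpoint is used, the query has to be URI-escaped before it goes into the GET URL. The sketch below mirrors how the module's build method appears to assemble its request URL: base and script name joined as-is, then q= with the escaped query, then any extra parameters. build_url is a hypothetical helper for illustration, not a method of the module.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Base and script are concatenated without a separator, so the base
# URL has to carry its own trailing slash.
sub build_url {
    my ($base, $script, $query, @extra) = @_;
    my $url = $base . $script . '?q=' . uri_escape($query);
    $url .= '&' . join('&', @extra) if @extra;
    return $url;
}

print build_url('http://www.google.com/', 'search', 'perl monks', 'start=10'), "\n";
```

Passing 'http://www.google.de' without the trailing slash would yield 'http://www.google.desearch?...', which is why the base URL is written with the slash.
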

Re: WWW::Google
by Amoe (Friar) on Mar 19, 2002 at 16:33 UTC

    Having seen hossman's enlightening node, I suppose I'd better disclaim this module. It probably goes without saying that I didn't read the Google TOS. I agree with hossman on this issue; the existence of this code isn't a violation of the TOS. If you're paranoid, you may wish to change the user-agent string it sends. Use at your own risk, and stuff.

    And as for the TOS itself, if it wasn't serious it would be funny. For those worried, I think there isn't much chance that Google will sue you for using this module. I think that the TOS itself is overly harsh; as hossman noted, you could call all sorts of things automated searching. For example, I search by typing a phrase into my location bar in Mozilla and clicking a search button. Because I didn't enter the phrase in the textbox, does that make it illegal?

    As ever, IANAL. God, I hate the legal system.


    --
    my one true love
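
Changing the user-agent string, as suggested in the reply above, comes down to one call on the LWP::UserAgent object. This is a minimal sketch, and 'MySearchTool/0.1' is a made-up name used only for illustration. With the module itself, the same string would be passed as the first argument to WWW::Google->new.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# 'MySearchTool/0.1' is a hypothetical agent name, not one the module uses.
my $agent = LWP::UserAgent->new;
$agent->agent('MySearchTool/0.1');

# Every request sent through $agent now carries this User-Agent header.
print $agent->agent, "\n";
```
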
