Thanks :)

Actually, I've never used WWW::Mechanize, so it didn't occur to me to try that. The routine I use for scraping the data from the Monk homenodes is given below. I think the main performance hit is that I need to issue a separate HTTP request for each Monk. Ideally, I'd grab all of this information in a single go, but I'm not aware of any way to do that at present.

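use LWP::UserAgent;
use HTTP::Request;
use HTML::TokeParser;

# Fetch each monk's homenode and copy the wanted fields into $ref->{$id}.
sub get_monk_stats {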
    my $ref = shift;
    my $monk_url = 'http://www.perlmonks.org/?node_id=';

    my %monk_fields = (
        'User since:'   => 1,
        'Last here:'    => 1,
        'Experience:'   => 1,
        'Level:'        => 1,
        'Writeups:'     => 1,
    );

    # Create the user agent once and reuse it for every request.
    my $ua = LWP::UserAgent->new();

    MONK:
    foreach my $id (keys %{$ref}) {
        print "Getting data for $ref->{$id}{name} ($id)\n";
        my $req = HTTP::Request->new(GET => "$monk_url$id");
        my $result = $ua->request($req);
        # Skip any monk whose homenode could not be fetched.
        next MONK if !$result->is_success;
        my $content = $result->content;

        my $p = HTML::TokeParser->new(\$content);

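        # Each field is a label cell followed by a value cell.
        while (my $tag = $p->get_tag("td")) {
            my $text = $p->get_trimmed_text("/td");
            if ($monk_fields{$text}) {
                # Advance to the value cell; stop if the table ends early.
                $p->get_tag("td") or last;
                $ref->{$id}{$text} = $p->get_trimmed_text("/td");
            }
        }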
    }
    return $ref;
}
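
To check the parsing without hitting the site, the cell-pair extraction could be pulled out into its own routine and fed a canned string. This is only a rough sketch: it assumes each field is laid out as a label <td> followed by a value <td>, as the routine above expects, and both the name parse_monk_fields and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of the original script): pull
# label/value pairs out of an HTML string. Assumes each wanted label
# sits in a <td> that is immediately followed by a <td> with its value.
sub parse_monk_fields {
    my ($content, $wanted) = @_;
    my %found;
    my $p = HTML::TokeParser->new(\$content);

    while (my $tag = $p->get_tag("td")) {
        my $label = $p->get_trimmed_text("/td");
        next unless $wanted->{$label};
        # Advance to the value cell; stop if the markup ends early.
        $p->get_tag("td") or last;
        $found{$label} = $p->get_trimmed_text("/td");
    }
    return \%found;
}

# Made-up sample markup, not copied from a real homenode.
my $html = <<'HTML';
<table>
<tr><td>Level:</td><td>Monk (7)</td></tr>
<tr><td>Writeups:</td><td> 123 </td></tr>
<tr><td>Location:</td><td>Somewhere</td></tr>
</table>
HTML

my $stats = parse_monk_fields($html, { 'Level:' => 1, 'Writeups:' => 1 });
print "$_ $stats->{$_}\n" for sort keys %{$stats};
```

Keeping the parsing separate from the fetching also means the network side can be changed later without touching the extraction.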

In reply to Re^2: Google Earth Monks by McDarren
in thread Google Earth Monks by McDarren
