note
BazB
<p>
Here's my fairly simple attempt (it doesn't use files and only handles a single page). Adjust to taste:
<code>
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use HTML::SimpleLinkExtor;
sub grab_links {
    my ( $ua, $url ) = @_;
    my @links;
    my $response = $ua->get($url);
    if ( $response->is_success ) {
        my $extor   = HTML::SimpleLinkExtor->new();
        my $content = $response->content;
        $extor->parse($content);
        # href values from the <a> tags, as written in the page.
        # Check the docs: these can be relative paths.
        @links = $extor->a;
    } else {
        die $response->status_line;
    }
    return @links;
}
my $visit = $ARGV[0] or die "Usage: $0 URL\n";
my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@foo.com'); # Change this to suit.
$ua->delay( 0.1 ); # delay is in minutes: 0.1 means 6 seconds between requests to the same server
my @links = grab_links($ua, $visit);
my %uniq;
foreach ( @links ) {
    $uniq{$_}++;
}
print "Visited: ", $visit, " found these links:\n", join( "\n", keys %uniq), "\n";
</code>
</p>
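<p>
Since the links can come back as relative paths, here is a minimal sketch of one way to turn them into absolute URLs and drop duplicates before printing. It assumes [CPAN://URI] is installed (it normally comes along with LWP); the <tt>absolute_links</tt> helper and the example.com values are made up for illustration and are not part of the script above:
<code>
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve each href against the page URL and
# drop duplicates, keeping the order in which links first appear.
sub absolute_links {
    my ( $base, @hrefs ) = @_;
    my %seen;
    return grep { !$seen{$_}++ }
           map  { URI->new_abs( $_, $base )->canonical->as_string } @hrefs;
}

# Made-up input, standing in for $visit and the list from grab_links().
my $base  = 'http://www.example.com/docs/index.html';
my @hrefs = ( 'intro.html', '/faq.html', 'intro.html' );

print "$_\n" for absolute_links( $base, @hrefs );
</code>
In the script above you would call it as <tt>absolute_links( $visit, @links )</tt>, which would also replace the <tt>%uniq</tt> loop.
</p>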
<p>
<b>Update:</b> This code was posted after talking to [mkurtis] in the Chatterbox. It appears to do most of what [mkurtis] is after, so I'm leaving it here for future reference.<br>
Most of the code was taken straight from the docs for [CPAN://HTML::SimpleLinkExtor], [CPAN://LWP::RobotUA] and [CPAN://LWP::UserAgent].
</p>
<p>
This is the first time I've used any of those modules and it was quite cool :-)
</p>