And now a second bit of help, possibly a much bigger bit than the previous one.

I'm not familiar with HTML::LinkExtor, and I really don't use LWP::UserAgent these days either, so I wrote something taking advantage of my personal favorite for anything webpage-related: WWW::Mechanize.

I also never quite understood your original algorithm. If it were me (and in this case it is), I'd keep track of URLs (and weed out duplicates) for a given link depth on my own, in my own data structure, as opposed to inserting things into a database and fetching them back out to re-crawl them.

I'm also not clear on your specs as to whether you want URLs that are off-site. The logic for the way this program handles that is pretty clearly documented, so if it isn't to your spec, adjust it.

Having said all that, here is a recursive link crawler. (Though now that I type out "recursive link crawler", I can't imagine that this hasn't been done before, and I'm certain a search would turn one up fairly quickly. Oh well.)

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $url = shift || die "Please pass in base url as argument to $0\n";

my %visited;
my @links;
my $max_depth = 3;
my $depth     = 0;
my $mech      = WWW::Mechanize->new();

# This helps prevent following off-site links.
# Note: assumes that urls passed in will represent the
# highest level in a website hierarchy that will be visited,
# i.e. http://www.example.com/dir/ will record a link to
# http://www.example.com/, but will not follow it and report
# subsequent links.
my( $base_uri ) = $url =~ m|^(.*/)|;

get_links( $url );

sub get_links {
    my @urls = @_;
    my @found_links;
    for( @urls ){
        # This prevents following off-site or off-parent links.
        next unless m/^\Q$base_uri\E/;
        $mech->get( $_ );
        # Filter out links we've already visited, plus mailto: and
        # javascript: hrefs. Adjust to suit.
        my @new_links = grep { ++$visited{$_} == 1 && ! /^(mailto|javascript)/i }
                        map  { $_->url_abs() } $mech->links();
        push @found_links, @new_links;
        push @links,       @new_links;
    }
    # Keep going, as long as we should.
    get_links( @found_links ) if $depth++ < $max_depth;
}

# Instead of printing them, you could insert them into the database.
print $_ . "\n" for @links;

Inserting the links into a database is left as an exercise for the reader.
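That said, if you do want to push @links into MySQL rather than print them, a minimal DBI sketch might look something like the following. The database name, table name (links), column (url), and credentials here are placeholders of my own, not anything from your setup, so adjust them to your actual schema; it also assumes a UNIQUE (or PRIMARY KEY) index on the url column so that INSERT IGNORE can quietly skip duplicates.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder list of urls; in the crawler above this would be @links.
my @links = @ARGV;

# Placeholder connection details -- adjust database, host, user, password.
my $dbh = DBI->connect(
    'DBI:mysql:database=crawler;host=localhost',
    'username', 'password',
    { RaiseError => 1, AutoCommit => 1 },
);

# Assumes a table along the lines of:
#   CREATE TABLE links (url VARCHAR(255) PRIMARY KEY);
# INSERT IGNORE skips urls that are already stored.
my $sth = $dbh->prepare('INSERT IGNORE INTO links (url) VALUES (?)');
$sth->execute( "$_" ) for @links;

$dbh->disconnect;

Letting the database enforce uniqueness this way also means the %visited bookkeeping could double as a sanity check rather than the only line of defense against re-inserting the same link.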



--chargrill
s**lil*; $*=join'',sort split q**; s;.*;grr; &&s+(.(.)).+$2$1+; $; = qq-$_-;s,.*,ahc,;$,.=chop for split q,,,reverse;print for($,,$;,$*,$/)

In reply to Re^5: How to extract links from a webpage and store them in a mysql database by chargrill
in thread How to extract links from a webpage and store them in a mysql database by syedahmed.uos
