Re^3: How to extract links from a webpage and store them in a mysql database

by g0n (Priest)
on Dec 06, 2006 at 12:36 UTC


in reply to Re^2: How to extract links from a webpage and store them in a mysql database
in thread How to extract links from a webpage and store them in a mysql database

Step one is probably to write an algorithm to do what you want. Something like this perhaps:

  • Create your database table with columns for 'link', 'depth', 'read'
  • Read the first page and store the base URL
  • For each link in the page, compare its base to the original base URL
  • If they match, add it to the DB with depth 2 and read 'no'
  • For each entry in the table where read eq 'no', read the page, set read to 'yes', and compare each link base to the original base URL
  • If they match, add it to the DB with depth 3 and read 'no'
  • Repeat the last two steps, setting depth to 4 (i.e. a link found at depth 3)
  • End
You could end when you don't find any entries in the DB with depth <= 3 and read eq 'no'; that way it's easy to modify if you decide to read deeper. A rough sketch of that loop is below.
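
Purely as illustration, here's a rough, untested sketch of that loop using WWW::Mechanize to fetch pages and DBI for the table. The table name ('links'), the column names (the 'read' column is called has_read here, since READ is a reserved word in MySQL), and the connection details are all assumptions, so adjust them to your own schema:

  #!/usr/bin/perl
  # Rough sketch only -- table layout, DSN and credentials are assumptions.
  use strict;
  use warnings;
  use DBI;
  use WWW::Mechanize;

  my $start = shift or die "usage: $0 <url>\n";
  my ($base) = $start =~ m{^(https?://[^/]+)}i;    # the original base URL

  my $dbh  = DBI->connect('DBI:mysql:database=gatxp', 'user', 'password',
                          { RaiseError => 1 });
  my $mech = WWW::Mechanize->new(autocheck => 0);

  # Seed the table: the starting page is depth 1 and unread
  $dbh->do(q{INSERT INTO links (link, depth, has_read) VALUES (?, ?, 'no')},
           undef, $start, 1);

  # Keep reading while anything unread remains at depth <= 3
  while (my ($link, $depth) = $dbh->selectrow_array(
          q{SELECT link, depth FROM links
            WHERE has_read = 'no' AND depth <= 3 LIMIT 1})) {

      $dbh->do(q{UPDATE links SET has_read = 'yes' WHERE link = ?}, undef, $link);

      $mech->get($link);
      next unless $mech->success;

      for my $found (map { $_->url_abs . '' } $mech->links) {
          next unless $found =~ m{^\Q$base\E};     # same base URL only
          my ($seen) = $dbh->selectrow_array(
              q{SELECT COUNT(*) FROM links WHERE link = ?}, undef, $found);
          $dbh->do(q{INSERT INTO links (link, depth, has_read) VALUES (?, ?, 'no')},
                   undef, $found, $depth + 1) unless $seen;
      }
  }

Links found at depth 3 get stored at depth 4 but are never fetched, which matches the end condition above.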

--------------------------------------------------------------

"If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing."
John Brunner, "The Shockwave Rider".


Re^4: How to extract links from a webpage and store them in a mysql database
by syedahmed.uos (Novice) on Dec 11, 2006 at 13:51 UTC
    1) create database with three columns for 'depth level', 'link', 'read'
    2) depth level = 0, base URL
    3) while $#array_urls
    4) write to database ("depth level, $array_url")
    5) $depth++
    6) trigger the db
    7) read each link from the database and extract further links found, to the database
    8) until depth level = 3
    9) stop!
Re^4: How to extract links from a webpage and store them in a mysql database
by syedahmed.uos (Novice) on Dec 18, 2006 at 18:37 UTC
    I have to write a program that will repeatedly extract links to a depth level of 3. That is, after extracting links the first time I store the links in the MySQL database, then I fetch these links and iterate the function over each link to extract more links, so I have to do this three times. I have stored and then fetched the links from the database, but I am not able to run the function on each link and store the results back to the database. The code is shown below.
    #!/usr/bin/perl
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI::URL;
    $url =<>;; # for instance
    #my $depth = 0;
    my @link =();
    my $ua = LWP::UserAgent->new;

    # Set up a callback that collect links
    my @a = ();
    sub callback {
        my($tag, %attr) = @_;
        return if $tag ne 'a';
        push(@a, values %attr);
    }

    # Make the parser. Unfortunately, we don't know the base yet (it might be
    # diffent from $url)
    my $p = HTML::LinkExtor->new(\&callback);
    my $res = $ua->request(HTTP::Request->new(GET => $url),
                           sub { $p->parse($_[0]) });

    # Expand all image URLs to absolute ones
    my $base = $res->base;
    @a = map { $_ = url($_, $base)->abs; } @a;

    # Print them out
    print join("\n", @a), "\n";

    use strict;
    use DBI();
    my $dbh = DBI->connect("DBI:mysql:database=gatxp;host="","","");
    #$dbh->do("CREATE TABLE newlinks (md5 INTEGER(100) not null primary key, webpage VARCHAR(80) not null)");
    $dbh->do("INSERT INTO newlinks VALUES ('MD5','0','$base','1')");
    foreach $a(@a){
        $dbh->do("INSERT INTO newlinks VALUES ('','1','$a','0')");
    }
    my $sth = $dbh->prepare('SELECT * FROM newlinks')
        or die "Couldn't prepare statement: " . $dbh->errstr;
    $sth->execute();
    while( my $ref = $sth->fetchrow_hashref()){
        my $link=$ref->{'webpage'};
        foreach $link(@link){
            my $usa = LWP::UserAgent->new;
            $p = HTML::LinkExtor->new(\&callback);
            my $res = $usa->request(HTTP::Request->new(GET => $link),
                                    sub {$p->parse($_[0])});
            $base = $res->base;
            @link = map { $_ = url($_, $base)->abs; } @link;
            # Print them out
            print "$$link\n";
    $sth->finish();
    $dbh->disconnect();

      Here's your first bit of help: I've added use strict; and use warnings; near the top and fixed the resultant errors and warnings. I've run your code through perltidy (a few times) and fixed up some other errors that turned up. I fixed a few quoting and commenting issues, possibly introduced by cut-and-paste errors, and I fixed your broken database connection string. I've also terminated your while and foreach loops (near the end), which may or may not be the right spot to terminate them - I can't actually fully run your code, since I don't have all the modules installed or a database handy at the moment.

      I have run it once and passed it a base URL, it spit out a few of the links on the page, so I suppose it's doing something properly.

      Try running this modified version and see what happens. If you make any changes, please format it for readability before posting again.

      #!/usr/bin/perl
      use strict;
      use warnings;

      use LWP::UserAgent;
      use HTML::LinkExtor;
      use URI::URL;
      use DBI();

      my $url = <>;    # for instance
      #my $depth = 0;
      my @link = ();
      my $ua = LWP::UserAgent->new;

      # Set up a callback that collect links
      my @a = ();
      sub callback {
          my( $tag, %attr ) = @_;
          return if $tag ne 'a';
          push( @a, values %attr );
      }

      # Make the parser. Unfortunately, we don't know the base yet (it might be
      # diffent from $url)
      my $p = HTML::LinkExtor->new( \&callback );
      my $res = $ua->request( HTTP::Request->new( GET => $url ),
                              sub { $p->parse( $_[0] ) } );

      # Expand all image URLs to absolute ones
      my $base = $res->base;
      @a = map { $_ = url( $_, $base )->abs; } @a;

      # Print them out
      print join( "\n", @a ), "\n";

      my $dbh = DBI->connect( "DBI:mysql:database=gatxp;host=\"\"", "", "" );
      #$dbh->do( "CREATE TABLE newlinks( md5 INTEGER(100) not null "
      #    . "primary key, webpage VARCHAR(80) not null)" );
      $dbh->do("INSERT INTO newlinks VALUES( 'MD5', '0', '$base', '1' )");
      foreach $a (@a) {
          $dbh->do("INSERT INTO newlinks VALUES( '', '1', '$a', '0' )");
      }
      my $sth = $dbh->prepare('SELECT * FROM newlinks')
          or die "Couldn't prepare statement: " . $dbh->errstr;
      $sth->execute();

      while ( my $ref = $sth->fetchrow_hashref() ) {
          my $link = $ref->{'webpage'};
          foreach $link (@link) {
              my $usa = LWP::UserAgent->new;
              $p = HTML::LinkExtor->new( \&callback );
              my $res = $usa->request( HTTP::Request->new( GET => $link ),
                                       sub { $p->parse( $_[0] ) } );
              $base = $res->base;
              @link = map { $_ = url( $_, $base )->abs; } @link;
              # Print them out
              print "$$link\n";
          }
      }
      $sth->finish();
      $dbh->disconnect();

      HTH



      --chargrill
      s**lil*; $*=join'',sort split q**; s;.*;grr; &&s+(.(.)).+$2$1+; $; = qq-$_-;s,.*,ahc,;$,.=chop for split q,,,reverse;print for($,,$;,$*,$/)
        Thanks, but mistakenly you have sent me the same code I sent. Also, I am able to run code which extracts links from a webpage; the code I sent you was to extract further links from the extracted links. This should be to a depth level of three. Regards

      And now a second bit of help, possibly a lot bigger of a bit than previously.

      I'm not familiar with HTML::LinkExtor, and I really don't use LWP::UserAgent these days either, so I wrote something taking advantage of my personal favorite for anything webpage related, WWW::Mechanize.

      I also never quite understood your original algorithm. If it were me (and in this case it is), I'd keep track of URLs (weeding out duplicates) for a given link depth on my own, in my own data structure, as opposed to inserting things into a database and fetching them back out to re-crawl them.

      I'm also not clear on your specs as to whether or not you want urls that are off-site. The logic for the way this program handles that is pretty clearly documented, so if it isn't to your spec, adjust it.

      Having said all that, here is a recursive link crawler. (Though now that I type out "recursive link crawler", I can't imagine that this hasn't been done before, and I'm certain a search would turn one up fairly quickly. Oh well.)

      #!/usr/bin/perl
      use strict;
      use warnings;
      use WWW::Mechanize;

      my $url = shift || die "Please pass in base url as argument to $0\n";
      my %visited;
      my @links;
      my $max_depth = 3;
      my $depth     = 0;
      my $mech      = WWW::Mechanize->new();

      # This helps prevent following off-site links.
      # Note, assumes that url's passed in will represent the
      # highest level in a website hierarchy that will be visited.
      # i.e. http://www.example.com/dir/ will record a link to
      # http://www.example.com/, but will not follow it and report
      # subsequent links.
      my( $base_uri ) = $url =~ m|^(.*/)|;

      get_links( $url );

      sub get_links {
          my @urls = @_;
          my @found_links;
          for( @urls ){
              # This prevents following off-site or off-parent links.
              next unless m/^$base_uri/;
              $mech->get( $_ );
              # Filters out links we've already visited, plus mailto's and
              # javascript:etc hrefs. Adjust to suit.
              @found_links = grep { ++$visited{$_} == 1 && ! /^(mailto|javascript)/i }
                             map  { $_->url_abs() } $mech->links();
              push @links, @found_links;
          }
          # Keep going, as long as we should.
          get_links( @found_links ) if $depth++ < $max_depth;
      }

      # Instead of printing them, you could insert them into the database.
      print $_ . "\n" for @links;

      Inserting the links into a database is left as an exercise for the reader.
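
      For what it's worth, that exercise might look something like the following, swapping the final print loop for DBI inserts. The DSN, the credentials, and the one-column 'crawled_links' table are assumptions here, not something the crawler above requires:

      use DBI;

      # Hypothetical table: CREATE TABLE crawled_links (webpage VARCHAR(255) PRIMARY KEY)
      my $dbh = DBI->connect( 'DBI:mysql:database=gatxp;host=localhost',
                              'user', 'password', { RaiseError => 1 } );
      my $ins = $dbh->prepare('INSERT IGNORE INTO crawled_links (webpage) VALUES (?)');

      # Bind values with placeholders rather than interpolating them into the SQL,
      # which sidesteps the quoting problems in the earlier code.
      $ins->execute( "$_" ) for @links;

      $dbh->disconnect();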



      --chargrill
      s**lil*; $*=join'',sort split q**; s;.*;grr; &&s+(.(.)).+$2$1+; $; = qq-$_-;s,.*,ahc,;$,.=chop for split q,,,reverse;print for($,,$;,$*,$/)
        Hello, wish you a happy new year, and thanks for the help!!! I just want to ask: when I set the $max_depth variable to 3 or 2, it gives me the same output.
