Re: How to extract links from a webpage and store them in a mysql database

by g0n (Priest)
on Dec 05, 2006 at 12:53 UTC


in reply to How to extract links from a webpage and store them in a mysql database

Hi syedahmed.uos,

Running your code and passing it a single URL, I get a single row in the table, with the base URL in the 'webpage' column and a space-separated set of fully qualified links in the 'links' column.

You say that you want to have each link in a separate row of the table. To do that, you'll need to iterate over the list of links. Something like this:

for my $link (@a) {
    $dbh->do("INSERT INTO htmllinks VALUES ('$base', '$link')");
}
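As an aside, DBI placeholders will handle the quoting for you, which matters as soon as a link contains an apostrophe. The same loop in that style:

for my $link (@a) {
    $dbh->do("INSERT INTO htmllinks VALUES (?, ?)", undef, $base, $link);
}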

Just making that change won't work because you've set 'webpage' to be the primary key, so you'll get an error because the value for 'webpage' is the same for every row. If you take the 'primary key' out of your table create, and put the loop in as above, you'll get something like this:

+------------------+-----------------------------+
| webpage          | htmllinks                   |
+------------------+-----------------------------+
| http://myurl.com | http://myurl.com/link_one/  |
| http://myurl.com | http://myurl.com/link_two/  |
+------------------+-----------------------------+
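For reference, the table create without the primary key might be something like this (a sketch only; I'm guessing at your column names and sizes):

$dbh->do("CREATE TABLE htmllinks (webpage VARCHAR(80), links VARCHAR(255))");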

Is that what you're looking for?

--------------------------------------------------------------

"If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing."
John Brunner, "The Shockwave Rider".


Re^2: How to extract links from a webpage and store them in a mysql database
by syedahmed.uos (Novice) on Dec 05, 2006 at 14:34 UTC
    Thanks for the reply. What I really want is to extract the links from the links on the main webpage. There should be some way to restrict the crawler to links inside the original domain, or you could potentially head out and start crawling the entire web! The depth limit should be set to 3, i.e., extract the links, repeat the process with each of those links down to three levels, and store all the extracted links in one column in the database. Thanks.
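    For the domain restriction, comparing each link's host against the starting page's host is probably enough. A minimal sketch (the helper name here is made up, not from the code in this thread):

    use URI;

    # True if $link points at the same host as $base_url.
    # Non-http links (mailto:, javascript:) have no host, so guard with eval.
    sub same_site {
        my ($link, $base_url) = @_;
        my $host = eval { URI->new($link)->host } or return 0;
        return $host eq URI->new($base_url)->host;
    }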
      Step one is probably to write an algorithm to do what you want. Something like this perhaps:

      • Create your database table with columns for 'link', 'depth', and 'read'
      • Read the first page and store the base URL
      • For each link in the page, compare its base to the original base URL
      • If they match, add it to the DB with depth 2 and read 'no'
      • For each entry in the table where read eq 'no', read the page, set read to 'yes', and compare each link's base to the original base URL
      • If they match, add it to the DB with depth 3 and read 'no'
      • Repeat the last two steps, setting depth to 4 (i.e. a link found at depth 3)
      • End
      You could end when you don't find any entries in the db with depth <= 3 and read eq 'no'; that way it's easy to modify if you decide to read deeper. A sketch of that loop is below.
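      Untested, and the table name, column names and connection details are stand-ins you'd replace with your own, but the whole algorithm might look something like this:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use LWP::UserAgent;
      use HTTP::Request;
      use HTML::LinkExtor;
      use URI;
      use DBI;

      # Hypothetical connection details - substitute your own
      my $dbh = DBI->connect("DBI:mysql:database=gatxp;host=localhost",
                             "user", "password", { RaiseError => 1 });

      # One row per link: the URL, the depth it was found at, and
      # whether we've fetched it yet
      $dbh->do("CREATE TABLE IF NOT EXISTS crawl
                (link VARCHAR(255), depth INT, isread CHAR(3))");

      my $start = shift @ARGV or die "usage: $0 <url>\n";
      my $start_host = URI->new($start)->host;

      # Seed the table with the starting page at depth 1, unread
      $dbh->do("INSERT INTO crawl VALUES (?, 1, 'no')", undef, $start);

      my $ua = LWP::UserAgent->new;

      # Keep going while there is an unread page at depth <= 3
      while (my ($link, $depth) = $dbh->selectrow_array(
                 "SELECT link, depth FROM crawl
                  WHERE isread = 'no' AND depth <= 3 LIMIT 1")) {

          $dbh->do("UPDATE crawl SET isread = 'yes' WHERE link = ?",
                   undef, $link);

          # Collect the href of every <a> tag on the page
          my @found;
          my $p = HTML::LinkExtor->new(sub {
              my ($tag, %attr) = @_;
              push @found, $attr{href} if $tag eq 'a' && defined $attr{href};
          });
          my $res = $ua->request(HTTP::Request->new(GET => $link),
                                 sub { $p->parse($_[0]) });
          next unless $res->is_success;

          for my $href (@found) {
              my $abs  = URI->new_abs($href, $res->base);
              my $host = eval { $abs->host } or next;  # skip mailto: etc.
              next unless $host eq $start_host;        # stay inside the domain
              my ($seen) = $dbh->selectrow_array(
                  "SELECT 1 FROM crawl WHERE link = ?", undef, "$abs");
              next if $seen;                           # don't queue duplicates
              $dbh->do("INSERT INTO crawl VALUES (?, ?, 'no')",
                       undef, "$abs", $depth + 1);
          }
      }

      $dbh->disconnect;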

      --------------------------------------------------------------

      "If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing."
      John Brunner, "The Shockwave Rider".

        1) create database with three columns for 'depth level', 'link', 'read'
        2) depth level = 0: base URL
        3) while $#array_urls
        4) write to database ("depth level, $array_url")
        5) $depth++
        6) trigger the db
        7) read each link from the database and add further links found to the database
        8) until depth level = 3
        9) stop!
        I have to write a program that will repeatedly extract links to a depth level of 3. That is, after extracting links the first time, I store them in the MySQL database; then I fetch these links and run the extraction function over each one to extract more links, so I have to do this three times. I have stored and then fetched the links from the database, but I am not able to run the function on each link and store the results back to the database. The code is shown below.
        #!/usr/bin/perl
        use strict;
        use LWP::UserAgent;
        use HTML::LinkExtor;
        use URI::URL;
        use DBI();

        my $url = <>;   # for instance
        chomp $url;
        #my $depth = 0;
        my @link = ();
        my $ua = LWP::UserAgent->new;

        # Set up a callback that collects links
        my @a = ();
        sub callback {
            my ($tag, %attr) = @_;
            return if $tag ne 'a';
            push(@a, values %attr);
        }

        # Make the parser. Unfortunately, we don't know the base yet
        # (it might be different from $url)
        my $p = HTML::LinkExtor->new(\&callback);
        my $res = $ua->request(HTTP::Request->new(GET => $url),
                               sub { $p->parse($_[0]) });

        # Expand all URLs to absolute ones
        my $base = $res->base;
        @a = map { url($_, $base)->abs } @a;

        # Print them out
        print join("\n", @a), "\n";

        # Connection details (host, user, password) elided
        my $dbh = DBI->connect("DBI:mysql:database=gatxp;host=", "", "");

        #$dbh->do("CREATE TABLE newlinks (md5 INTEGER(100) not null primary key,
        #          webpage VARCHAR(80) not null)");
        $dbh->do("INSERT INTO newlinks VALUES ('MD5','0','$base','1')");
        foreach my $a (@a) {
            $dbh->do("INSERT INTO newlinks VALUES ('','1','$a','0')");
        }

        # Fetch every stored link and extract its links in turn
        my $sth = $dbh->prepare('SELECT * FROM newlinks')
            or die "Couldn't prepare statement: " . $dbh->errstr;
        $sth->execute();
        while (my $ref = $sth->fetchrow_hashref()) {
            my $link = $ref->{'webpage'};
            @a = ();                      # collect this page's links afresh
            $p = HTML::LinkExtor->new(\&callback);
            my $usa = LWP::UserAgent->new;
            my $page = $usa->request(HTTP::Request->new(GET => $link),
                                     sub { $p->parse($_[0]) });
            $base = $page->base;
            @link = map { url($_, $base)->abs } @a;
            # Store the second-level links back in the table
            foreach my $found (@link) {
                $dbh->do("INSERT INTO newlinks VALUES ('','2','$found','0')");
            }
            print "$link\n";
        }
        $sth->finish();
        $dbh->disconnect();
