Re^5: How to extract links from a webpage and store them in a mysql database
by chargrill (Parson) on Dec 21, 2006 at 13:26 UTC
And now a second bit of help, possibly a much bigger bit than the last one.
I'm not familiar with HTML::LinkExtor, and I really don't use LWP::UserAgent these days either, so I wrote something taking advantage of my personal favorite for anything webpage related, WWW::Mechanize.
I also never quite understood your original algorithm. If it were me (and in this case it is) I'd keep track of urls (weeding out duplicates) for a given link depth on my own, in my own data structure, as opposed to inserting things into a database and fetching them back out to re-crawl them.
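To make that bookkeeping concrete, here is a minimal sketch of the idea, separate from the crawler in this node. It assumes a hash keyed by url that records the depth at which each url was first seen, plus a list of urls still to fetch at the current depth. The names crawl_by_depth, $get_links and %pages are made up for illustration, and the link fetcher is passed in as a code ref so the sketch runs without touching the network. It makes no attempt to filter off-site urls.

```perl
use strict;
use warnings;

# Hypothetical helper: walk links level by level, up to $max_depth.
# $get_links is a code ref that takes a url and returns the urls found on it.
# Returns a hash ref of url => depth at which that url was first seen.
sub crawl_by_depth {
    my ( $start, $max_depth, $get_links ) = @_;

    my %seen    = ( $start => 0 );    # doubles as the duplicate filter
    my @current = ($start);           # urls to fetch at this depth

    for my $depth ( 1 .. $max_depth ) {
        my @next;
        for my $url (@current) {
            for my $link ( $get_links->($url) ) {
                next if exists $seen{$link};    # already known, skip it
                $seen{$link} = $depth;
                push @next, $link;
            }
        }
        last unless @next;                      # nothing new to follow
        @current = @next;
    }
    return \%seen;
}

# Stand-in for real pages: each key "links to" the urls in its array.
my %pages = (
    a => [ 'b', 'c' ],
    b => [ 'a', 'c', 'd' ],
    c => [],
    d => ['e'],
);

my $seen = crawl_by_depth( 'a', 2, sub { @{ $pages{ $_[0] } || [] } } );
print "$seen->{$_} $_\n" for sort keys %$seen;
# prints:
# 0 a
# 1 b
# 1 c
# 2 d
```

With a real fetcher, the code ref would get the page and return its absolute link urls. With WWW::Mechanize that would be something along the lines of calling get on the url and mapping url_abs over what links returns. Since every url lands in %seen exactly once, whatever you do with the result afterwards (a database insert, say) sees no duplicates.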
I'm also not clear on your specs as to whether or not you want urls that are off-site. The logic for the way this program handles that is pretty clearly documented, so if it isn't to your spec, adjust it.
Having said all that, here is a recursive link crawler. (Though now that I type out "recursive link crawler", I can't help but imagine that this has been done before, and I'm certain a search would turn one up fairly quickly. Oh well.)
Inserting the links into a database is left as an exercise for the reader.