You're right. I looked at the source and found this abomination:
s{<a\s+.*?href\=(.*?)>(.*?)</a>}{
...
}isgxe;
Ouch. There are so many ways that this can go wrong: "a" tags with a "name" and no href attribute, whitespace around the "=", ...
There are modules made especially to extract links from HTML, for example HTML::LinkExtor and HTML::SimpleLinkExtor. Using one of those would have been a much safer approach.
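To illustrate, here is a minimal sketch of link extraction with HTML::LinkExtor. The sample HTML and the base URL are made up for the example, and this is not how the module under discussion does it. It just shows a real parser coping with the cases that trip up the regex: an "a" tag with only a "name", whitespace around the "=", and mixed-case tags.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample input containing the cases the regex gets wrong.
my $html = <<'HTML';
<a name="top">Anchor without href</a>
<a href = "/docs/index.html">Docs</a>
<A HREF='http://example.com/other'>Other</A>
HTML

my @hrefs;
my $extor = HTML::LinkExtor->new(
    sub {
        # The callback gets the lowercased tag name plus its link attributes.
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' && defined $attr{href};
        push @hrefs, "$attr{href}";    # stringify the URI object
    },
    'http://example.com/',             # assumed base URL, used to absolutize links
);
$extor->parse($html);
$extor->eof;

print "$_\n" for @hrefs;
```

The callback is only invoked for tags that actually carry link attributes, so the named anchor never shows up, and passing a base URL turns relative links into absolute ones.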
But at least this module takes "robots.txt" files into consideration, which is the polite thing to do and probably one of the first things to be dropped in a more naive approach. So that is good.
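For anyone rolling their own spider, honoring "robots.txt" does not take much code. Below is a rough sketch using WWW::RobotRules; I have not checked whether the module under discussion uses it. The bot name, the URLs, and the robots.txt content are all invented for the example, and in a real spider you would fetch the file from the site first.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical bot name; use whatever your spider sends as its User-Agent.
my $rules = WWW::RobotRules->new('MySpider/0.1');

# Made-up robots.txt content, as it might be served by the site.
my $robots_txt = <<'ROBOTS';
User-agent: *
Disallow: /private/
ROBOTS

# Register the rules for this host under the URL the file came from.
$rules->parse('http://example.com/robots.txt', $robots_txt);

for my $url ('http://example.com/public/page.html',
             'http://example.com/private/secret.html') {
    my $verdict = $rules->allowed($url) ? 'fetch' : 'skip';
    print "$verdict $url\n";
}
```

LWP::RobotUA wraps the same check around a user agent, so it also fetches and caches the "robots.txt" files for you.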