PerlMonks  

Re^4: Any spider framework?

by bart (Canon)
on Jan 10, 2012 at 08:07 UTC (#947109)


in reply to Re^3: Any spider framework?
in thread Any spider framework?

In the case of <a name="foo"> it simply won't match, as the regexp includes href.
And what makes you think the regex would limit itself to a single tag? In your example, the "<a " could be matched in one tag while the "href=" is matched much further down in the document. In fact, there is no guarantee that this string is a tag attribute at all: it could just as well be in plain HTML text ("PCDATA"), in JavaScript code, or even in an HTML comment.
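
To make that concrete, here is a minimal sketch of both failure modes. The regex from the parent post is not quoted here, so the pattern `NAIVE` below is only a hypothetical stand-in for "a regexp that includes href"; the sample strings and the helper `naive_links` are made up for illustration. It is written in Python, but the regex behaves the same way in Perl.

```python
import re

# Hypothetical naive pattern (an assumption, not the regex from the thread):
# "<a " followed, anywhere later, by href="...".
# re.DOTALL lets "." cross newlines, as it would on a slurped document.
NAIVE = re.compile(r'<a\s.*?href="([^"]*)"', re.DOTALL)

# Case 1: the anchor has no href, so the match runs on into the next tag.
spans_tags = '<a name="foo">top</a>\n<link href="style.css">'

# Case 2: the text sits inside an HTML comment, not in a live tag.
in_comment = '<!-- <a href="old.html">old</a> -->'


def naive_links(html):
    """Return every href value the naive pattern claims to find."""
    return NAIVE.findall(html)


print(naive_links(spans_tags))   # ['style.css']
print(naive_links(in_comment))   # ['old.html']
```

In the first case the "link" that comes back belongs to a `<link>` tag, not to the anchor; in the second it comes from commented-out markup.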

To be reliable, a parser (actually just a lexer; it could be regex based) should extract whole tags, and you should then test each tag on its own.
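
A rough sketch of such a regex-based lexer, under some loud assumptions: the names `TOKEN`, `HREF` and `anchor_hrefs` are invented for this example, only comments and `<script>` bodies are skipped, and attribute values are assumed to be quoted. It is not a substitute for a real HTML parser, just an illustration of "extract whole tags first, then test each one". It is written in Python, but the patterns would carry over to Perl largely unchanged.

```python
import re

# Lexer: one alternation that consumes comments, script bodies and opening
# tags as whole tokens, so text inside the first two is never seen as a tag.
TOKEN = re.compile(
    r'<!--.*?-->'                    # comment: consumed and discarded
    r'|<script\b.*?</script\s*>'     # script element: consumed and discarded
    r'''|(<[a-zA-Z](?:"[^"]*"|'[^']*'|[^'">])*>)''',  # opening tag: group 1
    re.DOTALL | re.IGNORECASE,
)

# Applied to ONE tag at a time; assumes the attribute value is quoted.
HREF = re.compile(r'\shref\s*=\s*(["\'])(.*?)\1', re.DOTALL | re.IGNORECASE)


def anchor_hrefs(html):
    """Return the href values of <a ...> tags, testing each tag on its own."""
    links = []
    for token in TOKEN.finditer(html):
        tag = token.group(1)
        if tag is None:                      # a comment or a script body
            continue
        if not re.match(r'<a[\s>]', tag, re.IGNORECASE):
            continue                         # some other tag, e.g. <link>
        href = HREF.search(tag)
        if href:                             # <a name="foo"> simply has none
            links.append(href.group(2))
    return links


sample = """<a name="foo">top</a>
<link href="style.css">
<!-- <a href="old.html">old</a> -->
<script>var s = '<a href="js.html">';</script>
<p>Write <a href="real.html" title="a > b">this</a></p>"""

print(anchor_hrefs(sample))   # ['real.html']
```

Because each tag is tested in isolation, the `<a name="foo">` anchor cannot borrow an `href` from a later tag, and the commented-out and scripted look-alikes are never treated as tags.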

