To grab things matching both of these formats (and to catch any future style variations), you can more or less ignore the style information. You know that you definitely want to match <td\sclass="coauthor" as well as ><a\shref="([^"]+)">([^>]+)<\/a>, but you almost don't care about what is in between, right?
The reason I say "almost don't care" is that you want to match everything EXCEPT a closing '>', to make sure your regex doesn't match too much. [^>]* matches 0 or more characters that are not '>', so putting it between the two pieces gives <td\sclass="coauthor"[^>]*><a\shref="([^"]+)">([^>]+)<\/a>.
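Here is a minimal sketch of the idea, written with Python's re module purely for illustration (the pattern itself carries over unchanged to other regex engines). The sample rows, the style attribute, the URLs, and the names are all made up for the example; only the two pattern pieces and the [^>]* filler come from the discussion above.

```python
import re

# Hypothetical sample input: two "coauthor" cells (one plain, one with an
# extra style attribute) and one cell with a different class.
html = (
    '<td class="coauthor"><a href="/pers/a.html">Alice Example</a></td>\n'
    '<td class="coauthor" style="color:red"><a href="/pers/b.html">Bob Example</a></td>\n'
    '<td class="other"><a href="/pers/c.html">Carol Example</a></td>\n'
)

# [^>]* soaks up any attributes between class="coauthor" and the closing '>'
# of the <td> tag, but cannot run past that '>'.
pattern = re.compile(
    r'<td\sclass="coauthor"[^>]*><a\shref="([^"]+)">([^>]+)<\/a>'
)

# Each match is a (href, link text) tuple, one per capture group.
matches = pattern.findall(html)
for href, name in matches:
    print(href, name)
```

Both coauthor rows should match regardless of the extra attribute, while the row with a different class is skipped. If the cells could carry attributes before class="coauthor", the pattern would need a similar filler there too.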