 
PerlMonks  

Re^5: fix the problem of the web crawler

by frozenwithjoy (Curate)
on Nov 09, 2012 at 15:29 UTC (#1003163)


in reply to Re^4: fix the problem of the web crawler
in thread fix the problem of the web crawler

To match both of these formats (and to catch any future style variations), you can more or less ignore the style information. You know that you definitely want to match <td\sclass="coauthor" as well as ><a\shref="([^"]+)">([^>]+)<\/a>, but you almost don't care about what is in between, right?

The reason I say "almost don't care" is that you want to match everything EXCEPT a closing '>', so that your regex doesn't match too much. [^>]* matches 0 or more characters that are not '>'.
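
Putting the two pieces together might look something like the sketch below. The sample rows and the extract_coauthors helper are made up for illustration (I don't know what your real pages look like beyond the two formats discussed), so treat it as a starting point rather than a drop-in fix.

```perl
use strict;
use warnings;

# Hypothetical sample rows: the same cell with and without style information.
my @rows = (
    '<td class="coauthor"><a href="/a/1.html">Jane Doe</a></td>',
    '<td class="coauthor" style="color:red"><a href="/a/2.html">John Roe</a></td>',
);

# [^>]* sits between the class attribute and the closing '>' of the <td> tag.
# It skips any extra attributes but cannot run past the end of that tag.
my $re = qr{<td\sclass="coauthor"[^>]*><a\shref="([^"]+)">([^>]+)</a>};

# Hypothetical helper: returns a list of [href, link text] pairs.
sub extract_coauthors {
    my @pages = @_;    # work on copies of the caller's strings
    my @found;
    for my $html (@pages) {
        while ($html =~ /$re/g) {
            push @found, [$1, $2];
        }
    }
    return @found;
}

printf "%s => %s\n", @$_ for extract_coauthors(@rows);
```

Because [^>]* can never cross a '>', a row such as <td class="coauthor"></td><td><a href="...">...</a> is not picked up by mistake, which is the "doesn't match too much" part.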


Re^6: fix the problem of the web crawler
by ati (Initiate) on Nov 10, 2012 at 16:04 UTC

    I solved it, a thousand thanks to you. Now it works as before. But there is another problem, if you can suggest what I can do: authors whose names have more than 3 parts are not crawled, and neither are authors whose names contain an apostrophe or a suffix (', Jr., II, III). For example, these authors:

    Norie De La Cruz
    Norm O'Neill
    Norman L. Guinasso Jr.
    Norris Milton II
    Northrup Fowler III
    Noor Asna Fazli Abdul Samad
    N. S. S. S. N. Usha Devi
    Niels H. M. Aan de Brugh
