
Re^3: fix the problem of the web crawler

by frozenwithjoy (Priest)
on Nov 08, 2012 at 21:44 UTC

in reply to Re^2: fix the problem of the web crawler
in thread fix the problem of the web crawler

Here are a couple more (very specific) hints:
  1. Uncomment the print page line so you can see the content you are scraping (or just go to the appropriate URL and view source).
  2. Change this part of the regex, since it is apparently out of date: <td\sclass="coauthor"\salign="right"\sbgcolor="[^"]+">

Also, I don't mean to be a jerk, but it really is better for you to work through this yourself. Instead of sending me messages, you should show what you are trying here; people will be more willing to help once they've seen that you are indeed making a noble effort. As the ancient saying goes: "Monks help those that help themselves!"

Re^4: fix the problem of the web crawler
by ati (Initiate) on Nov 09, 2012 at 14:55 UTC

    Thanks to you, I almost found the error in the regex. But because the author entries use different styles, only the authors matching the first style are crawled; those with the other style are not. I need some kind of union of two regular expressions to catch both.

    <td\sclass="coauthor"\sstyle="text-align:right;background:[^"]+"><a\shref="([^"]+)">([^>]+)<\/a>

    (some union or "and" expression goes here)


    I mean that some union expression is needed between these two parts (I don't know what to put), because with "or" (|) it still matches only the first style, and authors with the second style are not matched. Am I right or not? Any suggestions?

      To grab things matching both of these formats (and to catch any future style variations), you can more or less ignore the style information. You know that you definitely want to match <td\sclass="coauthor" as well as ><a\shref="([^"]+)">([^>]+)<\/a>, but you almost don't care about what is in between, right?

      The reason I say "almost don't care" is that you want to match everything EXCEPT a closing '>', so that your regex doesn't match too much. [^>]* matches 0 or more characters that are not '>'.
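As a minimal sketch of that idea (not the original crawler's code): the two pieces quoted above are joined with [^>]* in between, so whatever attributes follow class="coauthor" are skipped up to the tag's closing '>'. The sample HTML rows, the URLs in them, and the variable names ($html, $re, @authors) are made up for illustration; only the regex fragments come from this thread.

```perl
use strict;
use warnings;

# Hypothetical sample rows, one per attribute style mentioned in the thread.
my $html = <<'HTML';
<td class="coauthor" align="right" bgcolor="#eeeeee"><a href="/a/one.html">Norm O'Neill</a></td>
<td class="coauthor" style="text-align:right;background:#ffffff"><a href="/a/two.html">Norris Milton II</a></td>
HTML

# Fixed start of the cell, then [^>]* to skip any attributes up to the
# closing '>' of the <td> tag, then the link whose URL and text are captured.
my $re = qr{<td\sclass="coauthor"[^>]*><a\shref="([^"]+)">([^>]+)<\/a>};

# Collect every match in the page, whichever style the cell uses.
my @authors;
while ( $html =~ /$re/g ) {
    push @authors, { url => $1, name => $2 };
}

printf "%s => %s\n", $_->{name}, $_->{url} for @authors;
```

Because [^>]* cannot cross a '>', the match stays inside the single <td ...> tag rather than running on into later cells.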

        I solved it, a thousand thanks to you. Now it works as before. But there is another problem, if you can suggest what I can do: authors whose names have more than 3 parts are not crawled, nor are authors whose names contain ( ' , .Jr, II, III). For example, these authors:

        Norie De La Cruz
        Norm O'Neill
        Norman L. Guinasso Jr.
        Norris Milton II
        Northrup Fowler III
        Noor Asna Fazli Abdul Samad
        N. S. S. S. N. Usha Devi
        Niels H. M. Aan de Brugh
