PerlMonks  

Re: fix the problem of the web crawler

by frozenwithjoy (Curate)
on Nov 08, 2012 at 16:39 UTC (#1002945)


in reply to fix the problem of the web crawler

Three hints (and a suggestion):

  1. The URLs that the script is generating are correct.
  2. The regex doesn't seem to be matching anything because the style code has changed on the website.
  3. There are lots of other potential problems with your script that use strict; use warnings; would catch.
  4. Considering you posted nearly the same wall of script a year ago, it might be worth paying someone to clean it up and make it work properly.

Edit: Is this how the output (conf.txt) is supposed to look? (I accept all PayPal alternatives... Just kidding... Sort of... But seriously, if this is the expected output and you follow my hints, you'll figure it out.)

1=James F. Blakesley=Frederick H. Wolf
1=James F. Blakesley=Keith S. Murray
1=James F. Blakesley=Dagmar Murray
2=James F. Blinn=Turner Whitted
2=James F. Blinn=Pat Hanrahan
2=James F. Blinn=Tomas Porter
2=James F. Blinn=Flip Phillips
2=James F. Blinn=Martin E. Newell
2=James F. Blinn=Jeffrey M. Lane
2=James F. Blinn=Nick England
2=James F. Blinn=Loren C. Carpenter
2=James F. Blinn=Alvy Ray Smith
2=James F. Blinn=Donna J. Cox
2=James F. Blinn=Helga M. Leonardt Hendriks
2=James F. Blinn=Charles T. Loop
2=James F. Blinn=Rob Pike
2=James F. Blinn=Richard Ellison
3=James F. Blowey=John W. Barrett
3=James F. Blowey=Stephen Langdon
3=James F. Blowey=John R. King
4=James F. Bowring=Mary Jean Harrold
4=James F. Bowring=James M. Rehg
4=James F. Bowring=Alessandro Orso
4=James F. Bowring=James A. Jones


Re^2: fix the problem of the web crawler
by ati (Initiate) on Nov 08, 2012 at 18:11 UTC

    It is the exact output I've had before.

      Here are a couple more (very specific) hints:
      1. Uncomment the print page line so you can see the content you are scraping (or just go to the appropriate URL and view source).
      2. Change this part of the regex since it is apparently out-of-date: <td\sclass="coauthor"\salign="right"\sbgcolor="[^"]+">

      Also, I don't mean to be a jerk, but it is really better for you if you work through this yourself. Instead of sending me messages, you should show what you are trying here and people will be more willing to help when they've seen that you are indeed making a noble effort. Like the ancient saying goes: "Monks help those that help themselves!"

        Thanks to you, I almost found the error in the regex. But the author cells on the page use two different styles, so only the authors matching the first style are crawled; the ones with the other style are not. I need some way to combine the two regular expressions so that both are matched.

        <td\sclass="coauthor"\sstyle="text-align:right;background:[^"]+"><a\shref="([^"]+)">([^>]+)<\/a>

        (some union or "and" expression needs to go here)

        <td\sclass="coauthor"\sstyle="text-align:right;"><a\shref="([^"]+)">([^>]+)<\/a>

        I mean that some union expression is needed between these two parts (I don't know what to put), because with "or" (|) it still matches only the first, and authors with the second style are not matched. Am I right or not? Any suggestions?
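
One way to cover both styles is sketched below. It assumes the two cell variants differ only in the background: declaration, so that part can be made optional instead of joining two full patterns. A plain | between the two complete patterns gives each alternative its own capture groups, so a match on the second alternative leaves $1 and $2 undefined and fills $3 and $4 instead; that may be why only the first style seemed to match. The sample HTML, the hrefs, and the names $html, $coauthor_re and @coauthors are made up for illustration and are not taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical sample rows, one for each of the two styles described above.
my $html = <<'HTML';
<td class="coauthor" style="text-align:right;background:#eee"><a href="/a/1.html">Turner Whitted</a></td>
<td class="coauthor" style="text-align:right;"><a href="/a/2.html">Pat Hanrahan</a></td>
HTML

# The "background:..." part sits in an optional non-capturing group (?:...)?,
# so $1 is always the link and $2 is always the name, whichever style matched.
# [^<]+ is used for the name here; the original pattern used [^>]+.
my $coauthor_re = qr{
    <td\sclass="coauthor"\s
    style="text-align:right;(?:background:[^"]+)?">
    <a\shref="([^"]+)">([^<]+)</a>
}x;

my @coauthors;
while ( $html =~ /$coauthor_re/g ) {
    push @coauthors, { href => $1, name => $2 };
}

print "$_->{name} ($_->{href})\n" for @coauthors;
```

If the style attribute varies in other ways as well, a looser pattern such as style="[^"]*" might be easier to maintain, at the cost of matching cells you may not want.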
