fix the problem of the web crawler

by ati (Initiate)
on Nov 08, 2012 at 15:14 UTC (#1002919)
ati has asked for the wisdom of the Perl Monks concerning the following question:

Until about a week ago I was using this crawler to get the co-author lists of DBLP authors. It worked fine, but then it stopped. I don't know where the problem is, because it doesn't show any error; it just doesn't crawl anything (I get an empty results file). Maybe the problem is on the DBLP page itself. Here are the files I've used. Please help me get this script working again, because I need to finish a project.
#!/usr/bin/perl
# This script scrapes co-author info for a specific person from DBLP.
# Revision 1 intended use is to supply author name on cmd line,
# with a list of co-authors provided as output, one per line.
# Revision 2 methodizes the crawler to provide multi-lvl crawling.
# Revision 3 encapsulates the crawler in a loop to run against a list of authors.
# Status output is written to STDOUT, errors (such as unfound authors) written
# to STDERR, and coauthor info is written to the file specified below.

# Import Perl's WWW library for quick & easy web retrieval.
# utf8 allows unicode char in this script, and also import HTML unicode conversion methods.
use utf8;
use LWP::Simple;
use HTML::Entities;

# Inits; $sleep indicates time to wait between each author crawl.
my %conflicts = ();
my %index = ();
my $base_url = ' +tree/';
my $sleep = 0;
my $outfile = 'conf.txt.';
my $index_number = 0;
my $index_dy = 0;

# DBLP full names for some authors otherwise not found in catalog.
my %fullnames = ('List of names snipped');

# Open the input data, parsing out reviewer names into a list. Init output file.
$filename = "author.txt";
open(INPUT,$filename) or die "Can't open file $filename\n";
undef $/;
my $text=<INPUT>;
close INPUT;
$/ = "\n";
@reviewers = split(/\n/, $text);
my $num_reviewers = $#reviewers + 1;

# Open output file... use > to start over, >> to continue.
open OUTFILE, '>>'.$outfile;
close OUTFILE;

# Disable buffering on STDOUT so I can see the damn progress log in realtime.
select((select(STDOUT), $|=1)[0]);

# Loop over all reviewers, formatting name and calling the Crawler.
my $count = 0;
foreach my $reviewer (@reviewers) {
    # Loop inits; clear the conflicts hash.
    $count++;
    $index_number++;
    $index_dy++;
    next if ($count < 0); # Skip to current guy (or gal).
    %conflicts = ();
    print 'Working on ', $reviewer, ', # ', $count, ' of ', $num_reviewers, '... ';

    # Format reviewer name to match DBLP specs.
    my $orig_name = $reviewer;
    $reviewer = encode_entities($reviewer);
    $reviewer =~ s/[^\w\s]/=/g;
    my ($first, $middle, $last) = split /\s+/, $reviewer;
    my $formatted = '';
    if (defined $last) {
        $formatted = $last.':'.$first.'_'.$middle;
    } else {
        $formatted = $middle.':'.$first;
    }
    $index{$formatted} = $index_dy;
    $index{$conflicts} = $index_number;

    # Call the crawler method with formatted name.
    #&Crawl('Fox:Edward_A=');
    &Crawl($formatted);

    # Output the results.
    open OUTFILE, '>>'.$outfile;
    foreach my $key (keys %conflicts) {
        print OUTFILE $index{$formatted}, '=', $orig_name,'=', $conflicts{$key},' ', "\n";
    }
    my @conflicts = sort keys %conflicts;
    #$index{$formatted}
    #$index{$conflicts},
    %conflicts = ();
    foreach my $conflict (@conflicts) {
        &Crawl($conflict);
        ($surname, $name) = split /:/,$conflict;
        open OUTFILE, '>>'.$outfile;
    }
    close OUTFILE;

    # Finished with this $reviewer, wait $sleep seconds before starting next.
    print 'done.', "\n";
    sleep $sleep;
}

# Returns a list of co-authors from DBLP.
sub Crawl {
    # Compose author name for retrieval.
    my $name = shift || die "Bad usage of method Crawl.";
    #print "\n",$name,"\n";
    my $category = lc(substr($name, 0, 1));
    #print "\n",$category ,"\n";

    # Construct author URL and retrieve summary page.
    my $url = $base_url.$category.'/'.$name.'.html';
    #print "\n",$url ,"\n";
    my $page = get($url) || warn "Couldn't get ${url}: $!";
    #print "\n",$page ,"\n";
    return () unless defined $page;

    # Find co-authors list at bottom and parse out all names & URLs.
    while ($page =~ m{<\/td>                  # First two lines match style code.
                      <td\sclass="coauthor"\salign="right"\sbgcolor="[^"]+">
                      <a\shref="([^"]+)">     # Matches relative link to coauthor page.
                      ([^>]+)<\/a>            # Matches co-author name.
                     }mgx) {
        # Translate relative URL into an absolute using base address of DBLP.
        my $url = $1;
        my $coauth_name = $2;
        my ($tmp1, $tmp2, $tmp3) = split '/', $url;
        my $coauth = $tmp3;
        $coauth =~ s/.html$//;
        $coauth_name = decode_entities($coauth_name);

        # Save this co-author.
        $conflicts{$coauth} = $coauth_name;
        if (!exists $index{$coauth}) {
            $index_number++;
            $index{$coauth} = $index_number;
        }
    }
    return 0;
}
The author file, author.txt:
James F. Blakesley
James F. Blinn
James F. Blowey
James F. Bowring
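
For reference, the name-formatting step of the script can be tried on its own. The sketch below copies that step into a helper whose name, format_dblp_key, is hypothetical. It leaves out encode_entities on the assumption that plain-ASCII names like the ones above pass through it unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Mirrors the "Format reviewer name to match DBLP specs" step of the script.
# format_dblp_key is a hypothetical helper name; encode_entities is skipped
# on the assumption that plain-ASCII names are left unchanged by it.
sub format_dblp_key {
    my ($name) = @_;
    $name =~ s/[^\w\s]/=/g;    # punctuation such as "." becomes "="
    my ($first, $middle, $last) = split /\s+/, $name;
    return defined $last
        ? $last . ':' . $first . '_' . $middle    # three-part name
        : $middle . ':' . $first;                 # two-part name
}

print "$_ -> ", format_dblp_key($_), "\n"
    for 'James F. Blakesley', 'James F. Blinn';
```

The script then appends the lowercased first letter of the key, a slash, the key itself, and ".html" to $base_url to build the page address.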

Replies are listed 'Best First'.
Re: fix the problem of the web crawler
by frozenwithjoy (Priest) on Nov 08, 2012 at 16:39 UTC
    Three hints (and a suggestion):
    1. The URLs that the script is generating are correct.
    2. The regex doesn't seem to be matching anything because the style code has changed on the website.
    3. There are lots of other potential problems with your script that can be found with use strict; use warnings;
    4. Considering you posted nearly the same wall of script a year ago, it might be worth paying someone to clean it up and make it work properly.
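
As a small illustration of hint 3, the sketch below compiles one statement taken from the posted script under strict. The script declares the hash %conflicts but also writes $index{$conflicts}, which refers to a scalar that is never declared. The string eval is used here only so the compile error can be captured and printed instead of aborting the program; the variable names $snippet and $strict_error are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# %conflicts (a hash) is declared, but $conflicts (a scalar) is not.
# Under strict, using the undeclared scalar is a compile-time error.
my $snippet = q{
    my %conflicts;
    my %index;
    $index{$conflicts} = 1;
    1;
};

# The string eval inherits "use strict" from this file, so the
# compile error ends up in $@ rather than killing the program.
my $ok = eval $snippet;
my $strict_error = $ok ? '' : $@;
print $ok ? "compiled cleanly\n" : "strict caught: $strict_error";
```

Without strict, the same statement silently stores a value under the empty-string key, which is the kind of problem that otherwise goes unnoticed.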

    Edit: Is this how the output (conf.txt.) is supposed to look? (I accept all PayPal alternatives... Just kidding... Sort of... But seriously, if this is the expected output and you follow my hints, you'll figure it out.)

    1=James F. Blakesley=Frederick H. Wolf
    1=James F. Blakesley=Keith S. Murray
    1=James F. Blakesley=Dagmar Murray
    2=James F. Blinn=Turner Whitted
    2=James F. Blinn=Pat Hanrahan
    2=James F. Blinn=Tomas Porter
    2=James F. Blinn=Flip Phillips
    2=James F. Blinn=Martin E. Newell
    2=James F. Blinn=Jeffrey M. Lane
    2=James F. Blinn=Nick England
    2=James F. Blinn=Loren C. Carpenter
    2=James F. Blinn=Alvy Ray Smith
    2=James F. Blinn=Donna J. Cox
    2=James F. Blinn=Helga M. Leonardt Hendriks
    2=James F. Blinn=Charles T. Loop
    2=James F. Blinn=Rob Pike
    2=James F. Blinn=Richard Ellison
    3=James F. Blowey=John W. Barrett
    3=James F. Blowey=Stephen Langdon
    3=James F. Blowey=John R. King
    4=James F. Bowring=Mary Jean Harrold
    4=James F. Bowring=James M. Rehg
    4=James F. Bowring=Alessandro Orso
    4=James F. Bowring=James A. Jones

      It is the exact output I've had before.

        Here are a couple more (very specific) hints:
        1. Uncomment the print page line so you can see the content you are scraping (or just go to the appropriate URL and view source).
        2. Change this part of the regex since it is apparently out-of-date: <td\sclass="coauthor"\salign="right"\sbgcolor="[^"]+">
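
One possible direction for hint 2 is to stop requiring the exact align and bgcolor attributes and match any td that carries the coauthor class. This is only a sketch: the sample markup in $page is invented for illustration and is not the real DBLP page, so the pattern still has to be checked against the actual page source.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample markup, NOT copied from DBLP. Check the real
# page source before relying on any of these assumptions.
my $page = <<'HTML';
<tr><td class="coauthor"><a href="../b/Blinn:James_F=.html">James F. Blinn</a></td></tr>
<tr><td class="coauthor" align="right"><a href="../w/Whitted:Turner.html">Turner Whitted</a></td></tr>
HTML

my %conflicts;
while ($page =~ m{
        <td\s[^>]*class="coauthor"[^>]*>   # any td carrying the coauthor class
        \s*<a\shref="([^"]+)">             # relative link to the co-author page
        ([^<]+)</a>                        # co-author name
    }gx) {
    my ($url, $name) = ($1, $2);
    # Take the last path component without ".html" as the key.
    my ($key) = $url =~ m{([^/]+)\.html$};
    $conflicts{$key} = $name if defined $key;
}

print "$_ => $conflicts{$_}\n" for sort keys %conflicts;
```

Dropping the styling attributes from the pattern makes it less sensitive to purely cosmetic changes, though it will still break if the class name or the link structure changes.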

        Also, I don't mean to be a jerk, but it is really better for you if you work through this yourself. Instead of sending me messages, you should show what you are trying here and people will be more willing to help when they've seen that you are indeed making a noble effort. Like the ancient saying goes: "Monks help those that help themselves!"

Re: fix the problem of the web crawler
by bitingduck (Chaplain) on Nov 08, 2012 at 16:29 UTC

    In my limited experience with screenscrapers, that failure mode is usually caused by someone at the other end changing the formatting. You're looking for the stuff you want with a regex, rather than an html parser, so you can easily be a victim of very minor changes in the html. Your best bet is probably about 10 minutes of looking at the page source and then revising the regex accordingly. Switching to using HTML::TreeBuilder and taking advantage of predictable page structure and tag attributes might make your script a little more robust (or it might not, depending on who is messing with it at the other end...). I have a scraper that's been running reliably for several years now through a number of changes in the target page's display format since I switched to treebuilder.
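
A minimal sketch of the HTML::TreeBuilder approach might look like the following. It assumes the HTML::TreeBuilder module from CPAN is installed, and the sample markup in $html is hypothetical rather than taken from DBLP; only the idea of selecting cells by tag and class instead of by exact text carries over.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup, NOT the real DBLP page.
my $html = <<'HTML';
<table>
<tr><td class="coauthor" align="right" bgcolor="#ffffcc"><a href="../b/Blinn:James_F=.html">James F. Blinn</a></td></tr>
<tr><td class="coauthor"><a href="../w/Whitted:Turner.html">Turner Whitted</a></td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my %coauthors;
# Select every td whose class attribute is exactly "coauthor",
# regardless of which other attributes it carries.
for my $cell ($tree->look_down(_tag => 'td', class => 'coauthor')) {
    my $link = $cell->look_down(_tag => 'a') or next;
    my $href = $link->attr('href') // next;
    my ($key) = $href =~ m{([^/]+)\.html$} or next;
    $coauthors{$key} = $link->as_text;    # as_text returns decoded text
}
$tree->delete;    # free the parse tree

print "$_ => $coauthors{$_}\n" for sort keys %coauthors;
```

Because the lookup depends on the tag and class rather than on the exact attribute string, the first row still matches even though it carries extra styling attributes.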

Re: fix the problem of the web crawler
by zwon (Abbot) on Nov 08, 2012 at 15:28 UTC

    If you think this is a place where you post your non-working scripts and they get fixed, it is not. There are a lot of freelance websites where you can find a programmer to fix your problem.
