Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

having no result

by ati (Initiate)
on Nov 15, 2011 at 15:17 UTC ( [id://938195]=perlquestion: print w/replies, xml ) Need Help??

ati has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm new to perl. I started to work a project and found this code which is very helpful for me if I have the result of this code. so write if somebody has a result of this code. it is about crawling the dblp and finding coauthors. the code i found here http://dconry.blogspot.com/2007/08/dblp-crawler-code.html the input required is the code below (perl file) and a csv file (below) with the names of the authors. i've debugged and the the output file is created (conflict.txt) but it is empty, so i have empty results.. here it is the code. thank you

#!/usr/bin/perl # This script scrapes co-author info for a specific person from DBLP. # Revision 1 intended use is to supply author name on cmd line, # with a list of co-authors provided as output, one per line. # Revision 2 methodizes the crawler to provide multi-lvl crawling. # Revision 3 encapsulates the crawler in a loop to run against a list +of authors. # Status output is written to STDOUT, errors (such as unfound authors) + written # to STDERR, and coauthor info is written to the file specified belo +w. # Import Perl's WWW library for quick & easy web retrieval. # utf8 allows unicode char in this script, and also import HTML uni +code conversion methods. use utf8; use LWP::Simple; use HTML::Entities; # Inits; $sleep indicates time to wait between each author crawl. my %conflicts = (); my $base_url = 'http://www.sigmod.org/dblp/db/indices/a-tree/'; my $sleep = 5; my $outfile = 'conflicts.txt'; # DBLP full names for some authors otherwise not found in catalog. my %fullnames = ('List of names snipped'); # Open the input data, parsing out reviewer names into a list. Init o +utput file. open FILE, 'PapersReviewers.csv'; my $line = shift || die "Empty input file."; close FILE; chomp $line; my (undef, @reviewers) = split ',', $line; my $num_reviewers = $#reviewers + 1; # Open output file... use > to start over, >> to continue. open OUTFILE, '>>'.$outfile; close OUTFILE; # Disable buffering on STDOUT so I can see the damn progress log in re +altime. select((select(STDOUT), $|=1)[0]); # Loop over all reviewers, formatting name and calling the Crawler. my $count = 0; foreach my $reviewer (sort @reviewers) { # Loop inits; clear the conflicts hash. $count++; next if ($count < 0); # Skip to current guy (or gal). %conflicts = (); print 'Working on ', $reviewer, ', # ', $count, ' of ', $num_review +ers, '... '; # Use alternative DBLP name if it exists. if (exists $fullnames{$reviewer}) { $reviewer = $fullnames{$reviewer}; print 'using alternative form of name ', $reviewer, '... '; } # Format reviewer name to match DBLP specs. my $orig_name = $reviewer; $reviewer = encode_entities($reviewer); $reviewer =~ s/[^\w\s]/=/g; my ($first, $middle, $last) = split /\s+/, $reviewer; my $formatted = ''; if (defined $last) { $formatted = $last.':'.$first.'_'.$middle; } else { $formatted = $middle.':'.$first; } # Call the crawler method with formatted name. #&Crawl('Fox:Edward_A='); &Crawl($formatted); # Results of 1st level crawl are crawled in turn (2nd lvl). my @conflicts = sort keys %conflicts; foreach my $conflict (@conflicts) { &Crawl($conflict); } # Output the results. open OUTFILE, '>>'.$outfile; print OUTFILE $orig_name, ','; foreach my $key (sort keys %conflicts) { print OUTFILE $conflicts{$key}, ','; } print OUTFILE "\n"; close OUTFILE; # Finished with this $reviewer, wait $sleep seconds before starting +next. print 'done.', "\n"; sleep $sleep; } # Returns a list of co-authors from DBLP. sub Crawl { # Compose author name for retrieval. my $name = shift || die "Bad usage of method Crawl."; my $category = lc(substr($name, 0, 1)); # Construct author URL and retrieve summary page. my $url = $base_url.$category.'/'.$name.'.html'; my $page = get($url) || warn "Couldn't get ${url}: $!"; return () unless defined $page; # Find co-authors list at bottom and parse out all names & URLs. while ($page =~ m{\d+<\/td> # First two lines match style code. <td\salign="right"\sbgcolor="[^"]+"> <a\shref="([^"]+)"> # Matches relative li +nk to coauthor page. ([^>]+)<\/a> # Matches co-autho +r name. }mgx) { # Translate relative URL into an absolute using base address of D +BLP. my $url = $1; my $coauth_name = $2; my ($tmp1, $tmp2, $tmp3) = split '/', $url; my $coauth = $tmp3; $coauth =~ s/.html$//; $coauth_name = decode_entities($coauth_name); # Save this co-author. $conflicts{$coauth} = $coauth_name; } return 0; }

here it is the csv file how it should look, with the name PapersReviewers.csv ..sory if I'm not so clear with the explanation

Wa'el H. 'Awad E. N. M. Nolte 'Hoen Indrani Goswami (Chakraborty) Asi Elad (Elbaz) Seyed H. Haeri (Hossein) Jonah Z. Loeb (Lavi) Miri Priesler (Moreno) Venkata Reddy Muppani (Muppant) Shlomo Shamai (Shitz) Atmanand M. A Ibrahim M. W. A Image Processing de la Cruz A Liya A Manoj Prasad A Moqry A Nguyen Q. A

Replies are listed 'Best First'.
Re: having no result
by marto (Cardinal) on Nov 15, 2011 at 15:29 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://938195]
Approved by sundialsvc4
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (4)
As of 2024-04-18 00:30 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found