PerlMonks  

Problem in using Web::Scraper, coming from HTML::TreeBuilder::XPath

by sbasbasba (Initiate)
on Oct 06, 2013 at 00:22 UTC (#1057095=perlquestion)
sbasbasba has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, thanks in advance for all the precious knowledge you've been sharing so far!

I am a newbie at Perl, and I am trying to write a script that:

1) searches Google Scholar for some keywords stored in a text file;

2) opens the first "Cited by..." link that appears in the results;

3) scrapes the whole results page that follows (name, info, and number of citations of each paper).

This is what I wrote so far:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;
    use LWP::UserAgent;
    use Web::Scraper;

    my $mech = WWW::Mechanize->new();
    $mech->get("http://scholar.google.it/scholar?hl=en&q=Handbuch+der+biologischen+Arbeitsmethoden");
    my $response = $mech->follow_link( url_regex => qr/cites/i, n=>1 );

    my $result = $response->decoded_content;
    my $indi = $mech->uri();
    open (F3, '>results.txt') or die "$!";

    my $out = scraper{
        process ".gs_rt", "title[]" => scraper {
            process ".gs_a", "info" => 'TEXT';
            process ".gs_fl", "cites" => 'TEXT';
        };
    };
    my $res = $out->scrape($result, $indi);

    for my $out (@{$res->{out}}) {
        print F3 "$out->{title} $out->{info} $out->{info}\n";
    }
    sleep(3);
    close(F3);

The line:

my $res = $out->scrape($result, $indi);

however, gives me the following error:

Can't locate object method "new" via package "HTML::TreeBuilder::XPath" at /System/Library/Perl/Extras/5.10.0/Web/Scraper.pm line 115, <F1> line 1.

I have searched the Internet and found no answer. I updated my version of XPath and tried scrape(URI->($indi)); instead, but nothing works. I am quite desperate! I have the feeling that there is a bug in the XPath.pm file, because I have been following exactly the same scraping code that I see in the CPAN guide for Web::Scraper.

If you could help me, you would have my eternal gratitude.

Thanks a lot in advance!!

Re: Problem in using Web::Scraper, coming from HTML::TreeBuilder::XPath
by Anonymous Monk on Oct 06, 2013 at 00:50 UTC
    Can you reproduce the error without all the loops and file manipulation stuff? (trim trim trim your code)

      Sure, I trimmed the code above.

        Well, "it works" for me, in that it doesn't die with any kind of error message :) but it doesn't really work, because the HTML isn't structured the way you think it is :) so no info is extracted
        #!/usr/bin/perl --
        use strict;
        use warnings;
        use WWW::Mechanize 1.73;
        use Web::Scraper 0.37;
        use Data::Dump;

        my $out = scraper {
            process ".gs_rt", "title[]" => scraper {
                process ".gs_a", "info" => 'TEXT';
                process q{gs_a}, "info" => 'TEXT';
            };
        };

        my $mech = WWW::Mechanize->new(qw/ autocheck 1 /);
        $mech->show_progress(1);
        $mech->get(
            "http://scholar.google.it/scholar?hl=en&q=Handbuch+der+biologischen+Arbeitsmethoden"
        );
        if( $mech->follow_link( url_regex => qr/cites/i, n => 1 ) ){
            my $result = $mech->content;
            my $indi = $mech->uri();
            my $res = $out->scrape( $result, $indi );
            #~ dd( $result, $res );
            dd( $res );
        }
        __END__
        $ perl web-scraper-google-pm1057095.pl
        ** GET http://scholar.google.it/scholar?hl=en&q=Handbuch+der+biologischen+Arbeitsmethoden ==> 200 OK (1s)
        ** GET http://scholar.google.it/scholar?cites=3692889479872081319&as_sdt=2005&sciodt=0,5&hl=en&oe=ASCII ==> 200 OK
        { title => [{}, {}, {}, {}, {}, {}, {}, {}, {}, {}] }

        If you want to fix up your 'CSS paths', use htmltreexpather.pl / xpather.pl and compare the hierarchy:

        HTML::Element=HASH(0xcac644) 0.1.0.8.0.1.2.0.0
        The fire of life. An introduction to animal energetics.
        /html/body/div/div[5]/div/div[2]/div[2]/div/h3
        //div[@id='gs_ccl']/div[2]/div/h3
        //div[@id='gs_ccl']/div[@style='z-index:400' and @class='gs_r']/div[@class='gs_ri']/h3[@class='gs_rt']
        ------------------------------------------------------------------
        HTML::Element=HASH(0xcac534) 0.1.0.8.0.1.2.0.1
        M Kleiber - The fire of life. An introduction to animal energetics., 1961 - cabdirect.org
        /html/body/div/div[5]/div/div[2]/div[2]/div/div
        //div[@id='gs_ccl']/div[2]/div/div
        //div[@id='gs_ccl']/div[@style='z-index:400' and @class='gs_r']/div[@class='gs_ri']/div[@class='gs_a']
        ------------------------------------------------------------------

        gs_a is not a child of gs_rt; they're siblings, both children of gs_ri:

        //div[@class='gs_r']/div[@class='gs_ri']/h3[@class='gs_rt']
        //div[@class='gs_r']/div[@class='gs_ri']/div[@class='gs_a']
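Given that sibling relationship, one way to restructure the scraper is to iterate over the common parent (.gs_ri) and pick each sibling inside it. The following is only a sketch: the HTML is made-up sample markup that copies just the class names from the xpather output above, the key names papers, title, info and cites are my own choice, and the real Google Scholar markup may differ or change.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Web::Scraper;

# Made-up sample HTML (an assumption): only the class names come from the
# thread; titles, authors and citation counts are placeholders.
my $html = <<'HTML';
<html><body><div id="gs_ccl">
  <div class="gs_r"><div class="gs_ri">
    <h3 class="gs_rt">Example paper one</h3>
    <div class="gs_a">A Author - 2001 - example.org</div>
    <div class="gs_fl">Cited by 12</div>
  </div></div>
  <div class="gs_r"><div class="gs_ri">
    <h3 class="gs_rt">Example paper two</h3>
    <div class="gs_a">B Author - 2002 - example.org</div>
    <div class="gs_fl">Cited by 3</div>
  </div></div>
</div></body></html>
HTML

# The outer rule selects each result container (.gs_ri); the nested scraper
# then looks for the three sibling elements inside that container.
my $papers = scraper {
    process '.gs_ri', 'papers[]' => scraper {
        process '.gs_rt', title => 'TEXT';
        process '.gs_a',  info  => 'TEXT';
        process '.gs_fl', cites => 'TEXT';
    };
};

# scrape() also accepts a plain HTML string, so no network access is needed.
my $res = $papers->scrape($html);

for my $paper ( @{ $res->{papers} || [] } ) {
    print join( ' | ', map { $paper->{$_} // '' } qw(title info cites) ), "\n";
}
```

Against a live page you would pass $mech->content and $mech->uri to scrape(), as in the code earlier in this reply.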
Re: Problem in using Web::Scraper, coming from HTML::TreeBuilder::XPath
by kcott (Abbot) on Oct 06, 2013 at 05:47 UTC

    G'day sbasbasba,

    Welcome to the monastery.

    "Can't locate object method "new" via package "HTML::TreeBuilder::XPath" at /System/Library/Perl/Extras/5.10.0/Web/Scraper.pm line 115, <F1> line 1."

    That pathname (/System/Library/Perl/...) indicates your OS is Mac OS X and you're using the system Perl (i.e. the version of Perl installed by Apple for its own use). You have Web::Scraper installed in /System/Library/Perl/Extras/5.10.0/: this means you've modified your system Perl; I don't know what other modifications you've made. It's generally not a good idea to alter the system Perl. See the responses to "Are there any major Perl issues with Mac OS X Lion?": I posted this question a couple of years ago when I first started using Perl on a Mac; I chose the perlbrew option (and have no problems after 2 years of use and multiple Perl upgrades). I'd recommend you look into perlbrew or an equivalent solution.

    Your current problem is probably related to if, and where, you have HTML::TreeBuilder::XPath installed. You may also have other versions of Perl installed. Without more information, I can only provide troubleshooting tips:

    • The shebang line at the start of your script (#!/usr/bin/perl) indicates that the system Perl (/usr/bin/perl) should be used to run the script. Use the "which perl" command to see if that's the default Perl: you may get something like /opt/local/bin/perl if you've installed MacPorts.
    • Find out where your Perl module libraries are. "perl -V" will list these under @INC: for the default Perl; for a specific Perl, use a full pathname, e.g. "/opt/local/bin/perl -V".
    • Determine how you're installing Perl modules. cpan is a fairly typical utility for this. Use "which cpan" to see the full path and compare with paths to perl.
    • Search for .../HTML/TreeBuilder/XPath.pm on your system. Assuming it has been installed, it's probably in one of the paths listed under @INC:. If you can't find it, install it; if it's in an unexpected place, don't try to copy or move it, reinstall it.
    • Your problem might be fixed by changing your shebang line to whatever your default Perl is. For all my scripts, I use "#!/usr/bin/env perl" which automatically uses the current default.
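
    The checks in the list above can also be run from inside Perl itself. This is a rough sketch using core features only; locate_module and check_module are hypothetical helper names of my own, not part of any module. Run it with each perl you have (e.g. /usr/bin/perl and whatever "which perl" reports) and compare the results.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: list every copy of a module's .pm file found on @INC.
sub locate_module {
    my ($module) = @_;
    ( my $rel = "$module.pm" ) =~ s{::}{/}g;    # Foo::Bar -> Foo/Bar.pm
    return grep { -f $_ } map { $_ . "/$rel" } grep { !ref $_ } @INC;
}

# Hypothetical helper: does the module load, and does it provide new()?
sub check_module {
    my ($module) = @_;
    my @found   = locate_module($module);
    my $loads   = eval "require $module; 1" ? 1 : 0;
    my $has_new = ( $loads && $module->can('new') ) ? 1 : 0;
    return { found => \@found, loads => $loads, has_new => $has_new };
}

# Which perl binary is actually running this script, and what does it search?
print "Running perl: $^X (version $^V)\n";
print "\@INC:\n";
print "  $_\n" for @INC;

for my $module (qw(HTML::TreeBuilder::XPath Web::Scraper)) {
    my $r = check_module($module);
    printf "%s: %s; loads=%s; can('new')=%s\n",
        $module,
        ( @{ $r->{found} } ? join( ', ', @{ $r->{found} } ) : 'not found on @INC' ),
        ( $r->{loads}   ? 'yes' : 'no' ),
        ( $r->{has_new} ? 'yes' : 'no' );
}
```

    If the file is found but loads or can('new') comes back "no", the copy on disk is likely incomplete or broken, and reinstalling it is the safer route.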

    [Aside: I noticed you removed part of your original post and replaced it with new content. Please don't do this: it often invalidates comments already made; it can also be useful to subsequent readers to see what was considered and then discarded (i.e. others can learn from your mistakes). The correct way to deal with this is described in "How do I change/delete my post?".]

    -- Ken

      Hi All,

      Thanks for the great answers, and sorry for not having followed the right etiquette of this website. I am not a programmer, and I started using perl only two weeks ago.

      So: I have tried to locate where perl is installed: /usr/bin/perl. However, I also see perl5.8.9 and 5.10.0. Furthermore, for some reason cpan won't let me install modules, so I have basically been copy-pasting the source of the .pm module files I needed and putting them in the corresponding folders (i.e. the HTML/TreeBuilder folder for XPath.pm). Most of these folders, I have no idea why, are located in /System/Library/Perl/Extras/5.10.0/, so that's where I put the files. Both XPath and Scraper are in subfolders of this folder. I know this is not how I should work, but I couldn't figure out anything better when I started, and only now do I realize that I have probably made a huge mess.

      What would you guys recommend to solve this problem? Where should the folders be located?

      Thanks a lot for your precious help!

        What would you guys recommend to solve this problem? Where should the folders be located?

        step 1) restore your system perl to the pristine condition it was in; I can't help with the exact steps, it's macness :)

        step 2a) maintain your own perl (easy) ( install CitrusPerl), this way you get a newer version of perl, and if apple-macness updates the system perl, you don't have to reinstall/recompile any modules, your perl remains untouched

        step 2b) or maintain your own PERL5LIB with cpanm (see: cpanm --local-lib; PERL5LIB; export PERL_MB_OPT=--install_base /home/user/devstuff; set PERL_MM_OPT=INSTALL_BASE=/home/user/devstuff)

        get cpanm and install Web::Scraper into /home/username/myperllibs

        curl -L http://cpanmin.us | perl - -v --local-lib /home/username/myperllibs App::cpanminus ExtUtils::MakeMaker Module::Build Web::Scraper

        wget -O - http://cpanmin.us | perl - -v --local-lib /home/username/myperllibs App::cpanminus ExtUtils::MakeMaker Module::Build Web::Scraper

        Then you can run perl -I/home/username/myperllibs/lib/perl5 mypythonic.pl or

        export PERL5LIB=/home/username/myperllibs/lib/perl5:$PERL5LIB
        export PATH=/home/username/myperllibs/bin:$PATH

        perlbrew or perlall are workable alternatives somewhat automating one or more of the above steps

        But when apple-macness updates the system perl, you'll have to recompile/reinstall any modules with binary components (.so/.xs files) not provided by apple-macness (anything with .xs/.so files in your myperllibs)
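
        Besides -I and PERL5LIB, the core lib pragma can point a single script at the local library. A minimal sketch, assuming the modules were installed under /home/username/myperllibs as in the commands above (adjust the path to your own setup):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumption: cpanm --local-lib installed modules under this directory.
use lib '/home/username/myperllibs/lib/perl5';

# use lib adds the directory to the front of @INC at compile time, so it is
# searched before the system directories.
print "\@INC now starts with:\n";
print "  $_\n" for @INC[ 0 .. ( $#INC < 2 ? $#INC : 2 ) ];

# Report whether the module can be loaded, without dying if it cannot.
if ( eval { require Web::Scraper; 1 } ) {
    printf "Web::Scraper %s loaded from %s\n",
        Web::Scraper->VERSION, $INC{'Web/Scraper.pm'};
}
else {
    print "Web::Scraper could not be loaded: $@";
}
```

        This keeps the dependency on the local directory visible in the script itself, at the cost of hard-coding a path.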

        step 3) If you were using a "word processor" switch to a "notepad" equivalent, or a programmers editor ( scite, textpad, gvim, emacs, padre )
