PerlMonks  

Re^2: Problem in using Web::Scraper, coming from HTML::TreeBuilder::XPath

by sbasbasba (Initiate)
on Oct 06, 2013 at 03:26 UTC (#1057101)

Re^3: Problem in using Web::Scraper, coming from HTML::TreeBuilder::XPath
by Anonymous Monk on Oct 06, 2013 at 07:16 UTC
    Well, "it works" for me in that it doesn't die and prints no error message :) but it doesn't really work, because the HTML isn't what you think it is :) so no info is extracted.

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use WWW::Mechanize 1.73;
    use Web::Scraper 0.37;
    use Data::Dump;

    my $out = scraper {
        process ".gs_rt", "title[]" => scraper {
            process ".gs_a", "info" => 'TEXT';
            process q{gs_a}, "info" => 'TEXT';
        };
    };

    my $mech = WWW::Mechanize->new(qw/ autocheck 1 /);
    $mech->show_progress(1);
    $mech->get(
        "http://scholar.google.it/scholar?hl=en&q=Handbuch+der+biologischen+Arbeitsmethoden"
    );
    if ( $mech->follow_link( url_regex => qr/cites/i, n => 1 ) ) {
        my $result = $mech->content;
        my $indi   = $mech->uri();
        my $res    = $out->scrape( $result, $indi );
        #~ dd( $result, $res );
        dd( $res );
    }
    __END__
    $ perl web-scraper-google-pm1057095.pl
    ** GET http://scholar.google.it/scholar?hl=en&q=Handbuch+der+biologischen+Arbeitsmethoden ==> 200 OK (1s)
    ** GET http://scholar.google.it/scholar?cites=3692889479872081319&as_sdt=2005&sciodt=0,5&hl=en&oe=ASCII ==> 200 OK
    { title => [{}, {}, {}, {}, {}, {}, {}, {}, {}, {}] }

    If you want to fix up your CSS selectors, use htmltreexpather.pl / xpather.pl and compare the hierarchy:

    HTML::Element=HASH(0xcac644) 0.1.0.8.0.1.2.0.0
    The fire of life. An introduction to animal energetics.
    /html/body/div/div[5]/div/div[2]/div[2]/div/h3
    //div[@id='gs_ccl']/div[2]/div/h3
    //div[@id='gs_ccl']/div[@style='z-index:400' and @class='gs_r']/div[@class='gs_ri']/h3[@class='gs_rt']
    ------------------------------------------------------------------
    HTML::Element=HASH(0xcac534) 0.1.0.8.0.1.2.0.1
    M Kleiber - The fire of life. An introduction to animal energetics., 1961 - cabdirect.org
    /html/body/div/div[5]/div/div[2]/div[2]/div/div
    //div[@id='gs_ccl']/div[2]/div/div
    //div[@id='gs_ccl']/div[@style='z-index:400' and @class='gs_r']/div[@class='gs_ri']/div[@class='gs_a']
    ------------------------------------------------------------------

    gs_a is not a child of gs_rt; they're siblings, and they're both children of gs_ri:

    //div[@class='gs_r']/div[@class='gs_ri']/h3[@class='gs_rt']
    //div[@class='gs_r']/div[@class='gs_ri']/div[@class='gs_a']
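Since gs_rt and gs_a are siblings under gs_ri, one way to express that in Web::Scraper is to iterate over the gs_ri containers and pick both children inside the nested scraper. The following is only a sketch under assumptions: the inline HTML is a simplified, hypothetical stand-in for the real page (only the class names and the first entry's text come from the dump above), and the key name "results" is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical, simplified markup that mirrors the hierarchy shown above:
# each div.gs_ri contains an h3.gs_rt (title) and a sibling div.gs_a (info).
# The second entry is invented purely to show that a list is returned.
my $html = <<'HTML';
<html><body><div id="gs_ccl">
<div class="gs_r"><div class="gs_ri">
<h3 class="gs_rt">The fire of life. An introduction to animal energetics.</h3>
<div class="gs_a">M Kleiber - The fire of life. An introduction to animal energetics., 1961 - cabdirect.org</div>
</div></div>
<div class="gs_r"><div class="gs_ri">
<h3 class="gs_rt">Example title (hypothetical)</h3>
<div class="gs_a">A Author - 2000 - example.org</div>
</div></div>
</div></body></html>
HTML

# Outer rule: one entry per gs_ri container.
# Inner rules: title and info are looked up inside that container.
my $results = scraper {
    process ".gs_ri", "results[]" => scraper {
        process ".gs_rt", title => 'TEXT';
        process ".gs_a",  info  => 'TEXT';
    };
};

# scrape() also accepts a plain HTML string, so no network access is needed here.
my $res = $results->scrape($html);

for my $item ( @{ $res->{results} || [] } ) {
    print "$item->{title} | $item->{info}\n";
}
```

With the live page, the same scraper object could be fed $mech->content and $mech->uri as in the script above; whether the class names still match depends on the markup Google serves at that time.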

      I was finally able to run the code without errors!! It turns out that I was missing some extra files in the HTML::TreeBuilder folder.

      However, I am still not able to print the results to a text file. I am using your code, with this at the end to write the file (F3 is the output filehandle):

      for my $out (@{$res->{out}}) {
          print F3 "$out->{title} $out->{info} $out->{info}\n";
      }

      This is similar to what I found on the CPAN website (the example with the tweets). But I get something like "HASH(0x100d35ef0)" in my text file, and that's all: no data. What am I doing wrong?

      Many, many, many thanks!!!

        What am I doing wrong?

        I don't know; I can't tell what you're doing. Or you're not reading closely enough what I have written.
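For what it's worth, text like "HASH(0x100d35ef0)" is what Perl prints when a hash reference itself is interpolated into a string instead of one of its fields. Below is a minimal sketch of writing nested results to a file, assuming the result is an array of hash refs with title and info fields. The "results" key, the write_results helper, and the results.txt file name are all hypothetical, and the data is hard-coded so the example runs without Web::Scraper or a network connection.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical data shaped like a "results[]" => scraper { ... } rule would
# return: an array of hash refs, each with a title and an info field.
my $res = {
    results => [
        {
            title => 'The fire of life. An introduction to animal energetics.',
            info  => 'M Kleiber - The fire of life. An introduction to animal energetics., 1961 - cabdirect.org',
        },
    ],
};

# Interpolating the reference itself (not a field) gives text like HASH(0x...).
my $as_string = "$res->{results}[0]";

# Hypothetical helper: write one tab-separated line per result.
sub write_results {
    my ( $data, $path ) = @_;
    open my $fh, '>', $path or die "Cannot open $path: $!";
    for my $item ( @{ $data->{results} || [] } ) {
        # Fall back to an empty string when a field was not extracted.
        my $title = defined $item->{title} ? $item->{title} : '';
        my $info  = defined $item->{info}  ? $item->{info}  : '';
        print {$fh} "$title\t$info\n";
    }
    close $fh or die "Cannot close $path: $!";
    return;
}

write_results( $res, 'results.txt' );    # hypothetical output file name
```

The loop has to use the key that the scraper rule actually defines; dumping the result with dd($res) first shows which keys exist and how deeply they are nested.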

      I don't know why, but I keep on seeing that message :(
