PerlMonks  

Re^3: Problem in using Web::Scraper, coming from HTML::TreeBuilder::XPath

by Anonymous Monk
on Oct 06, 2013 at 07:16 UTC (#1057126)


in reply to Re^2: Problem in using Web::Scraper, coming from HTML::TreeBuilder::XPath
in thread Problem in using Web::Scraper, coming from HTML::TreeBuilder::XPath

Well, "it works" for me in the sense that it doesn't die without any error message :) but it still doesn't work, because the HTML isn't what you think it is :) so no info is extracted.

#!/usr/bin/perl --
use strict;
use warnings;
use WWW::Mechanize 1.73;
use Web::Scraper 0.37;
use Data::Dump;

my $out = scraper {
    process ".gs_rt", "title[]" => scraper {
        process ".gs_a", "info" => 'TEXT';
        process q{gs_a}, "info" => 'TEXT';
    };
};

my $mech = WWW::Mechanize->new(qw/ autocheck 1 /);
$mech->show_progress(1);
$mech->get(
    "http://scholar.google.it/scholar?hl=en&q=Handbuch+der+biologischen+Arbeitsmethoden"
);
if( $mech->follow_link( url_regex => qr/cites/i, n => 1 ) ){
    my $result = $mech->content;
    my $indi = $mech->uri();
    my $res = $out->scrape( $result, $indi );
    #~ dd( $result, $res );
    dd( $res );
}
__END__
$ perl web-scraper-google-pm1057095.pl
** GET http://scholar.google.it/scholar?hl=en&q=Handbuch+der+biologischen+Arbeitsmethoden ==> 200 OK (1s)
** GET http://scholar.google.it/scholar?cites=3692889479872081319&as_sdt=2005&sciodt=0,5&hl=en&oe=ASCII ==> 200 OK
{ title => [{}, {}, {}, {}, {}, {}, {}, {}, {}, {}] }

If you want to fix up your CSS paths, use htmltreexpather.pl / xpather.pl and compare the hierarchy:

HTML::Element=HASH(0xcac644) 0.1.0.8.0.1.2.0.0
The fire of life. An introduction to animal energetics.
/html/body/div/div[5]/div/div[2]/div[2]/div/h3
//div[@id='gs_ccl']/div[2]/div/h3
//div[@id='gs_ccl']/div[@style='z-index:400' and @class='gs_r']/div[@class='gs_ri']/h3[@class='gs_rt']
------------------------------------------------------------------
HTML::Element=HASH(0xcac534) 0.1.0.8.0.1.2.0.1
M Kleiber - The fire of life. An introduction to animal energetics., 1961 - cabdirect.org
/html/body/div/div[5]/div/div[2]/div[2]/div/div
//div[@id='gs_ccl']/div[2]/div/div
//div[@id='gs_ccl']/div[@style='z-index:400' and @class='gs_r']/div[@class='gs_ri']/div[@class='gs_a']
------------------------------------------------------------------

gs_a is not a child of gs_rt; they're siblings, both children of gs_ri:

//div[@class='gs_r']/div[@class='gs_ri']/h3[@class='gs_rt']
//div[@class='gs_r']/div[@class='gs_ri']/div[@class='gs_a']
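
Given that hierarchy, one way to restructure the scraper is to loop over the shared parent (gs_ri) and pick up both siblings inside it. This is only a sketch: the inline HTML is a hypothetical, simplified stand-in for the Scholar page (the real markup may differ or change), and the key name "results" is made up for illustration.

```perl
#!/usr/bin/perl --
use strict;
use warnings;
use Web::Scraper;

# Hypothetical, simplified stand-in for the markup reported by xpather above.
my $html = <<'HTML';
<html><body><div id="gs_ccl">
<div class="gs_r"><div class="gs_ri">
<h3 class="gs_rt">The fire of life. An introduction to animal energetics.</h3>
<div class="gs_a">M Kleiber - The fire of life. An introduction to animal energetics., 1961 - cabdirect.org</div>
</div></div>
</div></body></html>
HTML

# Iterate over the common parent (.gs_ri); inside each one,
# h3.gs_rt and div.gs_a are siblings, so both can be matched directly.
my $out = scraper {
    process "div.gs_ri", "results[]" => scraper {
        process "h3.gs_rt", "title" => 'TEXT';
        process "div.gs_a", "info"  => 'TEXT';
    };
};

# scrape() also accepts a plain HTML string, so no network is needed here.
my $res = $out->scrape( $html );

for my $r ( @{ $res->{results} || [] } ) {
    print "$r->{title} | $r->{info}\n";
}
```

Against the live page, the same scraper would be fed `$mech->content` and `$mech->uri` as in the script above, assuming the class names still match.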


Re^4: Problem in using Web::Scraper, coming from HTML::TreeBuilder::XPath
by sbasbasba (Initiate) on Oct 06, 2013 at 20:34 UTC

    I don't know why, but I keep on seeing that message :(

Re^4: Problem in using Web::Scraper, coming from HTML::TreeBuilder::XPath
by sbasbasba (Initiate) on Oct 07, 2013 at 05:19 UTC

    I was finally able to run the code without errors!! It turns out that I was missing some extra files in the HTML::TreeBuilder folder.

    However, I am still not able to print the results to a text file. I am using your code, and at the end:

    for my $out (@{$res->{out}}) {
        print F3 "$out->{title} $out->{info} $out->{info}\n";
    }

    to print to the file (F3 is the output file), which is similar to what I found on the CPAN website (the example with the tweets). But I get something like "HASH(0x100d35ef0)" in my text file, and that's all: no data. What am I doing wrong?

    Many, many, many thanks!!!

      What am I doing wrong?

      I don't know; I can't tell what you're doing. Or you're not reading closely enough what I have written.
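
For what it's worth, output like "HASH(0x...)" is what Perl produces when a reference itself is interpolated into a string instead of the values it points to. A minimal sketch of walking such a structure and writing it to a file, assuming a result shaped like a typical Web::Scraper return value (the key names and data here are hypothetical, and a temporary file stands in for the real output file):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical result: one key holding an array ref of hash refs.
my $res = {
    results => [
        { title => 'The fire of life', info => 'M Kleiber - 1961' },
    ],
};

# Interpolating the reference itself gives its address, not its contents.
my $as_string = "$res";    # something like HASH(0x...)

# Temporary file as a stand-in for the real output file.
my ( $fh, $file ) = tempfile();
for my $r ( @{ $res->{results} } ) {
    # Dereference all the way down to plain scalars before printing.
    print {$fh} "$r->{title}\t$r->{info}\n";
}
close $fh or die "Cannot close $file: $!";
```

The loop has to use the key names that actually appear in the dumped result (dd shows them); looping over a key that does not exist prints nothing at all.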
