PerlMonks  

Re^2: Problem in using Web::Scraper, coming from HTML::TreeBuilder::XPath

by sbasbasba (Initiate)
on Oct 06, 2013 at 03:26 UTC

Re^3: Problem in using Web::Scraper, coming from HTML::TreeBuilder::XPath
by Anonymous Monk on Oct 06, 2013 at 07:16 UTC
    Well, "it works" for me in the sense that it doesn't die without any kind of error message :) but it doesn't really work, because the HTML isn't what you think it is :) so no info is extracted.
    #!/usr/bin/perl --
    use strict; use warnings;
    use WWW::Mechanize 1.73;
    use Web::Scraper 0.37;
    use Data::Dump;

    my $out = scraper {
        process ".gs_rt", "title[]" => scraper {
            process ".gs_a", "info" => 'TEXT';
            process q{gs_a}, "info" => 'TEXT';
        };
    };

    my $mech = WWW::Mechanize->new(qw/ autocheck 1 /);
    $mech->show_progress(1);
    $mech->get( "http://scholar.google.it/scholar?hl=en&q=Handbuch+der+biologischen+Arbeitsmethoden" );

    if( $mech->follow_link( url_regex => qr/cites/i, n => 1 ) ){
        my $result = $mech->content;
        my $indi = $mech->uri();
        my $res = $out->scrape( $result, $indi );
        #~ dd( $result, $res );
        dd( $res );
    }
    __END__
    $ perl web-scraper-google-pm1057095.pl
    ** GET http://scholar.google.it/scholar?hl=en&q=Handbuch+der+biologischen+Arbeitsmethoden ==> 200 OK (1s)
    ** GET http://scholar.google.it/scholar?cites=3692889479872081319&as_sdt=2005&sciodt=0,5&hl=en&oe=ASCII ==> 200 OK
    { title => [{}, {}, {}, {}, {}, {}, {}, {}, {}, {}] }

    If you want to fix up your "CSS paths", use htmltreexpather.pl / xpather.pl and compare the hierarchy:

    HTML::Element=HASH(0xcac644) 0.1.0.8.0.1.2.0.0
    The fire of life. An introduction to animal energetics.
    /html/body/div/div[5]/div/div[2]/div[2]/div/h3
    //div[@id='gs_ccl']/div[2]/div/h3
    //div[@id='gs_ccl']/div[@style='z-index:400' and @class='gs_r']/div[@class='gs_ri']/h3[@class='gs_rt']
    ------------------------------------------------------------------
    HTML::Element=HASH(0xcac534) 0.1.0.8.0.1.2.0.1
    M Kleiber - The fire of life. An introduction to animal energetics., 1961 - cabdirect.org
    /html/body/div/div[5]/div/div[2]/div[2]/div/div
    //div[@id='gs_ccl']/div[2]/div/div
    //div[@id='gs_ccl']/div[@style='z-index:400' and @class='gs_r']/div[@class='gs_ri']/div[@class='gs_a']
    ------------------------------------------------------------------

    gs_a is not a child of gs_rt; they're siblings, and both are children of gs_ri:

    //div[@class='gs_r']/div[@class='gs_ri']/h3[@class='gs_rt']
    //div[@class='gs_r']/div[@class='gs_ri']/div[@class='gs_a']
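
The sibling relationship described above suggests iterating over the common parent and picking each child from there. What follows is only a minimal sketch, not the poster's code: the inline HTML is a hypothetical, trimmed-down stand-in for the real page (which may differ or change), and the key names `results`, `title` and `info` are chosen here for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sample HTML mirroring the hierarchy reported above:
# h3.gs_rt and div.gs_a are siblings inside div.gs_ri.
my $html = <<'HTML';
<html><body>
<div id="gs_ccl">
  <div class="gs_r"><div class="gs_ri">
    <h3 class="gs_rt">The fire of life. An introduction to animal energetics.</h3>
    <div class="gs_a">M Kleiber - The fire of life. An introduction to animal energetics., 1961 - cabdirect.org</div>
  </div></div>
</div>
</body></html>
HTML

# Match the common parent (.gs_ri) once per result, then extract
# each sibling relative to that parent.
my $results = scraper {
    process ".gs_ri", "results[]" => scraper {
        process ".gs_rt", title => 'TEXT';
        process ".gs_a",  info  => 'TEXT';
    };
};

my $res = $results->scrape($html);

# Each element of $res->{results} is a hash ref with title and info.
for my $item ( @{ $res->{results} || [] } ) {
    print "$item->{title}\n  $item->{info}\n";
}
```

With the live page you would pass `$mech->content` and `$mech->uri` to `scrape` as in the script above, instead of the inline sample.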

      I don't know why, but I keep seeing that message :(

      I was finally able to run the code without errors!! It turns out that I was missing some extra files in the HTML::TreeBuilder folder.

      However, I am still not able to print the results to a text file. I am using your code, and at the end I added:

      for my $out (@{$res->{out}}) {
          print F3 "$out->{title} $out->{info} $out->{info}\n";
      }

      to print the file (F3 is the output file). Which is similar to what I found on the CPAN website (example with the tweets). But I get something like "HASH(0x100d35ef0)" in my text file, and that's all: no data. What am I doing wrong?

      Many, many, many thanks!!!

        What am I doing wrong?

        I don't know; I can't tell what you're doing. Or you're not reading closely enough what I have written.
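
On the "HASH(0x100d35ef0)" output: Perl prints a string of that form whenever a hash reference is interpolated into a string, so at least one of the printed values is probably a reference rather than plain text. The asker's scraper definition isn't shown, so the following is only a guess at the shape of the data. The `$res` structure, the key names and the file name `results.txt` are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical result, shaped like what a nested scraper can return:
# an array ref of hash refs stored under a single key.
my $res = {
    results => [
        {
            title => 'The fire of life. An introduction to animal energetics.',
            info  => 'M Kleiber - The fire of life. An introduction to animal energetics., 1961 - cabdirect.org',
        },
    ],
};

# Interpolating a reference yields its type and address, not its contents.
my $wrong = "$res->{results}[0]";    # a string of the form HASH(0x...)

# Dereference down to the scalar fields before printing.
sub format_rows {
    my ($data) = @_;
    return map { "$_->{title}\t$_->{info}\n" } @{ $data->{results} || [] };
}

# 'results.txt' is a placeholder file name; a lexical filehandle with
# three-argument open replaces the bareword F3.
open my $fh, '>', 'results.txt' or die "Cannot open results.txt: $!";
print {$fh} format_rows($res);
close $fh or die "Cannot close results.txt: $!";
```

Dumping the structure first (for example with `dd($res)` from Data::Dump, as in the script earlier in the thread) shows which keys exist and which values are still references.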
