
Re: WWW::Mechanize::TreeBuilder and WWW::Mechanize. Following links but can't return without error

by Anonymous Monk
on Jan 04, 2013 at 23:32 UTC (#1011738=note)

Re^2: WWW::Mechanize::TreeBuilder and WWW::Mechanize. Following links but can't return without error
by mdro79 (Initiate) on Jan 05, 2013 at 00:05 UTC

    Oh sorry, here is the full code (I thought a snippet would be clearer in that case). This is everything from the folder: the 4 HTML pages and the Perl code.

    mech.pl

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;
    use WWW::Mechanize::TreeBuilder;

    my $mech = WWW::Mechanize->new();
    WWW::Mechanize::TreeBuilder->meta->apply($mech);

    $mech->get('file:///home/example/path/index.html');
    die "Cannot open file: ", $mech->response->status_line
        unless $mech->success;

    # Collect every <a class="links"> element from the index page
    my @list = $mech->look_down(_tag => "a", class => "links");

    foreach (@list) {
        my $url = "file:///home/example/path/";
        $url = $url . $_->attr("href");
        print $_->as_text(), " - ", $url, "\n";
        $mech->get($url);
        $mech->back();
    }

    index.html

    <html>
    <head>
      <title>Web Scraper Testing Grounds</title>
      <style>
        .links { font-family: Sans-Serif; }
        .spans { font-family: Serif; }
        .trs { border: 1px solid; }
      </style>
    </head>
    <body>
      <h1>Test Page for WWW::Mechanize scraping</h1>
      <table>
        <tr class="trs">
          <td><a href="s1.html" class="links">S1 Link Content</a></td>
          <td><span class="spans">Page S1</span></td>
        </tr>
        <tr class="trs">
          <td><a href="s2.html" class="links">S2 Link Content</a></td>
          <td><span class="spans">Page S2</span></td>
        </tr>
        <tr class="trs">
          <td><a href="s3.html" class="links">S3 Link Content</a></td>
          <td><span class="spans">Page S3</span></td>
        </tr>
      </table>
    </body>
    </html>

    s1.html

    <html> <head></head> <body> <h1>This is the S1 page, first in set</h1> </body> </html>

    s2.html

    <html> <head></head> <body> <h1>This is the S2 page, second in set</h1> </body> </html>

    s3.html

    <html> <head></head> <body> <h1>This is the S3 page, third and final</h1> </body> </html>

    When I try and run the code, this is the output I get

    mdro79@mycpu$ ./mech.pl
    S1 Link Content - file:///home/example/path/s1.html
    Use of uninitialized value in concatenation (.) or string at ./scrMe.pl line 31.
    Use of uninitialized value $tag in string eq at /usr/local/share/perl/5.14.2/HTML/Element.pm line 1109.
    Use of uninitialized value $tag in string eq at /usr/local/share/perl/5.14.2/HTML/Element.pm line 1109.
     - file:///home/example/path/
    Use of uninitialized value in concatenation (.) or string at ./scrMe.pl line 31.
    Use of uninitialized value $tag in string eq at /usr/local/share/perl/5.14.2/HTML/Element.pm line 1109.
    Use of uninitialized value $tag in string eq at /usr/local/share/perl/5.14.2/HTML/Element.pm line 1109.
     - file:///home/example/path/
      You do not have to go back if you are not following the links. Also, if $url is static, you can declare it just once before entering the loop.
      لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

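      Whoa, well that is the trouble with trees and trying to save memory :) The elements in @list belong to the tree of the page they were found on, and that tree appears to be torn down as soon as $mech loads another page, which would explain why only the first link still has its text and href. You can work around it like this: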

      @list = map { $_->clone } @list;
      or
      my @list = map { $_->clone } $mech->look_down(_tag => "a", class => "links" );
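
To see why the clone workaround helps without touching the file system, here is a minimal sketch that uses HTML::TreeBuilder on its own. It assumes that loading a new page amounts to calling delete on the old tree, which is my reading of the behaviour described above rather than something confirmed in this thread. The inline HTML is a cut-down stand-in for index.html, and the variable names (@live, @copies, @summary) are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Stand-in for index.html from the question, trimmed to two links.
my $html = <<'HTML';
<html><body><table>
<tr><td><a href="s1.html" class="links">S1 Link Content</a></td></tr>
<tr><td><a href="s2.html" class="links">S2 Link Content</a></td></tr>
</table></body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Live nodes: these still belong to $tree.
my @live = $tree->look_down(_tag => 'a', class => 'links');

# Detached copies: each clone is a small tree of its own.
my @copies = map { $_->clone } @live;

# Assumption: this mimics what happens to the old page's tree
# when another page is fetched.
$tree->delete;

# The live nodes were emptied together with the tree ...
my $live_href = $live[0]->attr('href');
print 'live href: ', (defined $live_href ? $live_href : '(undef)'), "\n";

# ... while the clones kept their text and attributes.
my @summary = map { $_->as_text . ' - ' . $_->attr('href') } @copies;
print "clone: $_\n" for @summary;

# Clones are not cleaned up with the original tree, so release them
# explicitly; this should be harmless where it is not strictly needed.
$_->delete for @copies;
```

If all you need from each link is its text and href, copying those two strings into a plain list before the first get would avoid the problem as well, with no clones to clean up afterwards.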
