PerlMonks  

Re: Printing From Several Webpages

by ig (Vicar)
on Jul 06, 2012 at 20:06 UTC ( #980368=note )


in reply to Printing From Several Webpages

Your description of your problem is a bit vague. You say you don't know how to use foreach loops, but I don't see anything wrong with how you used foreach in what you posted.

However it does not work even though no error shows

No error shows because you are not checking for and reporting errors. For example, the synopsis of LWP::Simple has this example:

    use LWP::Simple;
    $content = get("http://www.sn.no/");
    die "Couldn't get it!" unless defined $content;

In your program, you have used get, but you have not checked the result and reported the problem when it doesn't work. Add the check:

    #Download all the modules I used#
    use LWP::Simple;
    use HTML::TreeBuilder;
    use HTML::FormatText;
    use WWW::Mechanize;
    use Data::Dumper;

    #Download original webpage and acquire 500+ Links#
    $url = "http://wx.toronto.ca/festevents.nsf/all?openform";
    my $mechanize = WWW::Mechanize->new(autocheck => 1);
    $mechanize->get($url);
    my $title = $mechanize->title;
    print "<b>$title</b><br />";
    my @links = $mechanize->links;

    ## THIS IS WHERE MY PROBLEM STARTS: I dont know how to use foreach loops.
    ## I thought if I put the "$link" variable as the "get ()" each time it
    ## would go through the loop it would "get" a different webpage. However
    ## it does not work even though no error shows##
    foreach my $link (@links) {
        # Retrieve the link URL
        my $href = $link->url;

        $URL1= get("$link");
        die "Couldn't get '$link'" unless defined $URL1;

        $Format=HTML::FormatText->new;
        $TreeBuilder=HTML::TreeBuilder->new;
        $TreeBuilder->parse($URL1);
        $Parsed=$Format->format($TreeBuilder);
        open(FILE, ">TorontoParties.txt");
        print FILE "$Parsed";
        close (FILE);
    }

and you get

    Couldn't get 'WWW::Mechanize::Link=ARRAY(0x37c6e2c)' at test.pl line 34.
    <b>Festival and event calendar - all</b><br />

Notice how $link appears in the error message. It's not a string, it's an object reference, and that's how object references appear when interpolated into strings.
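To see the same effect without fetching anything, here is a minimal sketch. Demo::Link is a hypothetical stand-in class (it is not WWW::Mechanize::Link itself), built as a blessed array reference so that it stringifies in the same way as the object in the error message.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical stand-in for a link object: a blessed array reference
# with a url accessor. This is not the real WWW::Mechanize::Link.
package Demo::Link;
sub new { my ($class, $url) = @_; return bless [$url], $class }
sub url { my ($self) = @_; return $self->[0] }

package main;

my $link = Demo::Link->new('all?openform');

# Interpolating the object gives the class name, the underlying
# data type and a memory address, not the URL.
my $as_string = "$link";
print "interpolated: $as_string\n";

# blessed() returns the class name for an object and undef for a
# plain string, so it is a quick way to tell the two apart.
print "class:        ", blessed($link), "\n";

# The URL has to be requested from the object explicitly.
my $href = $link->url;
print "url:          $href\n";
```

Printing a variable like this, or checking it with blessed, is usually the quickest way to find out whether you are holding a string or an object.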

Now check the documentation for LWP::Simple to see if its get function accepts WWW::Mechanize::Link objects. The documentation doesn't say that it does, and the result you are getting suggests that it doesn't, or perhaps that there is something else wrong with the link.

One of the problems with LWP::Simple is that it doesn't give you much information when something goes wrong. Note what LWP::Simple says about the get function:

You will not be able to examine the response code or response headers (like 'Content-Type') when you are accessing the web using this function. If you need that information you should use the full OO interface (see LWP::UserAgent).

That's why I usually use LWP::UserAgent. I like to be able to get more information about what went wrong when things go wrong. It's not hard to use. In your program you could pretty much just copy the example from the synopsis, substituting your variables:

    #Download all the modules I used#
    use LWP::UserAgent;
    use HTML::TreeBuilder;
    use HTML::FormatText;
    use WWW::Mechanize;
    use Data::Dumper;

    #Download original webpage and acquire 500+ Links#
    $url = "http://wx.toronto.ca/festevents.nsf/all?openform";
    my $mechanize = WWW::Mechanize->new(autocheck => 1);
    $mechanize->get($url);
    my $title = $mechanize->title;
    print "<b>$title</b><br />";
    my @links = $mechanize->links;

    ## THIS IS WHERE MY PROBLEM STARTS: I dont know how to use foreach loops.
    ## I thought if I put the "$link" variable as the "get ()" each time it
    ## would go through the loop it would "get" a different webpage. However
    ## it does not work even though no error shows##
    foreach my $link (@links) {
        # Retrieve the link URL
        my $href = $link->url;

        #
        # $URL1= get("$link");
        #
        my $ua = LWP::UserAgent->new;
        my $response = $ua->get($link);
        unless($response->is_success) {
            die $response->status_line;
        }
        my $URL1 = $response->decoded_content;
        die Dumper($URL1);

        $Format=HTML::FormatText->new;
        $TreeBuilder=HTML::TreeBuilder->new;
        $TreeBuilder->parse($URL1);
        $Parsed=$Format->format($TreeBuilder);
        open(FILE, ">TorontoParties.txt");
        print FILE "$Parsed";
        close (FILE);
    }

Now when you run it, you get

    Can't use a WWW::Mechanize::Link object as a URI at C:/strawberry/perl/site/lib/HTTP/Request/Common.pm line 106
    <b>Festival and event calendar - all</b><br />

That error message is a bit easier to understand than the previous one. The question is, if one can't use a WWW::Mechanize::Link object as a URI, what can one use? You should be able to find the answer to that question in the LWP::UserAgent documentation, but it's not obvious. Nonetheless, you know you need something other than the object you have.

You already got a URL from the $link object. If you try using $href instead of $link in the call to get, you get quite a different result:

    400 URL must be absolute at test.pl line 39.
    <b>Festival and event calendar - all</b><br />

You can check whether $href contains an absolute URL by printing it, but the error is quite plain. Fortunately, WWW::Mechanize::Link has a url_abs method that returns an absolute URL. Use that instead and you get a page back.
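As a rough illustration of the difference between the two forms, the sketch below resolves a relative link against the page it came from. It uses the URI module directly as a stand-in so that it runs without fetching anything. The relative path is made up for the example, and I am assuming url_abs does essentially the same resolution against the page's base URL.

```perl
use strict;
use warnings;
use URI;

# The page the links were extracted from (taken from the program above).
my $base = 'http://wx.toronto.ca/festevents.nsf/all?openform';

# A made-up relative link, for illustration only.
my $relative = 'css/example.css';

# new_abs resolves a relative reference against a base URL.
my $absolute = URI->new_abs($relative, $base);

print "relative: $relative\n";
print "absolute: $absolute\n";

# An absolute URL has a scheme (http here); a relative one does not.
print "has a scheme: ", (defined $absolute->scheme ? "yes" : "no"), "\n";
```

The relative form is only meaningful to something that already knows the base page, which is why a fresh LWP::UserAgent rejects it.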

    #Download all the modules I used#
    use LWP::UserAgent;
    use HTML::TreeBuilder;
    use HTML::FormatText;
    use WWW::Mechanize;
    use Data::Dumper;

    #Download original webpage and acquire 500+ Links#
    $url = "http://wx.toronto.ca/festevents.nsf/all?openform";
    my $mechanize = WWW::Mechanize->new(autocheck => 1);
    $mechanize->get($url);
    my $title = $mechanize->title;
    print "<b>$title</b><br />";
    my @links = $mechanize->links;

    ## THIS IS WHERE MY PROBLEM STARTS: I dont know how to use foreach loops.
    ## I thought if I put the "$link" variable as the "get ()" each time it
    ## would go through the loop it would "get" a different webpage. However
    ## it does not work even though no error shows##
    foreach my $link (@links) {
        # Retrieve the link URL
        my $href = $link->url_abs;

        #
        # $URL1= get("$link");
        #
        my $ua = LWP::UserAgent->new;
        my $response = $ua->get($href);
        unless($response->is_success) {
            die $response->status_line;
        }
        my $URL1 = $response->decoded_content;
        die Dumper($URL1);

        $Format=HTML::FormatText->new;
        $TreeBuilder=HTML::TreeBuilder->new;
        $TreeBuilder->parse($URL1);
        $Parsed=$Format->format($TreeBuilder);
        open(FILE, ">TorontoParties.txt");
        print FILE "$Parsed";
        close (FILE);
    }

gives

    $VAR1 = "\x{feff}/* Adjust default template */
    #header001 {padding-bottom: 17px;}
    #background-nav{ width: 100%; float: left; overflow: hidden;}
    .wrapper{width: 100%; }
    #nav-side{}
    #nav-side h2{margin-bottom: 0em ! important;}
    #content{ width: 100%;float: right; margin: 0 -147px 0; }
    /**/
    body,h1, h2, h3, h4, h5, h6, form,input {color: #000; font-family: Arial,Helvetica,sans-serif; margin: 0px; padding: 0px; /*background-color: #fff; */}
    a:hover{color: #000; }
    h2{font-size: 1.3em;}
    ol li{ margin-left: 20px;}
    h2.icon-rss{ background:url(../images/rss14x14.gif) no-repeat 0px 2px; padding-left: 18px;}
    li.icon-rss{ background:url(../images/rss10x10.gif) no-repeat 0px 4px; list-style: none; margin-left: -15px; padding-left: 15px;}
    .general-text{line-height: 0em; line-height: 1em ! important; }
    .general-text.body{ float: left;width: 10%; background:#ccc;}
    .general-text h2{font-size: 1.3em; margin-bottom: 0.5em;}
    .bullet {background: url(../images/section1_bullet.gif) no-repeat 0 5px; padding-left: 10px;}
    .shade {color: #999;}
    .terms-of-use{}
    .terms-of-use li, .general-text ol li{margin-top: 1em;}
    .terms-of-use label{ font-weight: bold; font-size: 1.5em; margin-left: 3em;}
    #evt-feature{ border: 1px solid #ccc; float: left; clear: both; padding: 3px; width: 396px; }
    #evt-feature .desc h2{ color: #000; font-size: 1.5em; font-weight: normal; margin-bottom: 8px; margin-top: 8px;}
    #evt-feature .desc p{ color: #333; font-size: 0.965em;}
    #evt-feature .desc .highlight{ background: none; border: none; clear: both; float: left;}
    #evt-feature .two-column{ float: left;}
    #evt-feature .two-column .col0{ border-right: 1px solid #ccc; float: left; padding-right: 10px; width: 260px; }
    #evt-feature .two-column .col1{ float: left; padding-left: 10px; width: 10px;}
    #evt-highlight{ clear: both; float: left;margin-top: 14px; width: 404px;}
    #evt-highlight .h{display: block; float: left; width: 129px;}
    #evt-highlight .h.spacing{margin-left: 7px; margin-right: 7px;}
    #evt-highlight p.img{border: 1px solid #ccc; padding: 3px; margin-bottom: 0.05em;}
    #evt-highlight p{ font-size: 1em; color: #333; padding: 3px; padding-top: 0px;}
    #category-body{ float: left;}
    #banner{ border: 1px solid #ccc; clear: both; display: block; float: left; height: 100px; padding: 3px; width: 82%; margin-bottom: 14px;}
    #banner h2 { display: block; float: left; font-size: 1.5em; font-weight: normal; margin-top: 5px; margin-bottom: 6px; height: 1.4em;}
    #banner .img{ display: block; float: left; height: 65px; width: 100%; }
    #evt-selection{ display: block; float: left; margin-left: 10px; width: 200px;}
    #evt-selection #calendar{border: none ! important; width: 200px; float:left; clear:both;}
    #evt-selection #calendar{margin-bottom: -24px;}
    #evt-selection form label{ margin-top: 12px;}
    #evt-selection input.textbox, #evt-selection select.textbox{ width: 189px;}
    #evt-selection input.button {margin-top: 14px;}
    #evt-listing{ clear: both; display: block; float: left;}
    #evt-listing h2{ font-size: 1.2em; font-weight: normal; margin-bottom: 14px; margin-top: 14px;}
    #evt-listing table{ width: 600px;}
    #evt-listing table th{ text-align: left; background: #ccc;}
    #evt-listing table td {padding-top: 0.5em;}
    #evt-listing table td.col0, #evt-listing table th.col0{ border-left: 1px #fff solid; width: 7%; padding: 5px;}
    #evt-listing table td.col1, #evt-listing table th.col1{ border-left: 1px #fff solid; width: 63%; padding: 5px; padding-left: 10px;}
    #evt-listing table td.col2, #evt-listing table th.col2{ border-left: 1px #fff solid; width: 15%; padding: 5px;}
    #evt-listing table td.col3, #evt-listing table th.col3{ border-left: 1px #fff solid; width: 25%; padding: 5px;}";
    <b>Festival and event calendar - all</b><br />

That looks more like a result you can use.

The point is that by focusing your attention on the problem you can see, you can investigate and work your way back to its cause. You should also always check that the functions you call succeeded before going on. If one fails, handle the failure, usually by producing an error message, and sometimes by doing more than that, such as trying another method.
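Here is one way that kind of failure handling might look in a loop over many links: report each failure and carry on, rather than dying on the first one. This is only a sketch. fetch_all is a name I made up, and Demo::UserAgent and Demo::Response are hypothetical stand-ins that let it run without network access. A real LWP::UserAgent object could be passed in instead, since its get returns a response with the same is_success, status_line and decoded_content methods.

```perl
use strict;
use warnings;

# Fetch every URL with the given user agent. Failures are reported
# with warn and recorded, and the loop carries on with the next URL.
sub fetch_all {
    my ($ua, @urls) = @_;
    my (%content, %failed);
    for my $url (@urls) {
        my $response = $ua->get($url);
        if ($response->is_success) {
            $content{$url} = $response->decoded_content;
        }
        else {
            warn "Couldn't get '$url': ", $response->status_line, "\n";
            $failed{$url} = $response->status_line;
        }
    }
    return (\%content, \%failed);
}

# Hypothetical stand-ins so the sketch runs offline. They mimic only
# the three response methods used above.
package Demo::Response;
sub new             { my ($class, %args) = @_; return bless { %args }, $class }
sub is_success      { $_[0]{code} == 200 }
sub status_line     { "$_[0]{code} $_[0]{message}" }
sub decoded_content { $_[0]{content} }

package Demo::UserAgent;
sub new { return bless {}, shift }
sub get {
    my ($self, $url) = @_;
    # Pretend that absolute http URLs succeed and anything else fails.
    return $url =~ m{^http://}
        ? Demo::Response->new(code => 200, message => 'OK', content => "page for $url")
        : Demo::Response->new(code => 400, message => 'URL must be absolute');
}

package main;

my ($content, $failed) = fetch_all(
    Demo::UserAgent->new,
    'http://example.com/a', 'relative/link', 'http://example.com/b',
);

printf "fetched %d, failed %d\n", scalar(keys %$content), scalar(keys %$failed);
```

Whether to die, warn and continue, or retry depends on the program. The part that matters is that every failure is noticed and reported somewhere.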

