http://www.perlmonks.org?node_id=479722


in reply to HTML stripper in WWW::Mechanize doesn't seem to work

Update2: No idea what I was reading :$. I did not notice that you wanted the content stripped of both HTML and links; I read it as "strip everything off except the links". Oh well!

Anyway, I added this bit of code to the program I have below:

my $x = 0;                # Would be at the top, before the loop
my @stripped_html;

$stripped_html[$x++] = $webcrawler->content( format => "text" );
# Loop back, get more URLs, and keep processing.

map { print $_, $/; } @stripped_html;

It seems to give the content without any HTML tags, but the output looks kind of funny for Google. I tried our PM site and it works fine; note that since only one page is fetched, the pages are not spread across an array of strings -- the entire content ends up in the first element.

Output for google.com:

GoogleWebááááImagesááááGroupsááááNewsááááFroogleááááLocaláááámoreá&#9559;áááAdvanced SearchááPreferencesááLanguage ToolsAdvertisingáPrograms - Business Solutions - About Google⌐2005 Google - Searching 8,058,044,651 web pages
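For completeness, here is a rough, untested sketch of what the "loop back and get more URLs" part could look like. The @urls_to_crawl list is hypothetical (not from your post), and content( format => "text" ) needs HTML::TreeBuilder installed, if I remember right:

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

# Hypothetical list of start URLs -- just for illustration.
my @urls_to_crawl = ( "http://www.google.com", "http://www.perlmonks.org" );

my $webcrawler = WWW::Mechanize->new();
my @stripped_html;
my $x = 0;

for my $url (@urls_to_crawl) {
    $webcrawler->get($url);
    next unless $webcrawler->success();

    # One element per fetched page, with the HTML tags stripped off.
    $stripped_html[$x++] = $webcrawler->content( format => "text" );
}

map { print $_, $/; } @stripped_html;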

end update2

I think your problem is with the way you are using $webcrawler. If you ask it for content, it will give you the content. You stripped everything off and put the result in @website_links, but you are not using that array anywhere?

I am confused. I don't think the contents of $webcrawler were changed by calling the links() method.

Anyway, you need a Link object to print the links; at least, this is what I see in the docs:

$mech->links()

When called in a list context, returns a list of the links found in the last fetched page. In a scalar context it returns a reference to an array with those links. Each link is a WWW::Mechanize::Link object.
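A minimal sketch of the two calling styles (assuming $mech has already fetched a page, as in the script further down):

# List context: a list of WWW::Mechanize::Link objects.
my @links = $mech->links();

# Scalar context: a reference to an array of those same objects.
my $links_ref = $mech->links();

for my $link (@links) {
    # url() is a method on WWW::Mechanize::Link.
    print $link->url(), "\n";
}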

I installed WWW::Mechanize, but WWW::Mechanize::Link does not seem to have been installed for me. I will try to build it and check it out again. Hopefully you can figure out the issue from here.

-SK

Update: Here is a condensed version of your script:

#!/usr/bin/perl -w
use WWW::Mechanize;
use URI;

print "WEB CRAWLER AND HTML EXTRACTOR \n";

# Create an instance of the webcrawler
my $webcrawler = WWW::Mechanize->new();

my $url_name = "http://www.google.com";
my $uri = URI->new($url_name);    # Process the URL and make it a URI

# Grab the contents of the URL given by the user
$webcrawler->get($uri);
die "Failed\n" unless $webcrawler->success();    # Check for return status

# links() returns WWW::Mechanize::Link objects.
map { print $_->url(), "\n"; } $webcrawler->links();

Output

WEB CRAWLER AND HTML EXTRACTOR
/imghp?hl=en&tab=wi&ie=UTF-8
http://groups-beta.google.com/grphp?hl=en&tab=wg&ie=UTF-8
/nwshp?hl=en&tab=wn&ie=UTF-8
/frghp?hl=en&tab=wf&ie=UTF-8
/lochp?hl=en&tab=wl&ie=UTF-8
/intl/en/options/
/advanced_search?hl=en
/preferences?hl=en
/language_tools?hl=en
/ads/
/intl/en/services/
/intl/en/about.html

There are only two major things I changed from your code (the other changes were just to reduce code size for testing):

1. Check for return status

2. links() returns WWW::Mechanize::Link objects, so use the url() method on them. Check out the map section (you can store the URLs in an array and then do the printing later if you want; see the sketch below).
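For example, a small sketch of that store-then-print variant; I am using url_abs() here (also a WWW::Mechanize::Link method) so the relative paths in the output above come back as full URLs:

# Collect the URLs into an array first...
my @website_links = map { $_->url_abs() } $webcrawler->links();

# ...then do the printing as a separate step.
print "$_\n" for @website_links;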