http://www.perlmonks.org?node_id=479716

lampros21_7 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, after a lot of messing about and a LOT of help from the monks I have almost finished a web crawler. The problem is that I want the crawler to strip the HTML from a URL and return a string with just its text content. I use the HTML stripper found within WWW::Mechanize, but I don't think it works.

This is the output I get:

WEB CRAWLER AND HTML EXTRACTOR
Please input the URL of the site to be searched
Please use a full URL (eg. http://www.dcs.shef.ac.
http://www.google.com/
<html><head><meta http-equiv="content-type" conten
-1"><title>Google</title><style><!--
body,td,a,p,.h{font-family:arial,sans-serif;}
.h{font-size: 20px;}
.q{color:#0000cc;}
//-->
</style>
<script>
<!--
function sf(){document.f.q.focus();}
// -->
</script>
</head><body bgcolor=#ffffff text=#000000 link=#00
00 onLoad=sf() topmargin=3 marginheight=3><center>
.gif" width=276 height=110 alt="Google"><br><br>
Terminating on signal SIGINT(2)
The first three lines above are my prompts and input, and the rest is what comes up. My code is this:
use WWW::Mechanize;
use URI;

print "WEB CRAWLER AND HTML EXTRACTOR \n";
print "Please input the URL of the site to be searched \n";
print "Please use a full URL (eg. http://www.dcs.shef.ac.uk/) \n";

# Create an instance of the webcrawler
my $webcrawler = WWW::Mechanize->new();

my $url_name = <STDIN>;           # The user inputs the URL to be searched
my $uri = URI->new($url_name);    # Process the URL and make it a URI

# Grab the contents of the URL given by the user
$webcrawler->get($uri);

# Put the links that exist in the HTML of the URL given by the user in an array
my @website_links = $webcrawler->links($uri);

# The HTML is stripped off the contents and the text is stored in an array of strings
my $x = 0;
my @stripped_html;
$stripped_html[$x] = $webcrawler->content( format => "text" );
print $stripped_html[$x];
Am I doing something wrong here, or is the $webcrawler->content( format => "text" ); function in WWW::Mechanize really not working? Thanks

Replies are listed 'Best First'.
Re: HTML stripper in WWW::Mechanize doesn't seem to work
by Nkuvu (Priest) on Jul 31, 2005 at 17:33 UTC
    Based on the comment in your code ("The HTML is stripped off the contents and the text is stored in an array of strings"), you're assigning the content incorrectly. Note that I don't have WWW::Mechanize installed, so I can't double-check the docs for that.

    The @stripped_html is an array, just like you need. But $stripped_html[$x] is only one element in that array, which means that it's really a scalar¹. Since the content sub returns an array, you're trying to assign an array to a scalar, and you'll end up with the number of things in the array.

    You'll need to change your code a bit.

    # Note that the $x isn't needed with this approach,
    # so I took it out.
    my @stripped_html;
    @stripped_html = $webcrawler->content( format => "text" );

    # You can print the array directly, like this:
    print @stripped_html;

    # Or put it in a loop to specify what you want between
    # the array elements:
    for my $item (@stripped_html) {
        print "$item\n";
    }
    As is, this code prints out the contents twice, just so you can see the different ways to print an array. That wasn't your question, so I'll stop blathering on about it now.

    ¹ Yes, it could be another array or a hash or whatever; I'm talking simplest-case scenario here.

      Right, apologies for this, but I've confused you on one thing. I want one set of stripped HTML to be assigned to one element of the array. So, if www.google.com was my initial website, all its contents would be stored in $stripped_html[0]; then by doing $x = $x + 1; I would move to the next element of the array and assign the next URL's contents to it. Thanks
        You can do this, but you'll have to do something like a join first.

        Consider this simpler example:

        my @mango = ('one', 'two', 'three', 'penguin');
        my $result = @mango;
        print "Result is $result\n";    # prints 4

        $result = join ' ', @mango;
        print "Result is $result\n";    # prints "one two three penguin"
        If the content subroutine returns an array and you assign it in scalar context, you get the count of the things in the array. For your particular code you'll want something like:

        $stripped_html[$x] = join ' ', $webcrawler->content( format => "text" );
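        A minimal sketch of that one-page-per-element scheme, assuming a fixed list of start URLs in place of the STDIN prompt:

        use strict;
        use warnings;
        use WWW::Mechanize;

        # Hypothetical starting list; the original program reads one URL from STDIN.
        my @urls = ('http://www.google.com/', 'http://www.perlmonks.org/');

        my $webcrawler = WWW::Mechanize->new();
        my @stripped_html;

        for my $url (@urls) {
            $webcrawler->get($url);
            next unless $webcrawler->success();
            # join collapses any list into one string, per the advice above,
            # so each page's text lands in exactly one array element
            push @stripped_html, join ' ', $webcrawler->content( format => "text" );
        }

        print "$_\n" for @stripped_html;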
Re: HTML stripper in WWW::Mechanize doesn't seem to work
by sk (Curate) on Jul 31, 2005 at 17:53 UTC
    Update2: No idea what I was reading :$. I did not notice that you wanted the contents stripped of HTML and links. I read it as "strip everything off except links". Oh well!

    Anyways, I added this part of the code to the program I have below:

    my $x = 0;    # Would be at the top, before the loop
    my @stripped_html;
    $stripped_html[$x++] = $webcrawler->content( format => "text" );
    # Loop back, get more URLs, and keep processing.
    map { print $_, $/; } @stripped_html;

    It seems to be giving the content without any HTML tags, but it looks kind of funny for Google. I tried our PM site and it works fine, but the pages are not stored as an array of strings; the entire content goes into the first element.

    output for google.com

    GoogleWebááááImagesááááGroupsááááNewsááááFroogleááááLocaláááámoreá&#9559;áááAdvanced SearchááPreferencesááLanguage ToolsAdvertisingáPrograms - Business Solutions - About Google&#8976;2005 Google - Searching 8,058,044,651 web pages

    end update2

    I think your problem is with the way you are using $webcrawler. If you ask for content, it will give you the content. You stripped off everything but the links and put them in @website_links, but you are not using that array?

    I am confused. I don't think the contents of $webcrawler were changed by calling the links method.
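    As a quick check of that assumption, a small sketch comparing content() before and after a links() call (the URL is just a placeholder):

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://www.perlmonks.org/');

    my $before = $mech->content;
    my @links  = $mech->links;    # extracts links; should leave the stored page alone

    print "content unchanged after links()\n" if $before eq $mech->content;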

    Anyways, you need to have a Link object to print the links; at least this is what I see in the docs:

    $mech->links()

        When called in a list context, returns a list of the links found in
        the last fetched page. In a scalar context it returns a reference to
        an array with those links. Each link is a WWW::Mechanize::Link object.

    I installed mechanize but Link does not seem to be getting installed for me. Will try to compile it and check it out again. Hopefully you can figure out the issue from here.

    -SK

    update: Here is a condensed version of your script

    #!/usr/bin/perl -w
    use WWW::Mechanize;
    use URI;

    print "WEB CRAWLER AND HTML EXTRACTOR \n";

    # Create an instance of the webcrawler
    my $webcrawler = WWW::Mechanize->new();

    my $url_name = "http://www.google.com";
    my $uri = URI->new($url_name);    # Process the URL and make it a URI

    # Grab the contents of the URL given by the user
    $webcrawler->get($uri);
    die "Failed\n" unless $webcrawler->success();    # Check for return status

    # links() returns Link objects.
    map { print $_->url(), "\n"; } $webcrawler->links($uri);

    Output

    WEB CRAWLER AND HTML EXTRACTOR
    /imghp?hl=en&tab=wi&ie=UTF-8
    http://groups-beta.google.com/grphp?hl=en&tab=wg&ie=UTF-8
    /nwshp?hl=en&tab=wn&ie=UTF-8
    /frghp?hl=en&tab=wf&ie=UTF-8
    /lochp?hl=en&tab=wl&ie=UTF-8
    /intl/en/options/
    /advanced_search?hl=en
    /preferences?hl=en
    /language_tools?hl=en
    /ads/
    /intl/en/services/
    /intl/en/about.html

    There are only two major things I changed from your code (other changes were just to reduce code size for testing):

    1. Check for return status

    2. links() returns Link objects, so use the url() method on them. Check out the map section (you can store the links in an array and then do the printing if you want; see the sketch below).
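    A minimal sketch of that store-then-print variant (the URL is a placeholder):

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://www.google.com/');
    die "Failed\n" unless $mech->success();

    # Keep the Link objects in an array instead of printing inside the map
    my @links = $mech->links();
    print $_->url(), "\n" for @links;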

Re: HTML stripper in WWW::Mechanize doesn't seem to work
by johnnywang (Priest) on Jul 31, 2005 at 19:58 UTC
    I can't find your content( format => 'text' ) call in the documentation. You probably should use some other parser, such as HTML::TokeParser:
    use WWW::Mechanize;
    use HTML::TokeParser;

    my $webcrawler = WWW::Mechanize->new();
    $webcrawler->get("http://www.google.com");

    my $content = $webcrawler->content;
    my $parser = HTML::TokeParser->new(\$content);
    while ($parser->get_tag) {
        print $parser->get_trimmed_text(), "\n";
    }
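    One caveat with that loop: text inside <script> and <style> elements comes through as well. A sketch that skips those blocks, using the documented get_tag("/script") form to jump past them:

    use WWW::Mechanize;
    use HTML::TokeParser;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://www.google.com/');

    my $content = $mech->content;
    my $parser  = HTML::TokeParser->new(\$content);

    while ( my $tag = $parser->get_tag ) {
        # Skip everything up to the matching end tag for script/style
        if ( $tag->[0] eq 'script' or $tag->[0] eq 'style' ) {
            $parser->get_tag( "/$tag->[0]" );
            next;
        }
        my $text = $parser->get_trimmed_text();
        print "$text\n" if defined $text and length $text;
    }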
      I installed WWW::Mechanize (I may even end up using it someday), and looked through the docs:
      $mech->content(...)

          Returns the content that the mech uses internally for the last page
          fetched. Ordinarily this is the same as $mech->response()->content(),
          but this may differ for HTML documents if "update_html" is overloaded
          (in which case the value passed to the base-class implementation of
          same will be returned), and/or extra named arguments are passed to
          content():

          $mech->content( format => "text" )
              Returns a text-only version of the page, with all HTML markup
              stripped. This feature requires HTML::TreeBuilder to be
              installed, or a fatal error will be thrown.
      So it looks like the call is correct.
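      For comparison, a sketch of the same stripping done directly with HTML::TreeBuilder, the module that format => "text" requires; new_from_content() and as_text() are standard HTML::TreeBuilder/HTML::Element methods:

      use WWW::Mechanize;
      use HTML::TreeBuilder;

      my $mech = WWW::Mechanize->new();
      $mech->get('http://www.google.com/');

      # Parse the raw HTML and flatten it to plain text, roughly what
      # content( format => "text" ) does for you internally.
      my $tree = HTML::TreeBuilder->new_from_content( $mech->content );
      print $tree->as_text(), "\n";
      $tree->delete();    # TreeBuilder trees need explicit cleanup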
        Right, I have made the necessary changes and I think the code works fine now. The problem is I don't quite think the content( format => "text" ); function in the WWW::Mechanize http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize.pm module works. I have used it with Google and perlmonks.com and it gives me the whole content. Does anyone else have the same problem, or is it something with my code?

        Updated code:

        #!/usr/bin/perl
        use strict;

        # Module used to go through the web pages; can extract links, save them,
        # and also strip the HTML from the contents
        use WWW::Mechanize;
        use URI;

        print "WEB CRAWLER AND HTML EXTRACTOR \n";
        print "Please input the URL of the site to be searched \n";
        print "Please use a full URL (eg. http://www.dcs.shef.ac.uk/) \n";

        # Create an instance of the webcrawler
        my $webcrawler = WWW::Mechanize->new();

        my $url_name = <STDIN>;           # The user inputs the URL to be searched
        my $uri = URI->new($url_name);    # Process the URL and make it a URI

        # Grab the contents of the URL given by the user
        $webcrawler->get($uri);

        # Put the links that exist in the HTML of the URL given by the user in an array
        my @website_links = $webcrawler->links($uri);

        # The HTML is stripped off the contents and the text is stored in an array of strings
        my $x = 0;
        my @stripped_html;
        $stripped_html[$x] = join ' ', $webcrawler->content( format => "text" );
        print $stripped_html[$x];
        $x = $x + 1;

        exit;

        Thanks