
Hi, after a lot of messing about and a LOT of help from the monks, I have almost finished a web crawler. The problem is that I want the crawler to strip the HTML from the page at a given URL and return a string with just its text content. I use the HTML stripper built into WWW::Mechanize, but I don't think it works.

This is the output I get:

WEB CRAWLER AND HTML EXTRACTOR
Please input the URL of the site to be searched
Please use a full URL (eg. http://www.dcs.shef.ac.
http://www.google.com/
<html><head><meta http-equiv="content-type" conten
-1"><title>Google</title><style><!--
body,td,a,p,.h{font-family:arial,sans-serif;}
.h{font-size: 20px;}
.q{color:#0000cc;}
//-->
</style>
<script>
<!--
function sf(){document.f.q.focus();}
// -->
</script>
</head><body bgcolor=#ffffff text=#000000 link=#00
00 onLoad=sf() topmargin=3 marginheight=3><center>
.gif" width=276 height=110 alt="Google"><br><br>
Terminating on signal SIGINT(2)

The first three lines above are the program's prompts, the fourth is my input, and the rest is what comes back. My code is this:

 use strict;
 use warnings;
 use WWW::Mechanize;
 use URI;
  
 print "WEB CRAWLER AND HTML EXTRACTOR \n";
 print "Please input the URL of the site to be searched \n";
 print "Please use a full URL (eg. http://www.dcs.shef.ac.uk/) \n";
 
 #Create an instance of the webcrawler
 my $webcrawler = WWW::Mechanize->new();

 my $url_name = <STDIN>; # The user inputs the URL to be searched
 chomp $url_name;        # Drop the trailing newline from the input
 
 my $uri = URI->new($url_name); # Process the URL and make it a URI
 
 #Grab the contents of the URL given by the user
 $webcrawler->get($uri);
  
 # Put the links that exist in the HTML of the URL given by the user in an array
 my @website_links = $webcrawler->links(); # links() takes no arguments
 
 # The HTML is stripped off the contents and the text is stored in an array of strings
 my $x = 0;
 my @stripped_html;
 $stripped_html[$x] = $webcrawler->content( format => "text" );
 print $stripped_html[$x];

Am I doing something wrong here, or is the $webcrawler->content( format => "text" ) method in WWW::Mechanize really not working? Thanks.
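
For comparison, here is a minimal sketch of the result I am after, done directly with HTML::TreeBuilder instead of through WWW::Mechanize. As far as I can tell from the WWW::Mechanize documentation, content( format => "text" ) depends on HTML::TreeBuilder being installed, so this may help narrow down where the problem is. The helper name strip_html and the sample page in $sample are made up for illustration; they are not part of either module.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of WWW::Mechanize): parse an HTML string
# and return only its text nodes, with all markup removed.
sub strip_html {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;   # text under <script> and <style> is skipped
    $tree->delete;               # release the parse tree explicitly
    return $text;
}

# Made-up sample page, loosely modelled on the output shown above.
my $sample = '<html><head><title>Google</title>'
           . '<script>function sf(){}</script></head>'
           . '<body><p>Hello <b>monks</b></p></body></html>';

print strip_html($sample), "\n";
```

If this prints plain text on my machine while the WWW::Mechanize call still returns raw HTML, that would point at the installed WWW::Mechanize (printing WWW::Mechanize->VERSION should show which one it is) rather than at my code.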

In reply to HTML stripper in WWW::Mechanize doesn't seem to work by lampros21_7
