http://www.perlmonks.org?node_id=479716

lampros21_7 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, after a lot of messing about and a LOT of help from the monks I have almost finished a web crawler. The problem is that I want the crawler to strip the HTML from a URL and return a string with just its text content. I use the HTML stripper found within WWW::Mechanize, but I don't think it works.

This is the output I get:

WEB CRAWLER AND HTML EXTRACTOR
Please input the URL of the site to be searched
Please use a full URL (eg. http://www.dcs.shef.ac.
http://www.google.com/
<html><head><meta http-equiv="content-type" conten
-1"><title>Google</title><style><!--
body,td,a,p,.h{font-family:arial,sans-serif;}
.h{font-size: 20px;}
.q{color:#0000cc;}
//-->
</style>
<script>
<!--
function sf(){document.f.q.focus();}
// -->
</script>
</head><body bgcolor=#ffffff text=#000000 link=#00
00 onLoad=sf() topmargin=3 marginheight=3><center>
.gif" width=276 height=110 alt="Google"><br><br>
Terminating on signal SIGINT(2)
The first three lines above are my prompts and input, and the rest is what comes up. My code is this:
use WWW::Mechanize;
use URI;

print "WEB CRAWLER AND HTML EXTRACTOR \n";
print "Please input the URL of the site to be searched \n";
print "Please use a full URL (eg. http://www.dcs.shef.ac.uk/) \n";

# Create an instance of the webcrawler
my $webcrawler = WWW::Mechanize->new();

my $url_name = <STDIN>;           # The user inputs the URL to be searched
my $uri = URI->new($url_name);    # Process the URL and make it a URI

# Grab the contents of the URL given by the user
$webcrawler->get($uri);

# Put the links that exist in the HTML of the URL given by the user in an array
my @website_links = $webcrawler->links($uri);

# The HTML is stripped off the contents and the text is stored in an array of strings
my $x = 0;
my @stripped_html;
$stripped_html[$x] = $webcrawler->content( format => "text" );
print $stripped_html[$x];
Am I doing something wrong here, or is the $webcrawler->content( format => "text" ); function in WWW::Mechanize really not working? Thanks

Replies are listed 'Best First'.
Re: HTML stripper in WWW::Mechanize doesn't seem to work
by Nkuvu (Priest) on Jul 31, 2005 at 17:33 UTC
    Based on the comment in your code ("The HTML is stripped off the contents and the text is stored in an array of strings"), you're assigning the content incorrectly. Note that I don't have WWW::Mechanize installed, so I can't double-check the docs for that.

    The @stripped_html is an array, just like you need. But $stripped_html[$x] is only one element in that array, which means that it's really a scalar¹. Since the content sub returns an array, you're trying to assign an array to a scalar, and you'll end up with the number of things in the array.

    You'll need to change your code a bit.

    # Note that the $x isn't needed with this approach,
    # so I took it out.
    my @stripped_html;
    @stripped_html = $webcrawler->content( format => "text" );

    # You can print the array directly, like this:
    print @stripped_html;

    # Or put it in a loop to specify what you want between
    # the array elements:
    for my $item (@stripped_html) {
        print "$item\n";
    }
    As is, this code prints out the contents twice, just so you can see the different ways to print an array. That wasn't your question, so I'll stop blathering on about it now.

    ¹ Yes, it could be another array or a hash or whatever; I'm talking simplest-case scenario here.

      Right, apologies for this, but I've confused you on one thing. I want one set of stripped HTML to be assigned to one element of the array. So, if www.google.com was my initial website, all its contents would be stored in $stripped_html[0]; then by doing $x = $x + 1; I would move to the next element of the array and assign the next URL's contents to it. Thanks
        You can do this, but you'll have to do something like a join first.

        Consider this simpler example:

        my @mango = ('one', 'two', 'three', 'penguin');
        my $result = @mango;
        print "Result is $result\n";    # prints 4

        $result = join ' ', @mango;
        print "Result is $result\n";    # prints "one two three penguin"
        If the content subroutine returns an array and you assign it in scalar context, you get the count of the things in the array. For your particular code you'll want something like:

        $stripped_html[$x] = join ' ', $webcrawler->content( format => "text" );
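        A minimal sketch of that one-page-per-element scheme, assuming a fixed list of start URLs in place of the STDIN prompt:

        use strict;
        use warnings;
        use WWW::Mechanize;

        # Hypothetical starting list; the original program reads one URL from STDIN.
        my @urls = ('http://www.google.com/', 'http://www.perlmonks.org/');

        my $webcrawler = WWW::Mechanize->new();
        my @stripped_html;

        for my $url (@urls) {
            $webcrawler->get($url);
            next unless $webcrawler->success();
            # join collapses any list into one string, per the advice above,
            # so each page's text lands in exactly one array element
            push @stripped_html, join ' ', $webcrawler->content( format => "text" );
        }

        print "$_\n" for @stripped_html;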
Re: HTML stripper in WWW::Mechanize doesn't seem to work
by sk (Curate) on Jul 31, 2005 at 17:53 UTC
    Update2: No idea what I was reading :$. I did not notice that you wanted the contents stripped of HTML and links. I read it as "strip everything off except links". Oh well!

    Anyways, I added this part of the code to the program I have below:

    my $x = 0;    # Would be at the top, before the loop
    my @stripped_html;
    $stripped_html[$x++] = $webcrawler->content( format => "text" );
    # Loop back, get more URLs, and keep processing.
    map { print $_, $/; } @stripped_html;

    It seems to be giving the content without any HTML tags, but it looks kind of funny for Google. I tried our PM site and it works fine, but the pages are not stored as an array of strings; the entire content goes into the first element.

    output for google.com

    GoogleWebááááImagesááááGroupsááááNewsááááFroogleááááLocaláááámoreá&#9559;áááAdvanced SearchááPreferencesááLanguage ToolsAdvertisingáPrograms - Business Solutions - About Google&#8976;2005 Google - Searching 8,058,044,651 web pages

    end update2

    I think your problem is with the way you are using $webcrawler. If you ask for content, it will give you the content. You stripped off everything but the links and put them in @website_links, but you are not using that array?

    I am confused. I don't think the contents of $webcrawler were changed by calling the links method.
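    As a quick check of that assumption, a small sketch comparing content() before and after a links() call (the URL is just a placeholder):

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://www.perlmonks.org/');

    my $before = $mech->content;
    my @links  = $mech->links;    # extracts links; should leave the stored page alone

    print "content unchanged after links()\n" if $before eq $mech->content;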

    Anyways, you need to have a Link object to print the links; at least this is what I see in the docs:

    $mech->links()

        When called in a list context, returns a list of the links found in
        the last fetched page. In a scalar context it returns a reference to
        an array with those links. Each link is a WWW::Mechanize::Link object.

    I installed mechanize but Link does not seem to be getting installed for me. Will try to compile it and check it out again. Hopefully you can figure out the issue from here.

    -SK

    update: Here is a condensed version of your script

    #!/usr/bin/perl -w
    use WWW::Mechanize;
    use URI;

    print "WEB CRAWLER AND HTML EXTRACTOR \n";

    # Create an instance of the webcrawler
    my $webcrawler = WWW::Mechanize->new();

    my $url_name = "http://www.google.com";
    my $uri = URI->new($url_name);    # Process the URL and make it a URI

    # Grab the contents of the URL given by the user
    $webcrawler->get($uri);
    die "Failed\n" unless $webcrawler->success();    # Check for return status

    # links() returns Link objects.
    map { print $_->url(), "\n"; } $webcrawler->links($uri);

    Output

    WEB CRAWLER AND HTML EXTRACTOR
    /imghp?hl=en&tab=wi&ie=UTF-8
    http://groups-beta.google.com/grphp?hl=en&tab=wg&ie=UTF-8
    /nwshp?hl=en&tab=wn&ie=UTF-8
    /frghp?hl=en&tab=wf&ie=UTF-8
    /lochp?hl=en&tab=wl&ie=UTF-8
    /intl/en/options/
    /advanced_search?hl=en
    /preferences?hl=en
    /language_tools?hl=en
    /ads/
    /intl/en/services/
    /intl/en/about.html

    There are only two major things I changed from your code (other changes were just to reduce code size for testing):

    1. Check for return status

    2. links() returns Link objects, so use the url() method on them. Check out the map section (you can store the links in an array and then do the printing if you want; see the sketch below).
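    A minimal sketch of that store-then-print variant (the URL is a placeholder):

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://www.google.com/');
    die "Failed\n" unless $mech->success();

    # Keep the Link objects in an array instead of printing inside the map
    my @links = $mech->links();
    print $_->url(), "\n" for @links;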

Re: HTML stripper in WWW::Mechanize doesn't seem to work
by johnnywang (Priest) on Jul 31, 2005 at 19:58 UTC
    I can't find your content( format => 'text' ) call in the documentation. You probably should use some other parser, such as HTML::TokeParser:
    use WWW::Mechanize;
    use HTML::TokeParser;

    my $webcrawler = WWW::Mechanize->new();
    $webcrawler->get("http://www.google.com");

    my $content = $webcrawler->content;
    my $parser = HTML::TokeParser->new(\$content);
    while ($parser->get_tag) {
        print $parser->get_trimmed_text(), "\n";
    }
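    One caveat with that loop: text inside <script> and <style> elements comes through as well. A sketch that skips those blocks, using the documented get_tag("/script") form to jump past them:

    use WWW::Mechanize;
    use HTML::TokeParser;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://www.google.com/');

    my $content = $mech->content;
    my $parser  = HTML::TokeParser->new(\$content);

    while ( my $tag = $parser->get_tag ) {
        # Skip everything up to the matching end tag for script/style
        if ( $tag->[0] eq 'script' or $tag->[0] eq 'style' ) {
            $parser->get_tag( "/$tag->[0]" );
            next;
        }
        my $text = $parser->get_trimmed_text();
        print "$text\n" if defined $text and length $text;
    }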
      I installed WWW::Mechanize (I may even end up using it someday), and looked through the docs:
      $mech->content(...)

          Returns the content that the mech uses internally for the last page
          fetched. Ordinarily this is the same as $mech->response()->content(),
          but this may differ for HTML documents if "update_html" is overloaded
          (in which case the value passed to the base-class implementation of
          same will be returned), and/or extra named arguments are passed to
          content():

          $mech->content( format => "text" )
              Returns a text-only version of the page, with all HTML markup
              stripped. This feature requires HTML::TreeBuilder to be
              installed, or a fatal error will be thrown.
      So it looks like the call is correct.
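      For comparison, a sketch of the same stripping done directly with HTML::TreeBuilder, the module that format => "text" requires; new_from_content() and as_text() are standard HTML::TreeBuilder/HTML::Element methods:

      use WWW::Mechanize;
      use HTML::TreeBuilder;

      my $mech = WWW::Mechanize->new();
      $mech->get('http://www.google.com/');

      # Parse the raw HTML and flatten it to plain text, roughly what
      # content( format => "text" ) does for you internally.
      my $tree = HTML::TreeBuilder->new_from_content( $mech->content );
      print $tree->as_text(), "\n";
      $tree->delete();    # TreeBuilder trees need explicit cleanup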
        Right, I have made the necessary changes and I think the code works fine now. The problem is I don't quite think the content( format => "text" ); function in the WWW::Mechanize http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize.pm module works. I have used it with Google and perlmonks.com and it gives me the whole content. Does anyone else have the same problem, or is it something with my code?

        Updated code:

        #!/usr/bin/perl
        use strict;

        # Module used to go through the web pages; can extract links, save them,
        # and also strip the HTML from the contents
        use WWW::Mechanize;
        use URI;

        print "WEB CRAWLER AND HTML EXTRACTOR \n";
        print "Please input the URL of the site to be searched \n";
        print "Please use a full URL (eg. http://www.dcs.shef.ac.uk/) \n";

        # Create an instance of the webcrawler
        my $webcrawler = WWW::Mechanize->new();

        my $url_name = <STDIN>;           # The user inputs the URL to be searched
        my $uri = URI->new($url_name);    # Process the URL and make it a URI

        # Grab the contents of the URL given by the user
        $webcrawler->get($uri);

        # Put the links that exist in the HTML of the URL given by the user in an array
        my @website_links = $webcrawler->links($uri);

        # The HTML is stripped off the contents and the text is stored in an array of strings
        my $x = 0;
        my @stripped_html;
        $stripped_html[$x] = join ' ', $webcrawler->content( format => "text" );
        print $stripped_html[$x];
        $x = $x + 1;

        exit;

        Thanks