http://www.perlmonks.org?node_id=822189

jdlev has asked for the wisdom of the Perl Monks concerning the following question:

Is there a simple way to count the number of times a word appears on a web page?
I love it when a program comes together - jdhannibal

Re: Count number of words on a web page?
by JSchmitz (Canon) on Feb 09, 2010 at 13:39 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    $/ = "";    # paragraph mode: read one blank-line-separated chunk at a time
    my %wordcount;
    while (<>) {
        #s/-\n//g;
        tr/A-Z/a-z/;    # fold everything to lower case
        my @words = split(/\W*\s+\W*/, $_);    # split into words
        foreach my $word (@words) {
            $wordcount{$word}++;    # count the words
        }
    }
    foreach my $word (sort keys(%wordcount)) {
        printf "%8d\t\t%s\n", $wordcount{$word}, $word;
    }

    Hope that helps

    Jeffery
      Use word boundaries (\b).
      $_ ='The,quick,brown;foxy. Does a lot,of:jumping!';
      my @words = split(/\W*\s+\W*/, $_); # split into words
      print 'Number of words: '.@words."\n";
      gives 4 words. However
      my @words = split(/\b\W*/, $_); # split into words
      gives 9 words.
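If the goal is the count for one particular word rather than a full table, the same word-boundary idea can be used in a match instead of a split. This is only a minimal sketch: `count_word` is a made-up helper name, and the sample string is the one above with "The end." appended for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count case-insensitive whole-word occurrences of $word.
sub count_word {
    my ($text, $word) = @_;
    # \b anchors the match at word boundaries; \Q...\E quotes regex metacharacters.
    # Assigning the match list to () and then to a scalar yields the match count.
    my $n = () = $text =~ /\b\Q$word\E\b/gi;
    return $n;
}

my $text = 'The,quick,brown;foxy. Does a lot,of:jumping! The end.';
print count_word($text, 'the'), "\n";    # "The" appears twice
print count_word($text, 'fox'), "\n";    # 0, because "foxy" is not the word "fox"
```

Without the `\b` anchors the second call would also count the "fox" inside "foxy".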
      Thanks!
      I love it when a program comes together - jdhannibal
Re: Count number of words on a web page?
by zentara (Archbishop) on Feb 09, 2010 at 17:57 UTC
    I would add that you might want to extract the plain text from the HTML page before counting words. Otherwise you will get HTML artifacts in your word count, such as tag names, brackets and slashes. Read perldoc -q 'remove html'
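A small illustration of why this matters, as a sketch only. The HTML string and the `word_counts` helper are invented for the example, and the one-line regex substitution is a crude stand-in for a real stripper: it is good enough for this toy string but not for real pages (comments, scripts, `>` inside attributes).

```perl
use strict;
use warnings;

# Hypothetical helper: count every run of word characters, case-insensitively.
sub word_counts {
    my ($text) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /(\w+)/g;
    return \%count;
}

# Made-up sample markup.
my $raw_html = '<p>The <b>quick</b> fox. <a href="/fox">The fox</a></p>';

# Crude tag removal, for illustration only; use a proper HTML parser on real pages.
(my $plain = $raw_html) =~ s/<[^>]*>/ /g;

my $raw   = word_counts($raw_html);
my $clean = word_counts($plain);

printf "raw:   fox=%d p=%d\n", $raw->{fox},   $raw->{p}   // 0;
printf "clean: fox=%d p=%d\n", $clean->{fox}, $clean->{p} // 0;
```

On the raw markup, "fox" is counted three times (once from the `href` value) and tag names such as `p` and `b` show up as words. After stripping, "fox" is counted twice and the tag names are gone.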

    Other ideas:

    #!/usr/bin/perl
    # You could use HTML::TokeParser::Simple and only print text tokens.
    # Almost straight from the HTML::TokeParser::Simple POD.
    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new( $somefile );
    while ( my $token = $p->get_token ) {
        print $token->as_is if $token->is_text;
    }

    ###################################################################
    # HTML::Strip - Perl extension for stripping HTML markup from text.
    use HTML::Strip;
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $raw_html );
    $hs->eof;

    ###################################################################
    # HTML::PullParser, keeping only the text events.
    use HTML::PullParser;
    sub strip {
        my $html = shift;
        my $p = HTML::PullParser->new(
            doc  => $html,
            text => 'text',
        );
        my $result = '';
        while ( my $t = $p->get_token ) {
            $result .= $t->[0];
        }
        return $result;
    }

    ###################################################################
    # If you just need to strip all the HTML tags from a page,
    # and are on a platform with lynx, you can use:
    #! /usr/bin/perl
    use strict;
    use warnings;
    my $text = `lynx -dump htmlDocument.html`;
    print $text;

    ###################################################################
    # or, from the shell:
    # lynx -dump htmlDocument.html > htmlDocument.txt

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku
Re: Count number of words on a web page?
by planetscape (Chancellor) on Feb 09, 2010 at 18:29 UTC