Count number of words on a web page?

jdlev has asked for the wisdom of the Perl Monks concerning the following question:

Re: Count number of words on a web page? by JSchmitz (Canon) on Feb 09, 2010 at 13:39 UTC
```perl
#!/usr/bin/perl
use strict;
use warnings;

$/ = "";    # paragraph mode: read one blank-line-separated chunk at a time

my %wordcount;
while (<>) {
    #s/-\n//g;
    tr/A-Z/a-z/;                         # fold to lowercase
    my @words = split(/\W*\s+\W*/, $_);  # split into words
    foreach my $word (@words) {
        $wordcount{$word}++;             # count the words
    }
}
foreach my $word (sort keys(%wordcount)) {
    printf "%8d\t\t%s\n", $wordcount{$word}, $word;
}
```

Hope that helps

Jeffery
Re^2: Count number of words on a web page? by cdarke (Prior) on Feb 09, 2010 at 16:26 UTC
Use word boundaries (\b).

```perl
$_ = 'The,quick,brown;foxy. Does a lot,of:jumping!';
my @words = split(/\W*\s+\W*/, $_);    # split into words
print 'Number of words: ' . @words . "\n";
```

gives 4 words. However

```perl
my @words = split(/\b\W*/, $_);    # split into words
```

gives 9 words.
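Putting the two replies above together, a minimal sketch of the frequency counter using the `\b\W*` split might look like this. The `word_frequencies` name is a hypothetical helper, not something from the thread, and the extra `grep` is an added assumption to discard fields that contain no word characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref mapping each lowercased word
# to the number of times it occurs in $text.
sub word_frequencies {
    my ($text) = @_;
    my %count;
    # Split at a word boundary followed by any run of non-word characters.
    # The grep keeps only fields containing a word character, because text
    # that starts with punctuation or whitespace would otherwise yield a
    # field made of that leading run.
    for my $word (grep { /\w/ } split /\b\W*/, lc $text) {
        $count{$word}++;
    }
    return \%count;
}

my $sample = 'The,quick,brown;foxy. Does a lot,of:jumping!';
my $freq   = word_frequencies($sample);

# Sum the per-word counts to get the total number of words.
my $total = 0;
$total += $_ for values %$freq;
print "Number of words: $total\n";    # prints 9 for the sample string

printf "%8d\t\t%s\n", $freq->{$_}, $_ for sort keys %$freq;
```

Note that `\w` treats digits and underscores as word characters and splits on apostrophes, so "don't" counts as two words here; whether that matters depends on what you want to call a word.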
Re^2: Count number of words on a web page? by jdlev (Scribe) on Feb 09, 2010 at 13:49 UTC
Thanks! I love it when a program comes together - jdhannibal
Re: Count number of words on a web page? by zentara (Archbishop) on Feb 09, 2010 at 17:57 UTC
I would add that you might want to extract the pure text from the HTML page before word counting; otherwise you will get HTML artifacts in your word count, like brackets and slashes. Read `perldoc -q 'remove html'`.

Other ideas:

```perl
#!/usr/bin/perl
# You could use HTML::TokeParser::Simple and only print text tokens.
# Almost straight from the HTML::TokeParser::Simple POD.
use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new( $somefile );
while ( my $token = $p->get_token ) {
    print $token->as_is if $token->is_text;
}
```

```perl
# HTML::Strip - Perl extension for stripping HTML markup from text.
use HTML::Strip;

my $hs = HTML::Strip->new();
my $clean_text = $hs->parse( $raw_html );
$hs->eof;
```

```perl
# HTML::PullParser, asking only for text events.
use HTML::PullParser;

sub strip {
    my $html = shift;
    my $p = HTML::PullParser->new(
        doc  => $html,
        text => 'text',
    );
    my $result = '';
    while ( my $t = $p->get_token ) {
        $result .= $t->[0];
    }
    return $result;
}
```

```perl
#!/usr/bin/perl
# If you just need to strip all the HTML tags from a page,
# and are on a platform with lynx, you can use:
use strict;
use warnings;

my $text = `lynx -dump htmlDocument.html`;
print "$text";
```

or

```
lynx -dump htmlDocument.html > htmlDocument.txt
```

I'm not really a human, but I play one on earth. Old Perl Programmer Haiku
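To tie the stripping step back to the original question, here is a rough sketch that strips the markup first and then counts words. The `count_words_in_html` name and its optional `$strip` code ref are assumptions made for illustration; the default stripper uses HTML::Strip as shown above and assumes that module is installed. Counting with `/\w+/g` is a swapped-in simplification, not the split-based approach from the earlier replies.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the words in an HTML string.
# $strip is an optional code ref that turns HTML into plain text, so any
# of the stripping approaches can be plugged in. When it is omitted,
# HTML::Strip is loaded on demand (assumes the module is installed).
sub count_words_in_html {
    my ($html, $strip) = @_;
    $strip ||= sub {
        require HTML::Strip;
        my $hs   = HTML::Strip->new();
        my $text = $hs->parse($_[0]);
        $hs->eof;
        return $text;
    };
    my $text = $strip->($html);
    # Count runs of word characters in the extracted text.
    my $count = () = $text =~ /\w+/g;
    return $count;
}
```

With HTML::Strip available, `count_words_in_html($raw_html)` would return the word count of the page text; passing a second argument such as `\&strip` swaps in the HTML::PullParser version instead.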
Re: Count number of words on a web page? by planetscape (Chancellor) on Feb 09, 2010 at 18:29 UTC
Depending on your needs, you may wish to take a look at the Ngram Statistics Package, by Ted Pedersen.

HTH,

planetscape

