PerlMonks  

Count number of words on a web page?

by jdlev (Scribe)
on Feb 09, 2010 at 13:33 UTC
jdlev has asked for the wisdom of the Perl Monks concerning the following question:

Is there a simple way to count the number of times a word appears on a web page?
I love it when a program comes together - jdhannibal

Re: Count number of words on a web page?
by JSchmitz (Canon) on Feb 09, 2010 at 13:39 UTC
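    #!/usr/bin/perl
    use strict;
    use warnings;

    $/ = "";    # paragraph mode: read one blank-line-separated block at a time
    my %wordcount;

    while (<>) {
        #s/-\n//g;      # optionally rejoin words hyphenated across line breaks
        tr/A-Z/a-z/;    # fold everything to lowercase
        my @words = split(/\W*\s+\W*/, $_);    # split into words
        foreach my $word (@words) {
            next unless length $word;    # skip the empty field left by leading whitespace
            $wordcount{$word}++;         # count the words
        }
    }

    foreach my $word (sort keys %wordcount) {
        printf "%8d\t\t%s\n", $wordcount{$word}, $word;
    }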

    Hope that helps

    Jeffery
      Thanks!
      Use word boundaries (\b).
      $_ = 'The,quick,brown;foxy. Does a lot,of:jumping!';
      my @words = split(/\W*\s+\W*/, $_);    # split into words
      print 'Number of words: ' . @words . "\n";
      gives 4 words, because that pattern only splits where there is whitespace. However,
      my @words = split(/\b\W*/, $_); # split into words
      gives 9 words.
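      The original question asks how often one particular word appears, and the same word-boundary idea can answer that directly. The following is only a minimal sketch: `count_word` is a hypothetical helper name, not something posted in this thread, and it assumes the word being searched for begins and ends with a word character so that `\b` can anchor on both sides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count case-insensitive,
# whole-word occurrences of $word in $text.
sub count_word {
    my ($text, $word) = @_;

    # \Q...\E quotes any regex metacharacters in $word;
    # \b on each side keeps "the" from matching inside "other" or "theme".
    # Assigning the list of matches to () and then to a scalar yields the count.
    my $count = () = $text =~ /\b\Q$word\E\b/gi;
    return $count;
}

my $text = 'The,quick,brown;foxy. Does a lot,of:jumping! The end.';
print count_word($text, 'the'), "\n";
```

      Matching in list context with `/g` avoids building a full frequency table when only one word is of interest.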
Re: Count number of words on a web page?
by zentara (Archbishop) on Feb 09, 2010 at 17:57 UTC
    I would add that you might want to extract the plain text from the HTML page before counting words; otherwise your counts will include HTML artifacts such as angle brackets and slashes. See perldoc -q 'remove html'.

    Other ideas:

    #!/usr/bin/perl

    # You could use HTML::TokeParser::Simple and only print text tokens.
    # Almost straight from the TokeParser::Simple POD.
    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new( $somefile );
    while ( my $token = $p->get_token ) {
        print $token->as_is if $token->is_text;
    }

    ###################################################################
    # HTML::Strip - Perl extension for stripping HTML markup from text.
    use HTML::Strip;
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $raw_html );
    $hs->eof;

    ###################################################################
    use HTML::PullParser;
    sub strip {
        my $html = shift;
        my $p = HTML::PullParser->new(
            doc  => $html,
            text => 'text',
        );
        my $result = '';
        while ( my $t = $p->get_token ) {
            $result .= $t->[0];
        }
        return $result;
    }

    ##############################################################
    # If you just need to strip all the html tags from a page,
    # and are on a platform with lynx, you can use:
    #!/usr/bin/perl
    use strict;
    use warnings;
    my $text = `lynx -dump htmlDocument.html`;
    print "$text";

    ##################################################################
    # or, from the shell:
    # lynx -dump htmlDocument.html > htmlDocument.txt
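    To tie the text extraction back to the counting, here is a rough sketch that feeds only the text tokens from HTML::TokeParser::Simple into a word tally. The name `html_word_counts` and the sample HTML string are made up for illustration, and it assumes HTML::TokeParser::Simple is installed. Entities are not decoded, and the contents of script or style elements are not filtered out, so treat it as a starting point rather than a finished solution.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not from the thread): return a hash reference
# mapping each lowercased word to the number of times it appears in
# the text tokens of an HTML string.
sub html_word_counts {
    my ($html) = @_;

    my $p = HTML::TokeParser::Simple->new( string => $html );
    my %count;

    while ( my $token = $p->get_token ) {
        next unless $token->is_text;    # skip tags, comments, declarations

        # Pull runs of word characters out of the raw text and tally them.
        $count{ lc $_ }++ for $token->as_is =~ /\w+/g;
    }
    return \%count;
}

my $counts = html_word_counts('<p>Perl is <b>fun</b>. perl!</p>');
printf "%8d\t%s\n", $counts->{$_}, $_ for sort keys %$counts;
```

    Because each text token is scanned on its own, a word interrupted by inline markup (for example `foo<b>bar</b>`) is counted as two words.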

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku
Re: Count number of words on a web page?
by planetscape (Canon) on Feb 09, 2010 at 18:29 UTC
