PerlMonks  

Count number of words on a web page?

by jdlev (Scribe)
on Feb 09, 2010 at 13:33 UTC ( [id://822189]=perlquestion )

jdlev has asked for the wisdom of the Perl Monks concerning the following question:

Is there a simple way to count the number of times a word appears on a web page?
I love it when a program comes together - jdhannibal

Replies are listed 'Best First'.
Re: Count number of words on a web page?
by JSchmitz (Canon) on Feb 09, 2010 at 13:39 UTC
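    #!/usr/bin/perl
    use strict;
    use warnings;

    $/ = "";    # paragraph mode: read one blank-line-separated chunk at a time
    my %wordcount;
    while (<>) {
        #s/-\n//g;
        tr/A-Z/a-z/;                           # fold to lower case
        my @words = split(/\W*\s+\W*/, $_);    # split into words
        foreach my $word (@words) {
            next unless length $word;          # skip an empty leading field
            $wordcount{$word}++;               # count the words
        }
    }
    foreach my $word (sort keys(%wordcount)) {
        printf "%8d\t\t%s\n", $wordcount{$word}, $word;
    }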

    Hope that helps

    Jeffery
      Use word boundaries (\b).
      $_ = 'The,quick,brown;foxy. Does a lot,of:jumping!';
      my @words = split(/\W*\s+\W*/, $_);    # split into words
      print 'Number of words: ' . @words . "\n";
      gives 4 words. However
      my @words = split(/\b\W*/, $_); # split into words
      gives 9 words.
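
For the original question (how often one given word appears), the same word-boundary idea can be applied directly with a global match. This is only a minimal sketch, assuming the page text is already in a string; `count_word` is a hypothetical helper name, not something from the posts above.

```perl
use strict;
use warnings;

# Hypothetical helper: count whole-word, case-insensitive occurrences
# of $word in $text. \b keeps 'the' from matching inside 'other';
# \Q...\E quotes any regex metacharacters in the word.
sub count_word {
    my ($text, $word) = @_;
    # The empty list forces list context on the match; assigning that
    # to a scalar yields the number of matches.
    my $count = () = $text =~ /\b\Q$word\E\b/gi;
    return $count;
}

my $text = 'The cat sat on the mat. The other cat left.';
print count_word($text, 'the'), "\n";    # prints 3
```

The `() = ...` idiom avoids building a named array just to count its elements.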
      Thanks!
      I love it when a program comes together - jdhannibal
Re: Count number of words on a web page?
by zentara (Archbishop) on Feb 09, 2010 at 17:57 UTC
    I would add that you might want to extract the plain text from the HTML page before counting words; otherwise you will get HTML artifacts in your count, like brackets and slashes. Read perldoc -q 'remove html'.

    Other ideas:

    #!/usr/bin/perl
    #You could use HTML::TokeParser::Simple and only print text tags.
    #almost straight from the TokeParser::Simple POD
    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new( $somefile );
    while ( my $token = $p->get_token ) {
        print $token->as_is if $token->is_text;
    }

    ###################################################################
    #HTML::Strip - Perl extension for stripping HTML markup from text.
    use HTML::Strip;
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $raw_html );
    $hs->eof;

    ###################################################################
    use HTML::PullParser;
    sub strip {
        my $html = shift;
        my $p = HTML::PullParser->new(
            doc  => $html,
            text => 'text',
        );
        my $result = '';
        while ( my $t = $p->get_token ) {
            $result .= $t->[0];
        }
        return $result;
    }

    ##############################################################
    #If you just need to strip all the html tags from a page,
    #and are on a platform with lynx, you can use:
    #! /usr/bin/perl
    use strict;
    use warnings;
    my $text = `lynx -dump htmlDocument.html`;
    print "$text";

    ##################################################################
    #or, from the shell:
    # lynx -dump htmlDocument.html > htmlDocument.txt
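
Once the markup is gone, the stripped text can go straight into a counter. What follows is a minimal sketch, assuming the text has already been extracted by one of the approaches above (for example `lynx -dump`). `word_frequencies` is a hypothetical helper name, and the sample string only stands in for real page text.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash ref mapping each lower-cased word
# to the number of times it occurs in $text.
sub word_frequencies {
    my ($text) = @_;
    my %count;
    # \w+ picks out runs of word characters (letters, digits, underscore),
    # so punctuation between words is skipped.
    $count{ lc $1 }++ while $text =~ /(\w+)/g;
    return \%count;
}

# Stand-in for text produced by e.g. `lynx -dump htmlDocument.html`.
my $text = "Perl is fun. Perl is terse; perl is everywhere.";
my $freq = word_frequencies($text);

# Most frequent first, ties broken alphabetically.
for my $word (sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b } keys %$freq) {
    printf "%8d\t%s\n", $freq->{$word}, $word;
}
```

Matching the words with `\w+` rather than splitting on separators sidesteps the empty leading field that `split` can return when the text starts with whitespace or punctuation.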

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku
Re: Count number of words on a web page?
by planetscape (Chancellor) on Feb 09, 2010 at 18:29 UTC
