Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

TagCloud and phrase frequency

by johnnywang (Priest)
on Aug 14, 2007 at 19:10 UTC ( #632574=perlquestion: print w/ replies, xml ) Need Help??
johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to create a tagcloud from the collection of texts. That is, given a set of text (e.g., sentences), I'll need to count frequencies for the phrases. It's pretty easy to count frequencies of individual words, but I'm wondering whether there is a good algorithm for counting phrases. Here's a naiive way to count them:
my %words = (); my %phrases = (); while(<DATA>){ chomp; my @words = split/\s+/; #count words ++$words{lc $_} foreach @words; # count phrases for(my $i = 0; $i < @words; ++$i){ for(my $j = $i; $j < @words; ++$j){ ++$phrases{lc join(" ", @words[$i..$j])}; } } } print "Words:\n"; foreach my $wd (sort {$words{$b}<=>$words{$a}}keys %words) { print "$wd => $words{$wd}\n"; } print "\n\nPhrases:\n"; foreach my $p (sort {$phrases{$b}<=>$phrases{$a}}keys %phrases) { print "$p => $phrases{$p}\n"; } __DATA__ Mary had a little lamb little lamb John had a lamb Mary and John both had a lamb Mary and John had two little lambs
For my particular case, each sentence is about 50 words long, and there can be up to a few thousand such sentences. Related to this, is there a good way to rule out the common words (e.g., "a", "the", etc.)?

Comment on TagCloud and phrase frequency
Download Code
Re: TagCloud and phrase frequency
by Fletch (Chancellor) on Aug 14, 2007 at 21:41 UTC

    As for your second query, what you want to remove are known as "stop words"; use that phrase as googlefodder, and see Lingua::StopWords.

Re: TagCloud and phrase frequency
by Anonymous Monk on Aug 15, 2007 at 07:49 UTC
    Hello Our software Textanz counts frequencies of everything : words, phrases and wordforms in text. If interested, visit Textanz page : http://www.cro-code.com/textanz.jsp Regards, Alexander Potyomkin Cro-Code
Re: TagCloud and phrase frequency
by BrowserUk (Pope) on Aug 15, 2007 at 11:03 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://632574]
Approved by grep
Front-paged by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (13)
As of 2014-11-27 11:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My preferred Perl binaries come from:














    Results (184 votes), past polls