http://www.perlmonks.org?node_id=632574

johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to create a tag cloud from a collection of texts. That is, given a set of texts (e.g., sentences), I need to count phrase frequencies. It's pretty easy to count frequencies of individual words, but I'm wondering whether there is a good algorithm for counting phrases. Here's a naive way to count them:
my %words   = ();
my %phrases = ();

while (<DATA>) {
    chomp;
    my @words = split /\s+/;

    # count words
    ++$words{lc $_} foreach @words;

    # count phrases
    for (my $i = 0; $i < @words; ++$i) {
        for (my $j = $i; $j < @words; ++$j) {
            ++$phrases{lc join(" ", @words[$i..$j])};
        }
    }
}

print "Words:\n";
foreach my $wd (sort { $words{$b} <=> $words{$a} } keys %words) {
    print "$wd => $words{$wd}\n";
}

print "\n\nPhrases:\n";
foreach my $p (sort { $phrases{$b} <=> $phrases{$a} } keys %phrases) {
    print "$p => $phrases{$p}\n";
}

__DATA__
Mary had a little lamb little lamb
John had a lamb
Mary and John both had a lamb
Mary and John had two little lambs
For my particular case, each sentence is about 50 words long, and there can be up to a few thousand such sentences. Related to this, is there a good way to rule out the common words (e.g., "a", "the", etc.)?
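For scale: the nested loop above counts every contiguous sub-phrase, so a 50-word sentence yields 50*51/2 = 1275 phrases. One possible way to keep the table smaller is to cap the phrase length. The following is only a sketch of that idea; the sub name `count_ngrams` and the cap of 3 words are assumptions chosen for illustration, not anything from the post. With a cap of 3, a 50-word sentence produces at most 50 + 49 + 48 = 147 phrases.

```perl
use strict;
use warnings;

# Count every phrase of 1..$max_len consecutive words in each sentence.
# count_ngrams and $max_len are hypothetical names for this sketch.
sub count_ngrams {
    my ($sentences, $max_len) = @_;
    my %count;
    for my $sentence (@$sentences) {
        my @words = map { lc $_ } split ' ', $sentence;
        for my $i (0 .. $#words) {
            # Last index of the longest phrase starting at $i,
            # clamped to the end of the sentence.
            my $last = $i + $max_len - 1;
            $last = $#words if $last > $#words;
            for my $j ($i .. $last) {
                $count{ join ' ', @words[$i .. $j] }++;
            }
        }
    }
    return \%count;
}

my @sentences = (
    'John had a lamb',
    'Mary and John both had a lamb',
);
my $count = count_ngrams(\@sentences, 3);

# Most frequent first; ties broken alphabetically for stable output.
for my $phrase (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    print "$phrase => $count->{$phrase}\n";
}
```

Whether 3 is a sensible cap depends on the tag cloud; longer phrases rarely repeat, so a small cap probably loses little.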

Replies are listed 'Best First'.
Re: TagCloud and phrase frequency
by Fletch (Bishop) on Aug 14, 2007 at 21:41 UTC

    As for your second query, what you want to remove are known as "stop words"; use that phrase as googlefodder, and see Lingua::StopWords.
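    A minimal sketch of stop-word filtering applied to phrases. The tiny stop list, the sub name `keep_phrase`, and the sample counts are all made up for illustration; in practice the hash could be replaced by the list that Lingua::StopWords provides. The rule used here (drop a phrase if it starts or ends with a stop word) is just one possible choice, so that "little lamb" survives while "a lamb" and "had a" do not.

```perl
use strict;
use warnings;

# Tiny illustrative stop list (an assumption for this sketch);
# a real list would be much longer.
my %stop = map { $_ => 1 } qw(a an and the of to in had both);

# Keep a phrase only if neither its first nor its last word is a stop word.
# Stop words in the middle of a phrase are allowed.
sub keep_phrase {
    my ($phrase) = @_;
    my @w = split ' ', lc $phrase;
    return 0 unless @w;
    return ( !$stop{ $w[0] } && !$stop{ $w[-1] } ) ? 1 : 0;
}

# Hypothetical phrase counts, standing in for a %phrases hash
# built by a counting loop.
my %phrases = (
    'a lamb'        => 4,
    'little lamb'   => 2,
    'had a'         => 4,
    'mary and john' => 2,
    'the'           => 1,
);

my %kept = map { $_ => $phrases{$_} } grep { keep_phrase($_) } keys %phrases;
print "$_ => $kept{$_}\n" for sort keys %kept;
```

    Filtering after counting keeps the counting loop untouched; filtering the words before building phrases would instead glue together words that were never adjacent.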

Re: TagCloud and phrase frequency
by BrowserUk (Patriarch) on Aug 15, 2007 at 11:03 UTC