TagCloud and phrase frequency

johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to create a tag cloud from a collection of texts. That is, given a set of texts (e.g., sentences), I need to count the frequencies of the phrases in them. It's pretty easy to count the frequencies of individual words, but I'm wondering whether there is a good algorithm for counting phrases. Here's a naive way to count them:

use strict;
use warnings;

my %words = ();
my %phrases = ();

while(<DATA>){
    chomp;
    my @words = split ' ';   # like split /\s+/, but skips leading whitespace

    # count words
    ++$words{lc $_} foreach @words;

    # count phrases
    for(my $i = 0; $i < @words; ++$i){
        for(my $j = $i; $j < @words; ++$j){
           ++$phrases{lc join(" ", @words[$i..$j])};
        }
    }
}
print "Words:\n";
foreach my $wd (sort {$words{$b}<=>$words{$a}}keys %words) {
    print "$wd => $words{$wd}\n";
}
print "\n\nPhrases:\n";
foreach my $p (sort {$phrases{$b}<=>$phrases{$a}}keys %phrases) {
    print "$p => $phrases{$p}\n";
}

__DATA__
Mary had a little lamb
little lamb
John had a lamb
Mary and John both had a lamb
Mary and John had two little lambs

For my particular case, each sentence is about 50 words long, and there can be up to a few thousand such sentences. Related to this, is there a good way to rule out the common words (e.g., "a", "the", etc.)?
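To put a number on the naive approach: a sentence of n words has n(n+1)/2 contiguous phrases, so a 50-word sentence yields 1275 of them, most of which occur only once. One possible way to keep this manageable is to cap the phrase length. The following is a minimal sketch of that idea, not a definitive solution; the sub name count_phrases and the $max_len parameter are made up for illustration and do not appear in the original post.

```perl
use strict;
use warnings;

# Count every contiguous phrase of 1..$max_len words in each sentence.
# $max_len is an assumed cap; the original code has no such limit.
sub count_phrases {
    my ($sentences, $max_len) = @_;
    my %phrases;
    for my $sentence (@$sentences) {
        my @words = split ' ', lc $sentence;
        for my $i (0 .. $#words) {
            # Last index for a phrase starting at $i, clipped to the sentence end.
            my $last = $i + $max_len - 1;
            $last = $#words if $last > $#words;
            for my $j ($i .. $last) {
                ++$phrases{ join ' ', @words[$i .. $j] };
            }
        }
    }
    return \%phrases;
}

my @sentences = (
    'Mary had a little lamb',
    'little lamb',
    'John had a lamb',
);
my $counts = count_phrases(\@sentences, 3);

# Most frequent first; ties broken alphabetically for stable output.
for my $p (sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts) {
    print "$p => $counts->{$p}\n";
}
```

With a cap of k words, each sentence contributes at most n*k phrases instead of n(n+1)/2, so the hash grows roughly linearly with the amount of text.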

Re: TagCloud and phrase frequency by Fletch (Bishop) on Aug 14, 2007 at 21:41 UTC
As for your second query, what you want to remove are known as "stop words"; use that phrase as googlefodder, and see Lingua::StopWords.
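As a rough illustration of stop-word filtering for phrases, here is a sketch that keeps a phrase only when neither its first nor its last word is a stop word. That policy is an assumption chosen for the example, not something stated in the thread, and the tiny %stop list and the keep_phrase sub are hypothetical. In practice the list could come from a module such as Lingua::StopWords, whose getStopWords('en') should, as far as I know, return a hash reference of stop words.

```perl
use strict;
use warnings;

# Tiny hand-written stop list, for illustration only.
my %stop = map { $_ => 1 } qw(a an and the of to in);

# Keep a phrase only if neither its first nor its last word is a stop word.
# Stop words in the middle ("mary and john") are allowed.
sub keep_phrase {
    my ($phrase) = @_;
    my @w = split ' ', lc $phrase;
    return 0 unless @w;
    return ( !$stop{ $w[0] } && !$stop{ $w[-1] } ) ? 1 : 0;
}

# Example counts, made up for the demonstration.
my %phrases = (
    'little lamb'   => 2,
    'had a'         => 4,
    'a'             => 4,
    'mary and john' => 2,
    'lamb'          => 3,
);

for my $p (sort grep { keep_phrase($_) } keys %phrases) {
    print "$p => $phrases{$p}\n";
}
```

Filtering at the edges rather than removing stop words everywhere keeps phrases readable in the cloud while still dropping fragments such as "had a".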
Re: TagCloud and phrase frequency by BrowserUk (Patriarch) on Aug 15, 2007 at 11:03 UTC
There might be something in Re^5: Comparing text documents and its thread that would lend itself to this purpose.