TagCloud and phrase frequency

by johnnywang (Priest)
on Aug 14, 2007 at 19:10 UTC (#632574=perlquestion)

johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to create a tag cloud from a collection of texts. That is, given a set of texts (e.g., sentences), I need to count the frequencies of the phrases in them. It's pretty easy to count the frequencies of individual words, but I'm wondering whether there is a good algorithm for counting phrases. Here's a naive way to count them:

my %words   = ();
my %phrases = ();
while (<DATA>) {
    chomp;
    my @words = split /\s+/;
    # count words
    ++$words{lc $_} foreach @words;
    # count phrases
    for (my $i = 0; $i < @words; ++$i) {
        for (my $j = $i; $j < @words; ++$j) {
            ++$phrases{lc join(" ", @words[$i..$j])};
        }
    }
}
print "Words:\n";
foreach my $wd (sort { $words{$b} <=> $words{$a} } keys %words) {
    print "$wd => $words{$wd}\n";
}
print "\n\nPhrases:\n";
foreach my $p (sort { $phrases{$b} <=> $phrases{$a} } keys %phrases) {
    print "$p => $phrases{$p}\n";
}
__DATA__
Mary had a little lamb little lamb
John had a lamb
Mary and John both had a lamb
Mary and John had two little lambs

For my particular case, each sentence is about 50 words long, and there can be up to a few thousand such sentences. Related to this, is there a good way to rule out the common words (e.g., "a", "the", etc.)?
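
The naive loop above counts every contiguous sub-phrase, so a 50-word sentence contributes 50*51/2 = 1275 phrases. One possible way to keep this manageable is to cap the phrase length. The following is only a sketch of that idea, not the approach from the original post: the sub name count_phrases and the $max_len parameter are assumptions introduced for illustration.

```perl
use strict;
use warnings;

# Count every phrase of 1..$max_len consecutive words in each sentence.
# count_phrases and $max_len are hypothetical names, not from the post.
sub count_phrases {
    my ($sentences, $max_len) = @_;
    my %phrases;
    for my $sentence (@$sentences) {
        # split ' ' skips leading whitespace, so no empty first field
        my @words = map { lc } split ' ', $sentence;
        for my $i (0 .. $#words) {
            # last index of the longest phrase allowed to start at $i
            my $last = $i + $max_len - 1;
            $last = $#words if $last > $#words;
            for my $j ($i .. $last) {
                ++$phrases{ join ' ', @words[$i .. $j] };
            }
        }
    }
    return \%phrases;
}

my $counts = count_phrases(
    [ 'John had a lamb', 'Mary and John both had a lamb' ],
    3,
);
# Most frequent first; ties broken alphabetically for stable output.
for my $p (sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts) {
    print "$p => $counts->{$p}\n";
}
```

With a cap of three words, a 50-word sentence yields 50 + 49 + 48 = 147 phrases instead of 1275. Whether three words is a sensible limit for a tag cloud is a judgment call that depends on the texts.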

Replies are listed 'Best First'.
Re: TagCloud and phrase frequency
by Fletch (Bishop) on Aug 14, 2007 at 21:41 UTC

    As for your second query, what you want to remove are known as "stop words"; use that phrase as googlefodder, and see Lingua::StopWords.
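
As a rough illustration of stop-word filtering, the sketch below drops common words before counting. The word list and the sub name content_words are hypothetical and deliberately tiny. With Lingua::StopWords installed, the hash reference returned by getStopWords('en') could presumably take the place of the hand-written list.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny stop list for illustration only.
# Lingua::StopWords' getStopWords('en') returns a hash reference that
# could be used instead of %stop.
my %stop = map { $_ => 1 } qw(a an and the of to in had);

# Return the lowercased words of $text that are not stop words.
sub content_words {
    my ($text) = @_;
    return grep { !$stop{$_} } map { lc } split ' ', $text;
}

my %freq;
++$freq{$_} for content_words('Mary had a little lamb');
print "$_ => $freq{$_}\n" for sort keys %freq;
```

This filters single words. For phrases, removing stop words first would join words that were not adjacent in the text, so it may be better to count phrases as usual and then discard those that begin or end with a stop word.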

Re: TagCloud and phrase frequency
by BrowserUk (Patriarch) on Aug 15, 2007 at 11:03 UTC
Node Type: perlquestion [id://632574]
Approved by grep
Front-paged by Old_Gray_Bear