PerlMonks  

TagCloud and phrase frequency

by johnnywang (Priest)
on Aug 14, 2007 at 19:10 UTC (#632574=perlquestion)
johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to create a tagcloud from a collection of texts. That is, given a set of texts (e.g., sentences), I need to count the frequencies of phrases. It's pretty easy to count frequencies of individual words, but I'm wondering whether there is a good algorithm for counting phrases. Here's a naive way to count them:
my %words = ();
my %phrases = ();
while(<DATA>){
    chomp;
    my @words = split/\s+/;
    #count words
    ++$words{lc $_} foreach @words;
    # count phrases
    for(my $i = 0; $i < @words; ++$i){
        for(my $j = $i; $j < @words; ++$j){
            ++$phrases{lc join(" ", @words[$i..$j])};
        }
    }
}
print "Words:\n";
foreach my $wd (sort {$words{$b}<=>$words{$a}}keys %words) {
    print "$wd => $words{$wd}\n";
}
print "\n\nPhrases:\n";
foreach my $p (sort {$phrases{$b}<=>$phrases{$a}}keys %phrases) {
    print "$p => $phrases{$p}\n";
}
__DATA__
Mary had a little lamb little lamb
John had a lamb
Mary and John both had a lamb
Mary and John had two little lambs
For my particular case, each sentence is about 50 words long, and there can be up to a few thousand such sentences. Related to this, is there a good way to rule out the common words (e.g., "a", "the", etc.)?
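The nested loop above counts every contiguous sub-phrase, which for a 50-word sentence is 1275 phrases per sentence. One way to keep the table smaller is to cap the phrase length. The following is only a sketch under that assumption: the sub name count_phrases and the $max_len parameter are hypothetical names introduced here, not part of the original code.

```perl
use strict;
use warnings;

# Count every phrase of 1..$max_len consecutive words in each sentence.
# count_phrases and $max_len are hypothetical names used for this sketch.
sub count_phrases {
    my ($max_len, @sentences) = @_;
    my %phrases;
    for my $sentence (@sentences) {
        # split ' ' ignores leading whitespace and splits on runs of whitespace
        my @words = map { lc $_ } split ' ', $sentence;
        for my $i (0 .. $#words) {
            # Stop at the length cap or the sentence end, whichever comes first
            my $last = $i + $max_len - 1;
            $last = $#words if $last > $#words;
            for my $j ($i .. $last) {
                ++$phrases{ join ' ', @words[$i .. $j] };
            }
        }
    }
    return \%phrases;
}

# Usage: phrases of up to 3 words, most frequent first (ties sorted by name)
my $counts = count_phrases(3, 'Mary had a little lamb', 'John had a lamb');
for my $p (sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts) {
    print "$p => $counts->{$p}\n";
}
```

With a cap of 3, a 50-word sentence contributes at most 147 phrases instead of 1275, so a few thousand sentences stay well within what a single hash can hold. Whether a cap of 3 is the right choice depends on how long the tag cloud's phrases are meant to be.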

Re: TagCloud and phrase frequency
by Fletch (Chancellor) on Aug 14, 2007 at 21:41 UTC

    As for your second query, what you want to remove are known as "stop words"; use that phrase as googlefodder, and see Lingua::StopWords.
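A minimal sketch of the stop-word idea, assuming a tiny hand-written list so that it runs without extra modules. The %stop hash and the sub name count_content_words are hypothetical names used only for illustration. Lingua::StopWords exports getStopWords, which returns a hash reference of stop words for a language code such as 'en', and that hash could stand in for the hand-written one.

```perl
use strict;
use warnings;

# Hand-written stop list for illustration only (an assumption, far from complete).
my %stop = map { $_ => 1 } qw(a an and the of to in);

# Count words in the given sentences, skipping anything found in %stop.
sub count_content_words {
    my (@sentences) = @_;
    my %words;
    for my $sentence (@sentences) {
        for my $word (map { lc $_ } split ' ', $sentence) {
            next if $stop{$word};    # drop common words before counting
            ++$words{$word};
        }
    }
    return \%words;
}

# Usage: most frequent first, ties sorted by name
my $counts = count_content_words('Mary had a little lamb', 'John had a lamb');
print "$_ => $counts->{$_}\n"
    for sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts;
```

The lookup is a single hash test per word, so filtering adds very little cost to the counting loop.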

Re: TagCloud and phrase frequency
by Anonymous Monk on Aug 15, 2007 at 07:49 UTC
    Hello. Our software, Textanz, counts frequencies of everything: words, phrases, and word forms in text. If interested, visit the Textanz page: http://www.cro-code.com/textanz.jsp Regards, Alexander Potyomkin, Cro-Code
Re: TagCloud and phrase frequency
by BrowserUk (Pope) on Aug 15, 2007 at 11:03 UTC

Node Type: perlquestion [id://632574]
Approved by grep
Front-paged by Old_Gray_Bear