TagCloud and phrase frequency

by johnnywang (Priest)
on Aug 14, 2007 at 19:10 UTC (#632574)
johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to create a tag cloud from a collection of texts. That is, given a set of texts (e.g., sentences), I need to count the frequencies of phrases. It's pretty easy to count frequencies of individual words, but I'm wondering whether there is a good algorithm for counting phrases. Here's a naive way to count them:
my %words = ();
my %phrases = ();
while(<DATA>){
    chomp;
    my @words = split/\s+/;
    # count words
    ++$words{lc $_} foreach @words;
    # count phrases
    for(my $i = 0; $i < @words; ++$i){
        for(my $j = $i; $j < @words; ++$j){
            ++$phrases{lc join(" ", @words[$i..$j])};
        }
    }
}
print "Words:\n";
foreach my $wd (sort {$words{$b}<=>$words{$a}}keys %words) {
    print "$wd => $words{$wd}\n";
}
print "\n\nPhrases:\n";
foreach my $p (sort {$phrases{$b}<=>$phrases{$a}}keys %phrases) {
    print "$p => $phrases{$p}\n";
}
__DATA__
Mary had a little lamb little lamb
John had a lamb
Mary and John both had a lamb
Mary and John had two little lambs
For my particular case, each sentence is about 50 words long, and there can be up to a few thousand such sentences. Related to this, is there a good way to rule out the common words (e.g., "a", "the", etc.)?
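The naive loop above counts every contiguous run of words, which for a 50-word sentence is 50 * 51 / 2 = 1275 phrases. One possible way to keep that manageable, sketched here on the assumption that phrases longer than a few words are not useful in a tag cloud, is to cap the phrase length. The helper name `count_ngrams` and the cap of 3 are illustrative choices, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: count every phrase of 1..$max_len consecutive words.
# Phrases never cross sentence boundaries.
sub count_ngrams {
    my ($sentences, $max_len) = @_;
    my %count;
    for my $sentence (@$sentences) {
        # split ' ' skips leading whitespace and collapses runs of spaces
        my @w = split ' ', lc $sentence;
        for my $i (0 .. $#w) {
            # last index of the longest phrase starting at $i
            my $last = $i + $max_len - 1;
            $last = $#w if $last > $#w;
            for my $j ($i .. $last) {
                ++$count{ join ' ', @w[$i .. $j] };
            }
        }
    }
    return \%count;
}

my $counts = count_ngrams(
    [ 'Mary had a little lamb', 'John had a lamb' ],
    3,    # assumed cap on phrase length
);

# most frequent first, ties broken alphabetically
for my $p (sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts) {
    print "$p => $counts->{$p}\n";
}
```

With a cap of k words, each sentence of n words contributes at most n * k phrases instead of n * (n + 1) / 2.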

Replies are listed 'Best First'.
Re: TagCloud and phrase frequency
by Fletch (Chancellor) on Aug 14, 2007 at 21:41 UTC

    As for your second query, what you want to remove are known as "stop words"; use that phrase as googlefodder, and see Lingua::StopWords.
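    A minimal sketch of stop-word filtering, for illustration only. The stop list below is a small hand-written assumption, and `content_words` is a hypothetical helper name. Lingua::StopWords exports `getStopWords`, and `getStopWords('en')` returns a hash reference keyed by word that could take the place of `%stop`.

```perl
use strict;
use warnings;

# Tiny hand-written stop list (an assumption for this example, not the
# list shipped with Lingua::StopWords).
my %stop = map { $_ => 1 } qw(a an and the of had both);

# Hypothetical helper: lowercase the sentence, split on whitespace,
# and drop anything found in the stop list.
sub content_words {
    my ($sentence) = @_;
    return grep { !$stop{$_} } split ' ', lc $sentence;
}

my %freq;
for my $sentence ('Mary had a little lamb', 'Mary and John both had a lamb') {
    ++$freq{$_} for content_words($sentence);
}

# most frequent first, ties broken alphabetically
for my $w (sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq) {
    print "$w => $freq{$w}\n";
}
```

    Filtering before counting keeps the hashes small; whether stop words should also be removed from inside phrases depends on how the tag cloud is meant to read.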

Re: TagCloud and phrase frequency
by BrowserUk (Pope) on Aug 15, 2007 at 11:03 UTC
Re: TagCloud and phrase frequency
by Anonymous Monk on Aug 15, 2007 at 07:49 UTC
    Hello. Our software Textanz counts frequencies of everything: words, phrases, and word forms in text. If interested, visit the Textanz page. Regards, Alexander Potyomkin, Cro-Code
