PerlMonks  

How do I count the frequency of words in a file and save them for later?

by ghenry (Vicar)
on May 17, 2005 at 13:11 UTC ( #457784=perlquestion )

ghenry has asked for the wisdom of the Perl Monks concerning the following question:

You want to find the unique words in a file, count them, and print a summary, while also saving the results for later use.

Using a simple database (a dbm file), you could do the following (adapted from the Perl Cookbook, 2nd ed.):

#!/usr/bin/perl
use strict;
use warnings;

# Basename of Simple Database file
my $DBFILE = 'wordcount_db';

# open database, accessed through %WORDS
dbmopen (my %WORDS, $DBFILE, 0666)
    or die "Can't open $DBFILE: $!\n";

# Make a word frequency counter
while (<>) {
    while ( /(\w['\w-]*)/g ) {
        $WORDS{lc $1}++;
    }
}

# Output hash in a descending numeric sort of its values
foreach my $word ( sort { $WORDS{$b} <=> $WORDS{$a} } keys %WORDS) {
    printf "%5d %s\n", $WORDS{$word}, $word;
}

# Close the database
dbmclose %WORDS;
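
To use the counts later, a separate script can reopen the same dbm file and read from the hash bound to it. What follows is only a minimal sketch, not part of the original post: the `lookup_counts` helper name is hypothetical, and it assumes the database was written by the script above, which stores lowercased keys. Because the dbm file persists between runs, rerunning the counting script adds to the existing totals rather than starting from zero.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: look up saved counts for a list of words.
# Assumes $dbfile is the same basename the counting script used.
# Returns a hash ref of lowercased word => count (0 if never seen).
sub lookup_counts {
    my ($dbfile, @words) = @_;

    my %WORDS;
    dbmopen(%WORDS, $dbfile, 0644)
        or die "Can't open $dbfile: $!\n";

    my %found;
    for my $word (@words) {
        my $key = lc $word;    # keys were stored lowercased
        $found{$key} = exists $WORDS{$key} ? $WORDS{$key} : 0;
    }

    dbmclose(%WORDS);
    return \%found;
}

# Example call (uncomment once wordcount_db exists):
# my $counts = lookup_counts('wordcount_db', 'perl', 'Monk');
# printf "%5d %s\n", $counts->{$_}, $_ for sort keys %$counts;
```

Note that `dbmopen` with a non-undef mode creates the database if it does not exist yet, so a mistyped basename produces an empty database and all-zero counts rather than an error.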

TIMTOWTDI however.

Originally posted as a Categorized Question.


Replies are listed 'Best First'.
Re: How do I count the frequency of words in a file and save them for later?
by whakka (Hermit) on Feb 04, 2009 at 17:43 UTC
    Using a one-liner (and same formatting as previous posts):
    $ perl -nle '$w{$_}++ for grep /\w/, map { s/[\. ,]*$//g; lc($_) } split; sub END { printf("%7d\t%s\n", $c, $w) while (($w,$c) = each(%w)) }' files...
Re: How do I count the frequency of words in a file and save them for later?
by planetscape (Chancellor) on May 26, 2005 at 01:31 UTC

    I usually use this script and pipe the results to a text file:

    #!/usr/local/bin/perl
    # $Id: wordfreq.perl,v 1.13 2001/05/16 23:46:40 doug Exp $
    # http://www.bagley.org/~doug/shootout/ <= old URL; dead now
    # http://dada.perl.it/shootout/wordfreq.perl.html <= URL as of time this post was written

    # Tony Bowden suggested using tr versus lc and split(/[^a-z]/)

    use strict;

    my %count = ();
    while (read(STDIN, $_, 4095) and $_ .= <STDIN>) {
        tr/A-Za-z/ /cs;
        ++$count{$_} foreach split(' ', lc $_);
    }

    my @lines = ();
    my ($w, $c);
    push(@lines, sprintf("%7d\t%s\n", $c, $w)) while (($w, $c) = each(%count));

    print sort { $b cmp $a } @lines;

    planetscape

      There have been - and probably will be - quite a few posts regarding word counts. This solution doesn't work. To give just one example, "can't" ends up as 1 "can" and 1 "t". Other solutions often have it as "cant", but what is really needed is testing to see if the apostrophe has at least one letter on each side. Also, what about end of line word splits? A word like:

      google-
      plex

      should be converted to googleplex before counting. I imagine there are one or two other things to program in as well.

      I'm not saying this is necessarily a bad place to start, but you need to program in some modifications. Better get cracking.
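
The two modifications named above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a complete tokenizer: `count_words` is a hypothetical helper, it handles ASCII letters only, and it assumes that any hyphen directly before a line break is a word split. That last assumption wrongly joins genuine compounds such as "well-" / "known" that happen to break at the hyphen.

```perl
use strict;
use warnings;

# Hypothetical helper: count words in a string.
# - An apostrophe is kept only when it has a letter on each side,
#   so "can't" stays one word and surrounding quotes are dropped.
# - A word hyphenated across a line break is rejoined before counting
#   (assumption: every end-of-line hyphen is a word split).
sub count_words {
    my ($text) = @_;
    my %count;

    # Rejoin "google-\nplex" into "googleplex": letter, hyphen,
    # line break, optional indentation, letter.
    $text =~ s/(?<=[A-Za-z])-\r?\n\s*(?=[A-Za-z])//g;

    # A word is a run of letters, optionally followed by one or more
    # apostrophe-plus-letters groups.
    while ($text =~ /([A-Za-z]+(?:'[A-Za-z]+)*)/g) {
        $count{lc $1}++;
    }

    return \%count;
}
```

Since the rejoining step has to see the line break, the text must be read in as a whole (or at least in multi-line chunks) rather than one line at a time.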

Re: How do I count the frequency of words in a file and save them for later?
by rcaputo (Chaplain) on Feb 05, 2009 at 03:12 UTC

    The standard UNIX tool chain works fine:

    perl -nle "print for /(\w['\w-]*)/g" input.text | sort | uniq -c | sort -rn | tee word-list.text
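
The `tee` at the end saves the list to word-list.text, so a later program can read it back. Here is a minimal sketch, assuming each line has the usual `uniq -c` layout (optional leading spaces, a count, whitespace, then the word); `parse_word_list` is a hypothetical helper name, not something from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn lines in "uniq -c" layout into a hash ref
# of word => count. Lines that do not match the layout are skipped.
sub parse_word_list {
    my (@lines) = @_;
    my %count;

    for my $line (@lines) {
        next unless $line =~ /^\s*(\d+)\s+(\S+)\s*$/;
        $count{$2} += $1;
    }

    return \%count;
}

# Example use (uncomment once word-list.text exists):
# open my $fh, '<', 'word-list.text' or die "Can't open word-list.text: $!\n";
# my $counts = parse_word_list(<$fh>);
# close $fh;
```

The pipeline above does not lowercase its input, so "The" and "the" come back as separate keys.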
Re: How do I count the frequency of words in a file and save them for later?
by amitbhosale (Acolyte) on Feb 13, 2008 at 09:40 UTC
    In my example, the output is a Perl hash(ref) structure. This lets you load it easily in another program using do.
    my %seen=();
    while(<>) {
        chomp;
        foreach my $word ( grep /\w/, split ) {
            $word =~ s/[. ,]*$//;  # strip off punctuation, etc.
            $seen{$word}++;
        }
    }
    use Data::Dumper;
    $Data::Dumper::Terse = 1;
    print Dumper \%seen;
    For example, given an input file containing:
    Click on a letter above to see phrasal verbs beginning with that letter.
    You will get a list of phrasal verbs and their definitions. If you then
    click on an individual verb, you will get more information on it,
    including example sentences, whether it is British or American English,
    and whether it is separable or not.
    Output looks like:
    {
      'you' => 2,
      'a' => 2,
      'not' => 1,
      'that' => 1,
      'sentences' => 1,
      'individual' => 1,
      'see' => 1,
      'on' => 3,
      'American' => 1,
      'or' => 2,
      'verb' => 1,
      'Click' => 1,
      'list' => 1,
      'English' => 1,
      'letter' => 2,
      'their' => 1,
      'whether' => 2,
      'with' => 1,
      'and' => 2,
      'verbs' => 2,
      'of' => 1,
      'is' => 2,
      'definitions' => 1,
      'to' => 1,
      'above' => 1,
      'will' => 2,
      'If' => 1,
      'get' => 2,
      'including' => 1,
      'beginning' => 1,
      'it' => 3,
      'example' => 1,
      'information' => 1,
      'separable' => 1,
      'British' => 1,
      'click' => 1,
      'phrasal' => 2,
      'then' => 1,
      'You' => 1,
      'more' => 1,
      'an' => 1
    }
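
Loading that output back with `do` might look like the sketch below. It assumes the Dumper output was redirected to a file; `load_counts` is a hypothetical helper name. Two caveats: `do FILE` executes the file as Perl code, so only load files you trust, and on newer Perls the current directory is no longer in @INC, so a bare relative filename may not be found. The sketch sidesteps the second point by converting the path to an absolute one.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: load a hash ref that was written to $file by
# Data::Dumper with $Data::Dumper::Terse = 1.
# Caution: do FILE runs the file as Perl code; use trusted files only.
sub load_counts {
    my ($file) = @_;

    # An absolute path keeps do from searching @INC for the file.
    my $path = File::Spec->rel2abs($file);

    my $counts = do $path;
    die "Couldn't parse $path: $@\n" if $@;
    die "Couldn't read $path: $!\n"  unless defined $counts;
    die "$path did not return a hash ref\n" unless ref $counts eq 'HASH';

    return $counts;
}

# Example use, assuming the output above was saved as seen.pl:
# my $seen = load_counts('seen.pl');
# print "$_ => $seen->{$_}\n" for sort keys %$seen;
```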
