PerlMonks  

How do I count the frequency of words in a file and save them for later?

(#457784: categorized question)
Contributed by ghenry on May 17, 2005 at 13:11 UTC


Description:

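You want to find the unique words in a file, count how often each one occurs, print a summary, and also save the results for later use.

Using a simple database (a dbm file), you could do the following (adapted from the Perl Cookbook, 2nd ed.). Because the dbm file persists between runs, the counts accumulate each time you run the script: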

#!/usr/bin/perl
use strict;
use warnings;

# Basename of Simple Database file
my $DBFILE = 'wordcount_db';

# open database, accessed through %WORDS
dbmopen (my %WORDS, $DBFILE, 0666)
    or die "Can't open $DBFILE: $!\n";

# Make a word frequency counter
while (<>) {
    while ( /(\w['\w-]*)/g ) {
        $WORDS{lc $1}++;
    }
}

# Output hash in a descending numeric sort of its values
foreach my $word ( sort { $WORDS{$b} <=> $WORDS{$a} } keys %WORDS) {
    printf "%5d %s\n", $WORDS{$word}, $word;
}

# Close the database
dbmclose %WORDS;
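The "for later" half of the question is reading the dbm file back in another program. Here is a minimal sketch of that, assuming the database was written by a script like the one above. The sub name top_words is my own invention, not something from the Cookbook, and the mode of 0 is there so that dbmopen fails rather than creating an empty database when the file is missing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the $n most frequent words stored in the
# dbm file $dbfile, as a list of [word, count] pairs.
sub top_words {
    my ($dbfile, $n) = @_;

    # A mode of 0 asks dbmopen not to create the database if it is missing
    my %words;
    dbmopen(%words, $dbfile, 0)
        or die "Can't open $dbfile: $!\n";

    # Copy out of the tied hash so the database can be closed before sorting
    my %copy = %words;
    dbmclose(%words);

    # Descending by count, ties broken alphabetically
    my @sorted = sort { $copy{$b} <=> $copy{$a} || $a cmp $b } keys %copy;
    $n = @sorted if $n > @sorted;
    return map { [ $_, $copy{$_} ] } @sorted[ 0 .. $n - 1 ];
}

# Usage: perl top_words.pl wordcount_db
if (@ARGV) {
    printf "%5d %s\n", $_->[1], $_->[0] for top_words($ARGV[0], 10);
}
```

Pass the same basename you used when counting (wordcount_db above), not the name of whatever files your dbm library actually created on disk.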

TIMTOWTDI however.

Answer: How do I count the frequency of words in a file and save them for later?
contributed by whakka

Using a one-liner (and same formatting as previous posts):

$ perl -nle '$w{$_}++ for grep /\w/, map { s/[\. ,]*$//g; lc($_) } split; sub END { printf("%7d\t%s\n", $c, $w) while (($w,$c) = each(%w)) }' files...

Answer: How do I count the frequency of words in a file and save them for later?
contributed by planetscape

I usually use this script and pipe the results to a text file:

#!/usr/local/bin/perl
# $Id: wordfreq.perl,v 1.13 2001/05/16 23:46:40 doug Exp $
# http://www.bagley.org/~doug/shootout/ <= old URL; dead now
# http://dada.perl.it/shootout/wordfreq.perl.html <= URL as of time this post was written

# Tony Bowden suggested using tr versus lc and split(/[^a-z]/)

use strict;

my %count = ();
while (read(STDIN, $_, 4095) and $_ .= <STDIN>) {
    tr/A-Za-z/ /cs;
    ++$count{$_} foreach split(' ', lc $_);
}

my @lines = ();
my ($w, $c);
push(@lines, sprintf("%7d\t%s\n", $c, $w)) while (($w, $c) = each(%count));
print sort { $b cmp $a } @lines;

planetscape

Answer: How do I count the frequency of words in a file and save them for later?
contributed by rcaputo

The standard UNIX tool chain works fine:

perl -nle "print for /(\w['\w-]*)/g" input.text | sort | uniq -c | sort -rn | tee word-list.text

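To use word-list.text later from Perl, you have to parse the "count word" lines that uniq -c writes. A rough sketch, assuming each line is optional leading whitespace, a number, whitespace, then the word (the sub name read_word_list is made up for this example):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse "uniq -c" style lines ("      3 word") from
# a filehandle and return a reference to a hash of word => count.
sub read_word_list {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        # leading whitespace, the count, whitespace, then the word
        next unless $line =~ /^\s*(\d+)\s+(\S+)/;
        $count{$2} += $1;
    }
    return \%count;
}

# Usage: perl read_word_list.pl word-list.text
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!\n";
    my $count = read_word_list($fh);
    close $fh;
    print "$_ appears $count->{$_} time(s)\n" for sort keys %$count;
}
```

The pipeline above does not lowercase anything, so "You" and "you" come back as separate keys.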
Answer: How do I count the frequency of words in a file and save them for later?
contributed by amitbhosale

In my example, the output is a Perl hash reference literal. This lets you load it easily in another program using do.

my %seen = ();
while (<>) {
    chomp;
    foreach my $word ( grep /\w/, split ) {
        $word =~ s/[. ,]*$//;    # strip off punctuation, etc.
        $seen{$word}++;
    }
}

use Data::Dumper;
$Data::Dumper::Terse = 1;
print Dumper \%seen;

For example, given an input file containing:

Click on a letter above to see phrasal verbs beginning with that letter.
You will get a list of phrasal verbs and their definitions. If you then
click on an individual verb, you will get more information on it,
including example sentences, whether it is British or American English,
and whether it is separable or not.

Output looks like:

{
  'you' => 2,
  'a' => 2,
  'not' => 1,
  'that' => 1,
  'sentences' => 1,
  'individual' => 1,
  'see' => 1,
  'on' => 3,
  'American' => 1,
  'or' => 2,
  'verb' => 1,
  'Click' => 1,
  'list' => 1,
  'English' => 1,
  'letter' => 2,
  'their' => 1,
  'whether' => 2,
  'with' => 1,
  'and' => 2,
  'verbs' => 2,
  'of' => 1,
  'is' => 2,
  'definitions' => 1,
  'to' => 1,
  'above' => 1,
  'will' => 2,
  'If' => 1,
  'get' => 2,
  'including' => 1,
  'beginning' => 1,
  'it' => 3,
  'example' => 1,
  'information' => 1,
  'separable' => 1,
  'British' => 1,
  'click' => 1,
  'phrasal' => 2,
  'then' => 1,
  'You' => 1,
  'more' => 1,
  'an' => 1
}
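For completeness, here is roughly what the loading side could look like with do. This is only a sketch: the sub name load_counts and the file name counts.pl are assumptions for illustration, and it presumes you redirected the dumped output into that file. Give do an absolute path or one starting with ./, since a bare relative name is searched for in @INC (which no longer includes the current directory on Perl 5.26 and later).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: load a hash reference from a file that contains
# Data::Dumper output written with $Data::Dumper::Terse = 1.
sub load_counts {
    my ($file) = @_;

    # do FILE compiles and runs the file, returning its last value
    my $counts = do $file;
    die "Can't parse $file: $@\n" if $@;
    die "Can't read $file: $!\n"  unless defined $counts;
    die "$file did not return a hash reference\n"
        unless ref $counts eq 'HASH';
    return $counts;
}

# Usage: perl load_counts.pl ./counts.pl
if (@ARGV) {
    my $counts = load_counts($ARGV[0]);
    printf "%5d %s\n", $counts->{$_}, $_
        for sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts;
}
```

Keep in mind that do executes whatever Perl code is in the file, so only load files you wrote yourself.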
