How do I count the frequency of words in a file and save them for later?

Contributed by ghenry on May 17, 2005 at 13:11 UTC

Description:

You want to find the unique words in a file, count how often each occurs, print a summary, and save the results for later use.

Using a simple database (a dbm file), you could do this (adapted from Perl Cookbook, 2nd ed.):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Basename of Simple Database file
    my $DBFILE = 'wordcount_db';

    # open database, accessed through %WORDS
    dbmopen (my %WORDS, $DBFILE, 0666)
        or die "Can't open $DBFILE: $!\n";

    # Make a word frequency counter
    while (<>) {
        while ( /(\w['\w-]*)/g ) {
            $WORDS{lc $1}++;
        }
    }

    # Output hash in a descending numeric sort of its values
    foreach my $word ( sort { $WORDS{$b} <=> $WORDS{$a} } keys %WORDS) {
        printf "%5d %s\n", $WORDS{$word}, $word;
    }

    # Close the database
    dbmclose %WORDS;
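
The "save them for later" half can be sketched as follows: reopen the same dbm file in a later run and read the stored counts. This is only a minimal sketch. The temporary directory and the sample words are assumptions made so the example is self-contained; a real program would point `$DBFILE` at the existing `wordcount_db`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Assumption: a throwaway directory, so the sketch never touches a real database.
my $dir    = tempdir( CLEANUP => 1 );
my $DBFILE = File::Spec->catfile( $dir, 'wordcount_db' );

# First run: store a few counts, as the counting script would.
{
    my %WORDS;
    dbmopen( %WORDS, $DBFILE, 0666 ) or die "Can't open $DBFILE: $!\n";
    $WORDS{ lc $_ }++ for qw(the cat and The hat);    # sample words (made up)
    dbmclose %WORDS;
}

# Later run: reopen the same file and copy the counts into an ordinary hash.
my %saved;
{
    my %WORDS;
    dbmopen( %WORDS, $DBFILE, 0666 ) or die "Can't open $DBFILE: $!\n";
    %saved = %WORDS;
    dbmclose %WORDS;
}

# Print the saved counts, highest first, ties in alphabetical order.
printf "%5d %s\n", $saved{$_}, $_
    for sort { $saved{$b} <=> $saved{$a} || $a cmp $b } keys %saved;
```

Because the dbm file persists between runs, counting the same input a second time adds to the stored counts rather than replacing them; delete the database files first if you want a fresh count.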

TIMTOWTDI however.

Answer: How do I count the frequency of words in a file and save them for later?
contributed by whakka

Using a one-liner (and same formatting as previous posts):

    $ perl -nle '$w{$_}++ for grep /\w/, map { s/[\. ,]*$//g; lc($_) } split; sub END { printf("%7d\t%s\n", $c, $w) while (($w,$c) = each(%w)) }' files...

Answer: How do I count the frequency of words in a file and save them for later?
contributed by planetscape

I usually use this script and pipe the results to a text file:

    #!/usr/local/bin/perl
    # $Id: wordfreq.perl,v 1.13 2001/05/16 23:46:40 doug Exp $
    # http://www.bagley.org/~doug/shootout/ <= old URL; dead now
    # http://dada.perl.it/shootout/wordfreq.perl.html <= URL as of time this post was written

    # Tony Bowden suggested using tr versus lc and split(/[^a-z]/)

    use strict;

    my %count = ();
    while (read(STDIN, $_, 4095) and $_ .= <STDIN>) {
        tr/A-Za-z/ /cs;
        ++$count{$_} foreach split(' ', lc $_);
    }

    my @lines = ();
    my ($w, $c);
    push(@lines, sprintf("%7d\t%s\n", $c, $w)) while (($w, $c) = each(%count));
    print sort { $b cmp $a } @lines;

planetscape

Answer: How do I count the frequency of words in a file and save them for later?
contributed by rcaputo

The standard UNIX tool chain works fine:

    perl -nle "print for /(\w['\w-]*)/g" input.text | sort | uniq -c | sort -rn | tee word-list.text

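
To use the saved list later, the lines written by `uniq -c` (a padded count, whitespace, then the word) can be parsed back into a hash. The following is a minimal sketch: the sample lines are made up, and an in-memory filehandle stands in for `word-list.text` so the example is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: sample lines in the shape `uniq -c` prints.
my $sample = <<'END';
      3 on
      2 you
      1 click
END

# An in-memory filehandle stands in for: open my $in, '<', 'word-list.text'
open my $in, '<', \$sample or die "Can't open in-memory file: $!\n";

my %count;
while ( my $line = <$in> ) {
    # Optional leading whitespace, the count, whitespace, then the word.
    next unless $line =~ /^\s*(\d+)\s+(\S+)/;
    $count{$2} = $1;
}
close $in;

print "$_ => $count{$_}\n" for sort keys %count;
```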
Answer: How do I count the frequency of words in a file and save them for later?
contributed by amitbhosale

In my example, the output is a Perl hash(ref) structure. This lets you load it easily in another program using do.

    my %seen=();
    while(<>) {
        chomp;
        foreach my $word ( grep /\w/, split ) {
            $word =~ s/[. ,]*$//;    # strip off punctuation, etc.
            $seen{$word}++;
        }
    }

    use Data::Dumper;
    $Data::Dumper::Terse = 1;
    print Dumper \%seen;

For example, given an input file containing:

    Click on a letter above to see phrasal verbs beginning with that letter.
    You will get a list of phrasal verbs and their definitions. If you then
    click on an individual verb, you will get more information on it,
    including example sentences, whether it is British or American English,
    and whether it is separable or not.

Output looks like:

    {
      'you' => 2,
      'a' => 2,
      'not' => 1,
      'that' => 1,
      'sentences' => 1,
      'individual' => 1,
      'see' => 1,
      'on' => 3,
      'American' => 1,
      'or' => 2,
      'verb' => 1,
      'Click' => 1,
      'list' => 1,
      'English' => 1,
      'letter' => 2,
      'their' => 1,
      'whether' => 2,
      'with' => 1,
      'and' => 2,
      'verbs' => 2,
      'of' => 1,
      'is' => 2,
      'definitions' => 1,
      'to' => 1,
      'above' => 1,
      'will' => 2,
      'If' => 1,
      'get' => 2,
      'including' => 1,
      'beginning' => 1,
      'it' => 3,
      'example' => 1,
      'information' => 1,
      'separable' => 1,
      'British' => 1,
      'click' => 1,
      'phrasal' => 2,
      'then' => 1,
      'You' => 1,
      'more' => 1,
      'an' => 1
    }
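
Loading that output in another program with do might look like the sketch below. The three sample counts and the temporary file are assumptions that keep the example self-contained; in practice the file would be whatever you redirected the Dumper output into.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use File::Temp qw(tempfile);

# Save step: dump a small hashref the same way the script above does.
my %seen = ( you => 2, on => 3, Click => 1 );    # sample counts (made up)
$Data::Dumper::Terse = 1;

# Assumption: a temporary file stands in for the saved output file.
my ( $fh, $file ) = tempfile( UNLINK => 1 );
print {$fh} Dumper( \%seen );
close $fh or die "Can't close $file: $!\n";

# Load step: do FILE evaluates the file and returns its last value,
# which here is the hash reference.
my $counts = do $file;
die "Can't parse $file: $@\n" if $@;
die "Can't read $file: $!\n" unless defined $counts;

# Print the loaded counts, highest first, ties in alphabetical order.
printf "%5d %s\n", $counts->{$_}, $_
    for sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts;
```

Two caveats: do runs the file as Perl code, so only load files you trust, and since Perl 5.26 a relative filename needs an explicit `./` prefix (for example `do './counts.pl'`, a hypothetical name) because the current directory is no longer searched by default.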
