Contributed by ghenry
on May 17, 2005 at 13:11 UTC
Description: You want to find the unique words in a file, count how often each one occurs, and print a summary, while also saving the results for later use.
Using a simple database (a dbm file), you could do the following (adapted from Perl Cookbook, 2nd ed.):
#!/usr/bin/perl
use strict;
use warnings;
# Basename of Simple Database file
my $DBFILE = 'wordcount_db';
# open database, accessed through %WORDS
dbmopen (my %WORDS, $DBFILE, 0666)
    or die "Can't open $DBFILE: $!\n";
# Make a word frequency counter
while (<>) {
    while ( /(\w['\w-]*)/g ) {
        $WORDS{lc $1}++;
    }
}
# Output hash in a descending numeric sort of its values
foreach my $word ( sort { $WORDS{$b} <=> $WORDS{$a} } keys %WORDS) {
    printf "%5d %s\n", $WORDS{$word}, $word;
}
# Close the database
dbmclose %WORDS;
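To get at the saved counts in a later run, you can reopen the same dbm file and read the tied hash. A minimal sketch, assuming the database was written by the script above; the sub name top_words and passing the database basename on the command line are my own choices for illustration, not part of the Cookbook recipe:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return up to $limit [word, count] pairs from the dbm file,
# most frequent first (ties broken alphabetically).
# top_words is a hypothetical helper name.
sub top_words {
    my ($dbfile, $limit) = @_;
    # Same mode as the writer; it only matters if the file must be created
    dbmopen(my %words, $dbfile, 0666)
        or die "Can't open $dbfile: $!\n";
    # Copy into an ordinary hash so the sort works on memory, not the file
    my %copy = %words;
    dbmclose(%words);
    my @sorted = sort { $copy{$b} <=> $copy{$a} || $a cmp $b } keys %copy;
    splice(@sorted, $limit) if @sorted > $limit;
    return map { [ $_, $copy{$_} ] } @sorted;
}

# Demo: print the ten most frequent words from the database named in $ARGV[0]
if (@ARGV) {
    printf "%5d %s\n", $_->[1], $_->[0] for top_words($ARGV[0], 10);
}
```

Because %WORDS lives in the file, running the counting script again adds to the counts already stored there. Delete the database files first if you want a fresh count (their extensions depend on which dbm library your perl uses).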
TIMTOWTDI, however.
Answer: How do I count the frequency of words in a file and save them for later?
contributed by planetscape
I usually use this script and pipe the results to a text file:
#!/usr/local/bin/perl
# $Id: wordfreq.perl,v 1.13 2001/05/16 23:46:40 doug Exp $
# http://www.bagley.org/~doug/shootout/ <= old URL; dead now
# http://dada.perl.it/shootout/wordfreq.perl.html <= URL as of time this post was written
# Tony Bowden suggested using tr versus lc and split(/[^a-z]/)
use strict;
my %count = ();
while (read(STDIN, $_, 4095) and $_ .= <STDIN>) {
    tr/A-Za-z/ /cs;
    ++$count{$_} foreach split(' ', lc $_);
}
my @lines = ();
my ($w, $c);
push(@lines, sprintf("%7d\t%s\n", $c, $w)) while (($w, $c) = each(%count));
print sort { $b cmp $a } @lines;
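The tr/A-Za-z/ /cs line does the tokenizing in this script: c complements the character set and s squeezes runs, so every stretch of non-letters collapses into a single space. A small sketch to see the effect on its own (words_in is a name made up for illustration):

```perl
use strict;
use warnings;

# Split text into lowercase words the way the script above does.
# words_in is a hypothetical helper name.
sub words_in {
    my ($text) = @_;
    # Replace each run of non-letters with one space
    $text =~ tr/A-Za-z/ /cs;
    # split ' ' also discards any leading whitespace
    return split ' ', lc $text;
}

print join('|', words_in("Don't stop, it's 4pm!")), "\n";
# prints: don|t|stop|it|s|pm
```

So "don't" is counted as "don" and "t" here, whereas the \w['\w-]* regex in the first answer keeps it as one word. Digits are dropped as well.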
Answer: How do I count the frequency of words in a file and save them for later?
contributed by whakka
Using a one-liner (with the same output formatting as the previous posts):
$ perl -nle '$w{$_}++ for grep /\w/, map { s/[\. ,]*$//g; lc($_) } split; sub END { printf("%7d\t%s\n", $c, $w) while (($w,$c) = each(%w)) }' files...
Answer: How do I count the frequency of words in a file and save them for later?
contributed by rcaputo
The standard UNIX tool chain works fine:
perl -nle "print for /(\w['\w-]*)/g" input.text | sort | uniq -c | sort -rn | tee word-list.text
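If you want the counts back in a hash later, the word-list.text file written by tee is easy to parse, since uniq -c puts the count first on each line. A rough sketch (read_word_list is a hypothetical helper, not something from the post above):

```perl
use strict;
use warnings;

# Read "count word" lines, as produced by uniq -c, from a filehandle
# and return a hash ref of word => count.
# read_word_list is a hypothetical helper name.
sub read_word_list {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        # Optional leading spaces, the count, whitespace, then the word
        next unless $line =~ /^\s*(\d+)\s+(\S+)/;
        $count{$2} = $1;
    }
    return \%count;
}

# Demo: load the file named in $ARGV[0] and report how many distinct words it has
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!\n";
    my $count = read_word_list($in);
    close $in;
    printf "%d distinct words\n", scalar keys %$count;
}
```

Note that the pipeline above does not lowercase the words, so "Click" and "click" stay separate entries.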
Answer: How do I count the frequency of words in a file and save them for later?
contributed by amitbhosale
In my example, the output is in the form of a Perl hash (ref) structure.
This lets you load it easily in another program using do.
my %seen=();
while(<>)
{
    chomp;
    foreach my $word ( grep /\w/, split )
    {
        $word =~ s/[. ,]*$//; # strip off punctuation, etc.
        $seen{$word}++;
    }
}
use Data::Dumper;
$Data::Dumper::Terse = 1;
print Dumper \%seen;
For example, given an input file containing:
Click on a letter above to see phrasal verbs beginning with that letter.
You will get a list of phrasal verbs and their definitions. If you then
click on an individual verb, you will get more information on it,
including example sentences, whether it is British or American English,
and whether it is separable or not.
Output looks like:
{
'you' => 2,
'a' => 2,
'not' => 1,
'that' => 1,
'sentences' => 1,
'individual' => 1,
'see' => 1,
'on' => 3,
'American' => 1,
'or' => 2,
'verb' => 1,
'Click' => 1,
'list' => 1,
'English' => 1,
'letter' => 2,
'their' => 1,
'whether' => 2,
'with' => 1,
'and' => 2,
'verbs' => 2,
'of' => 1,
'is' => 2,
'definitions' => 1,
'to' => 1,
'above' => 1,
'will' => 2,
'If' => 1,
'get' => 2,
'including' => 1,
'beginning' => 1,
'it' => 3,
'example' => 1,
'information' => 1,
'separable' => 1,
'British' => 1,
'click' => 1,
'phrasal' => 2,
'then' => 1,
'You' => 1,
'more' => 1,
'an' => 1
}
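Loading the saved structure back in with do could look something like this. It is only a sketch under a couple of assumptions: the Dumper output was redirected to a file whose name you pass in, and load_counts is just a name picked for illustration. Remember that do runs the file as Perl code, so only use it on files you trust.

```perl
use strict;
use warnings;
use File::Spec;

# Load a hash ref previously written with Data::Dumper (Terse = 1).
# load_counts is a hypothetical helper name.
sub load_counts {
    my ($file) = @_;
    # Use an absolute path: a bare relative name makes do search @INC,
    # which no longer includes "." as of Perl 5.26
    my $path = File::Spec->rel2abs($file);
    my $counts = do $path;
    die "Can't parse $file: $@" if $@;
    die "Can't read $file: $!\n" unless defined $counts;
    die "$file did not return a hash ref\n" unless ref $counts eq 'HASH';
    return $counts;
}

# Demo: print the counts from the file named in $ARGV[0], most frequent first
if (@ARGV) {
    my $seen = load_counts($ARGV[0]);
    printf "%5d %s\n", $seen->{$_}, $_
        for sort { $seen->{$b} <=> $seen->{$a} || $a cmp $b } keys %$seen;
}
```

Since the counting script does not lowercase its input, the loaded hash keeps 'You' and 'you' (or 'Click' and 'click') as separate keys, as in the output above.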