PerlMonks  

Re: Code for generating a word frequency count not working

by graff (Chancellor)
on Nov 24, 2014 at 05:36 UTC


in reply to Code for generating a word frequency count not working

Apart from the very good comments above, I would add:

(1) This sort of process is usually better off without having specific file names hard-coded in the script. You can use one or more command-line args for input files, or run some other command that prints text to stdout and pipe that to your script's STDIN. If your script prints results to STDOUT, you can either use redirection on the command line to create an output file (e.g.: your_script.pl some_files*.txt > word_hist.txt) or pipe the output to some other process.

(2) GrandFather already pointed out a different sorting method, but I think it's better to sort numerically, and then format the numbers for output. (If you really want leading zeros in the output, that's fine and easy, but you don't need to do that just to sort the output.) Also, for sets of words that occur with the same frequency, it's often useful to have them listed in alphabetical order.

(3) The OP method of conditioning the text will work fine so long as your input data is always ASCII-only text, but if you happen to end up with data that contains things like "pie à la mode" or "naïve", your results will be inaccurate (à won't be counted at all, and naïve will be counted as two "words", na and ve). In this case, you need to know what character encoding is being used (utf8?, cp1252? something else?), and decode the input accordingly.
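To make point (3) concrete, here is a minimal sketch comparing an ASCII-only character class with one based on the Unicode "letter" property. The helper name words_from and the sample sentence are made up for illustration, and the sketch assumes the script source itself is saved as utf8 (hence "use utf8"):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # this source file contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: lowercase a line, replace every run of characters
# matched by $junk with a single space, and return the remaining "words".
sub words_from {
    my ( $line, $junk ) = @_;
    ( my $text = lc $line ) =~ s/$junk/ /g;
    return split ' ', $text;
}

my $sample = "A naïve pie à la mode";

# ASCII-only: anything that is not a-z, apostrophe or hyphen is a separator
my @ascii_only = words_from( $sample, qr/[^a-z'-]+/ );

# Unicode-aware: \p{L} matches a letter in any script
my @unicode = words_from( $sample, qr/[^\p{L}'-]+/ );

print "ASCII class:   @ascii_only\n";  # a na ve pie la mode
print "Unicode class: @unicode\n";     # a naïve pie à la mode
```

The Unicode-aware class only helps if the input has actually been decoded into characters first; on undecoded bytes the accented letters would still be split apart.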

Taking those points into account (and assuming utf8 as the most likely case for non-ASCII content):

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use open IN => ':utf8';
binmode STDIN, ':utf8';
binmode STDOUT, ':utf8';

my %freq;
while (<>) {    # reads from STDIN or from all file names in @ARGV
    $_ = lc;
    s/[^\p{L}'-]+/ /g;    # keep letters (not just a-z), apostrophes and hyphens
    for my $word ( split ) {
        $freq{$word}++;
    }
}
# most frequent first; words with equal counts in alphabetical order
for ( sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq ) {
    printf "%05d %s\n", $freq{$_}, $_;
    # or to list results on larger data sets without leading zeros:
    # printf "%9d %s\n", $freq{$_}, $_;
}
(UPDATE: I was tempted to add a line or two inside the for my $word ( split ) loop, to remove initial and final apostrophes from each word - that's 'cuz some folks' typing habits include using apostrophes as single quotes - but sometimes an initial or final apostrophe should be treated as 'part of the word'. It's up to you how you want to handle that.)
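If you do decide to strip them, one possible way is sketched below. The helper name strip_outer_apostrophes and the sample string are hypothetical, and this is only one of several reasonable policies:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove apostrophes at the start and end of a token,
# leaving internal ones (as in "don't") alone.
sub strip_outer_apostrophes {
    my ($word) = @_;
    $word =~ s/^'+|'+$//g;
    return $word;
}

my %freq;
for my $word ( split ' ', "'cuz folks' don't say 'hello' '" ) {
    $word = strip_outer_apostrophes($word);
    next unless length $word;    # a lone apostrophe becomes empty; skip it
    $freq{$word}++;
}

print "$_\n" for sort keys %freq;    # cuz, don't, folks, hello, say (one per line)
```

Note the trade-off: 'hello' is correctly reduced to hello, but 'cuz and folks' also lose apostrophes that arguably belong to the word.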
