PerlMonks  

Find most frequently used word in text file.

by jonesd14 (Initiate)
on Dec 19, 2013 at 19:13 UTC ( #1067862=perlquestion )
jonesd14 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks (first post).

I briefly learned Perl in my Programming Languages course last spring. I thought it was great, but haven't used it much since. I want to start learning the language again, and learn it correctly. I'm aware that there are many different ways to do anything in Perl, so let me know if anything in my code below can be improved or made more "Perl-ish":

The point of this code is to read a .txt file and find the most frequently used word, and report the number of times it's used. If there are multiple words used the most frequently, the program just chooses the first match. The program seems to currently work fine, but as I said, please let me know if anything can be improved upon!

    ##
    # FILE: mostFreqWord.pl
    # AUTHOR: Daniel Jones
    # CREATED: 12/18/2013
    # MODIFIED: 21/18/2013
    ##

    die "ERROR: Must enter one file name.\n" unless $#ARGV == 0;
    open FILE, "<", $ARGV[0] or die "Could not open $ARGV[0] for reading.\n";

    #the hash to contain the word-count pairs.
    my %hash;
    my @lines = <FILE>;

    #go through each line in the file
    foreach my $line(@lines){
        #skip non-word characters
        my @words = split /\W+/, $line;

        #go through each word in the file
        foreach my $word(@words){
            chomp $word;
            #force all words to lowercase
            $word = lc $word;
            #remove beginning/trailing whitespace
            $word =~ s/^\s+|\s+$//g;
            my $key = $file.$word;

            #if the word exists, increment its value.
            #otherwise, set it to 1.
            if(exists $hash{$key}){
                $hash{$key}++;
            }
            else{
                $hash{$key} = 1;
            }
        }
    }
    close FILE;

    my @values;
    #get the values from the hash
    foreach my $key(keys %hash){
        push @values, $hash{$key};
    }
    @values = sort @values;
    my @keys = keys %hash;
    my $idx = 0;
    my $bestVal = @values[-1];
    my $bestKey;
    foreach my $key(@keys){
        if ($hash{$key} == $bestVal){
            $bestKey = $key;
            last;
        }
    }
    print "The most frequent word in $ARGV[0] is $bestKey, which was seen $bestVal times.\n";

Re: Find most frequently used word in text file.
by NetWallah (Abbot) on Dec 19, 2013 at 19:30 UTC
    Try this alternative code, and let us know if you need help understanding it. Please indicate exactly which part you need help with:
    perl -anE "for (@F){s/\W//g;$_ or next; $h{$_}++}}{$k=(sort {$h{$a} <=> $h{$b}} keys %h)[-1]; say qq|The most frequent word in $ARGV is $k, which was seen $h{$k} times.|" **YOUR-FILE-NAME**
    (Use single quotes if running on Linux.)

                 When in doubt, mumble; when in trouble, delegate; when in charge, ponder. -- James H. Boren

Re: Find most frequently used word in text file.
by roboticus (Canon) on Dec 19, 2013 at 19:44 UTC

    jonesd14:

    The first half is pretty good. There are a few quibbles, but nothing bad. For the second half, though, you're doing *far* too much work to get the best key/value from the hash. I'd suggest something more like this:

    my ($bestVal, $bestKey) = (-1);
    foreach my $key (@keys) {
        if ($hash{$key} > $bestVal) {
            ($bestVal, $bestKey) = ($hash{$key}, $key);
        }
    }
    print "The most frequent word in $ARGV[0] is $bestKey, which was seen $bestVal times.\n";

    Now on to a few of the quibbles:

    • Your variable names are good except for two: %hash and @arrays.
    • chomp $word; is unnecessary, since you just split the string up at non-word characters. Similarly, the regex substitution to remove whitespace is redundant.
    • If your file is big, you may run out of memory because you're reading the entire file at once. You might try:
      while (my $line = <FILE>) {
      This has the additional advantage of removing the need for the @lines array.
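
A minimal sketch of how the suggestions above might fit together: read line by line, skip the redundant chomp and whitespace trimming, and find the best pair in a single pass over the hash. The subroutine name most_frequent_word and the in-memory sample text are illustrative assumptions, not part of the original post. When several words tie for the top count, the winner depends on hash order.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is illustrative): counts words read from an
# already-open filehandle and returns (word, count) for the most frequent one.
sub most_frequent_word {
    my ($fh) = @_;
    my %count;

    # Read one line at a time so the whole file never sits in memory.
    while (my $line = <$fh>) {
        # split on runs of non-word characters; grep drops the empty leading
        # field that split produces when a line starts with punctuation.
        $count{ lc $_ }++ for grep { length } split /\W+/, $line;
    }

    # Single pass over the hash, keeping the best pair seen so far.
    my ($best_word, $best_count) = (undef, -1);
    for my $word (keys %count) {
        if ($count{$word} > $best_count) {
            ($best_word, $best_count) = ($word, $count{$word});
        }
    }
    return ($best_word, $best_count);
}

# Usage: an in-memory filehandle stands in for a real file here.
my $sample = "The cat saw the dog.\nThe dog ran.\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my ($word, $count) = most_frequent_word($fh);
close $fh;
print "Most frequent word: $word ($count times)\n";
```

Passing a filehandle rather than a file name keeps the counting logic separate from the argument checking, which also makes it easy to try out on a short string.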

    I didn't see any real problems, just unnecessary work. Nice work!

    Update: Updated first quibble.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Find most frequently used word in text file.
by Kenosis (Priest) on Dec 19, 2013 at 19:47 UTC

    If there are multiple words used the most frequently, the program just chooses the first match.

    Perhaps the following doesn't address your general question, but it provides the option of showing all words with the highest frequency:

    use strict;
    use warnings;
    use List::Util qw/max/;

    my ( %words, %count );
    my $file = $ARGV[0];

    while (<>) {
        $words{ lc $_ }++ for split /\W+/;
    }

    push @{ $count{ $words{$_} } }, $_ for keys %words;
    my $max = max keys %count;

    print "The most frequent word(s) in $file: @{ $count{$max} }; Times seen: $max.\n"

    This uses a hash to store word/count pairs. Next, it creates a hash of arrays (HoA), where the key is the count and the value is a reference to a list of words associated with that count. It then uses List::Util to find the max count, and that count is used to display the word(s) and frequency.
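
To make the hash-of-arrays step concrete, here is a small sketch run on a hand-built word/count hash. The helper name words_with_max_count and the sample data are assumptions made for illustration. It expects a non-empty hash and sorts the tied words only so that the result does not depend on hash order.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: takes a hash ref of word => count and returns
# the top count followed by a sorted list of every word with that count.
sub words_with_max_count {
    my ($words) = @_;

    # Invert the hash: each count maps to an array ref of words.
    my %by_count;
    push @{ $by_count{ $words->{$_} } }, $_ for keys %$words;

    # The keys of %by_count are the counts, so max picks the top one.
    my $max = max keys %by_count;
    return ($max, sort @{ $by_count{$max} });
}

my %words = (apple => 3, pear => 1, plum => 3);
my ($max, @top) = words_with_max_count(\%words);
print "Top count $max: @top\n";
```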

    Hope this helps!

Re: Find most frequently used word in text file.
by toolic (Chancellor) on Dec 19, 2013 at 19:51 UTC
Re: Find most frequently used word in text file.
by Laurent_R (Parson) on Dec 19, 2013 at 22:54 UTC

    Hi,

    I am not claiming that my style is any better, but my code is definitely much shorter and it might give you some ideas for the future. The following is taken from a tutorial I wrote some time ago in French on the use of list operators. The code does pretty much exactly what you want. One extra detail: I decided to remove accents from the French text being studied. I also translated some variable names into English; I hope I did not introduce a bug in doing so.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %words;
    my $text;
    {
        local $/ = undef;
        $text = <>;
    }
    $words{$_}++ foreach map {$_= lc; tr/àâéèêëîôùûç/aaeeeeiouuc/; $_;}
        split /[,.:;"?!'\n ]+/, $text;
    print map {$words{$_}, "\t$_\n"}
        sort {$words{$b} <=> $words{$a} || $a cmp $b} keys %words;
    My sorting logic is slightly different from yours: sort by descending frequency and, if the same frequency, by ascending "asciibetical" order.
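
The same two-level comparison can also be used to keep only the first N entries of the sorted list. This is a sketch only: the helper name top_words and the sample counts are assumptions, not taken from the tutorial.

```perl
use strict;
use warnings;

# Hypothetical helper: returns up to $n words from a word => count hash ref,
# most frequent first; equal counts fall back to plain string comparison.
sub top_words {
    my ($words, $n) = @_;
    my @sorted = sort { $words->{$b} <=> $words->{$a} || $a cmp $b }
                 keys %$words;
    $n = @sorted if $n > @sorted;    # do not slice past the end of the list
    return @sorted[ 0 .. $n - 1 ];
}

my %words = (le => 5, de => 9, la => 5, et => 7);
print "$words{$_}\t$_\n" for top_words(\%words, 3);
```

The `||` only evaluates the `cmp` when the two counts compare equal, so the string comparison acts purely as a tie-breaker.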

    Applying this program on an old (public domain) French translation of the Bible (the full text, both Old and New Testaments, i.e. about 32,000 verses and a bit more than 710,000 words), I obtained the following histogram:

    33093   de
    31980   et
    19813   la
    18170   a
    18132   l
    17535   le
    16774   les
    12391   il
    10103   qui
    9844    des
    9492    d
    [...]
    I then had the idea that all these very short words were not very interesting for linguistic (or semantic, or theological, or historical) analysis of the text, so I decided to "grep out" words with two characters or less by changing the relevant code to:
    $words{$_}++ foreach map { $_= lc; tr/àâéèêëîïôùûç/aaeeeeiiouuc/; $_;}
        grep {length > 2} split /[,.:;"?!'\n ]+/, $text;
    This now gives me the following beginning of histogram:

    Just in case you wanted to know, the program runs on the full Bible text in less than two seconds on my laptop.

    Please feel free to ask if you need information on how this works.

Re: Find most frequently used word in text file.
by Bloodnok (Vicar) on Dec 20, 2013 at 09:18 UTC
    In addition to the bug pointed out by toolic, I believe there is a further bug: your code will only produce a frequency report for the first line, not the whole file as you expect. The line my @lines = <FILE>; won't slurp in the whole file; it will read only the first line, i.e. all text up to the first end-of-line character. You need to undef the line delimiter as in Laurent_R's code, i.e. undef $/;.

    HTH,

    A user level that continues to overstate my experience :-))

      This is not correct. From perlop under I/O operators:

      If a <FILEHANDLE> is used in a context that is looking for a list, a list comprising all input lines is returned, one line per list element.
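
A short sketch of the difference, using an in-memory filehandle so that it runs without any external file. The sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $data = "first line\nsecond line\nthird line\n";

# List context: every remaining line, one per list element.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @lines = <$fh>;
close $fh;

# Scalar context: only the next line (here, the first one).
open $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $one = <$fh>;
close $fh;

# Scalar context with $/ undefined: the whole file as a single string.
open $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $all = do { local $/; <$fh> };
close $fh;

printf "list context read %d lines\n", scalar @lines;
print  "scalar context read: $one";
printf "slurp mode read %d characters\n", length $all;
```

Undefining $/ therefore only matters when reading into a scalar; assigning to an array already reads every line.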
        Aha, TFT hdb - good spot - it's early in the morning (for me) and my contexts are somewhat confused :-(

        My comment would be applicable if the OP was trying to slurp a file into a scalar, which they aren't. I'm not at all sure i.e. can't remember, if I've yet attempted to slurp a file into a list.

        A user level that continues to overstate my experience :-))
Re: Find most frequently used word in text file.
by dcmertens (Beadle) on Dec 20, 2013 at 13:24 UTC

    Others have addressed your question. I have one philosophical point to make:

    ... there are many different ways to do anything in Perl...

    There are many different ways to do anything in any programming language. It just so happens that Perl was designed in light of this fact, not in spite of it, and the Perl community embraces diversity.

Re: Find most frequently used word in text file.
by Anonymous Monk on Feb 20, 2014 at 09:20 UTC
    Can you show me how to make a list with for example the 50 most common words?
