Re: Find most frequently used word in text file.

in reply to Find most frequently used word in text file.

Hi,

I am not claiming that my style is any better, but my code is definitely much shorter, and it might give you some ideas for the future. The following is taken from a tutorial on list operators that I wrote in French some time ago. The code does pretty much exactly what you want. One addition: I decided to strip the accents from the French text being analyzed. I also translated some variable names into English; I hope I did not introduce a bug in doing so.

#!/usr/bin/perl
use strict;
use warnings;

my %words;
my $text;
{
    local $/ = undef; 
    $text = <>; 
}
$words{$_}++ foreach 
            map {$_= lc; tr/àâéèêëîôùûç/aaeeeeiouuc/; $_;} 
            split /[,.:;"?!'\n ]+/, $text;
print map {$words{$_}, "\t$_\n"} 
      sort {$words{$b} <=> $words{$a} || $a cmp $b} 
      keys %words;
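In case the chained list operators are hard to read (they run from right to left: split, then map, then foreach), here is a rough sketch of the same counting step written out one statement at a time. The sub name `count_words` and the sample sentence are made up for this illustration, and I am assuming the script itself is saved as UTF-8 (hence `use utf8`); text read from a file would need to be decoded accordingly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;   # assumption: this source file is saved as UTF-8

# Hypothetical helper (not in the program above): the same pipeline,
# unrolled into one step per statement.
sub count_words {
    my ($text) = @_;
    my %count;

    # Step 1 (split): cut the text on runs of punctuation, newlines and spaces.
    my @tokens = split /[,.:;"?!'\n ]+/, $text;

    for my $word (@tokens) {
        # Step 2 (map): lower-case the token, then replace accented letters.
        $word = lc $word;
        $word =~ tr/àâéèêëîôùûç/aaeeeeiouuc/;

        # Step 3 (foreach): increment the counter for this word.
        $count{$word}++;
    }
    return \%count;
}

# Made-up sample sentence, for illustration only.
my $count = count_words("Le roi dit: l'éternel est roi.\n");
print "$count->{$_}\t$_\n" for sort keys %$count;
```

The one-liner version does exactly these three steps, only without the intermediate variables.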

My sorting logic is slightly different from yours: sort by descending frequency and, if the same frequency, by ascending "asciibetical" order.
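To make the two-level sort concrete, here is a small sketch on a hand-made hash. The counts below and the sub name `by_freq_then_alpha` are invented for the example; only the comparison block is the one used in my program.

```perl
use strict;
use warnings;

# Hypothetical sample counts, invented for illustration.
my %words = (roi => 3, dieu => 3, pays => 1, fils => 5);

# Sort keys by descending count. <=> returns 0 when two counts are
# equal, so || falls through to cmp, which orders ties alphabetically.
sub by_freq_then_alpha {
    my ($h) = @_;
    return sort { $h->{$b} <=> $h->{$a} || $a cmp $b } keys %$h;
}

my @ordered = by_freq_then_alpha(\%words);
print "$words{$_}\t$_\n" for @ordered;
```

With these counts the words come out as fils, dieu, roi, pays: "dieu" and "roi" are tied at 3, so the `cmp` tie-breaker puts "dieu" first.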

Running this program on an old (public domain) French translation of the Bible (the full text, both Old and New Testaments, i.e. about 32,000 verses and a bit more than 710,000 words), I obtained the following histogram:

33093   de
31980   et
19813   la
18170   a
18132   l
17535   le
16774   les
12391   il
10103   qui
9844    des
9492    d
[...]

It then occurred to me that all these very short words are not very interesting for a linguistic (or semantic, or theological, or historical) analysis of the text, so I decided to "grep out" words of two characters or fewer by changing the relevant code to:

$words{$_}++ foreach
            map { $_= lc; tr/àâéèêëïîôùûç/aaeeeeiiouuc/; $_;}
            grep {length > 2} 
            split /[,.:;"?!'\n ]+/, $text;
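Since the operators run from right to left, the grep sees the raw tokens coming out of split, before they are lower-cased and stripped of accents. As a small sketch of the filtering step on its own (the sub name `long_words`, the `$min` parameter and the sample sentence are made up for this example):

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the tokens longer than $min characters.
sub long_words {
    my ($min, @tokens) = @_;
    return grep { length($_) > $min } @tokens;
}

# Made-up sample sentence, for illustration only.
my @tokens = split /[,.:;"?!'\n ]+/, "Il dit a son fils: va dans le pays";
my @kept   = long_words(2, @tokens);
print "@kept\n";
```

On this sentence the tokens "Il", "a", "va" and "le" are dropped, and "dit son fils dans pays" is what remains. Note that `length` counts characters only if the text has been decoded; on raw UTF-8 bytes an accented letter would count for two.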

The histogram now begins as follows:

16774   les
10103   qui
9844    des
9112    que
7350    est
6966    eternel
6826    dans
6336    vous
6284    pour
5931    ils
4546    pas
4272    sur
4176    dieu
4161    fils
4041    lui
3864    dit
3808    une
3510    son
3349    avec
3184    nous
3091    car
2993    par
2958    ses
2924    comme
2793    leur
2602    israel
2590    mais
2563    roi
2548    tous
2418    mon
2293    point
2255    ton
2120    tout
2069    sont
2046    elle
1949    maison
1910    leurs
1856    avait
1846    toi
1800    homme
1799    pays
1784    peuple
1773    etait
1736    moi
1668    ceux
1642    aux
1591    tes
1580    devant
1517    plus
1513    celui
1474    fait

Just in case you wanted to know, the program runs on the full Bible text in less than two seconds on my laptop.

Please feel free to ask if you need information on how this works.

