Re: Find most frequently used word in text file.

by Laurent_R (Canon)
on Dec 19, 2013 at 22:54 UTC (#1067896)

in reply to Find most frequently used word in text file.


I am not claiming that my style is any better, but my code is definitely much shorter, and it might give you some ideas for the future. The following is taken from a tutorial on list operators that I wrote some time ago in French. The code does pretty much exactly what you want, with one addition: I decided to remove accents from the French text being studied. I also changed some variable names into English; I hope I did not introduce a bug in doing so.

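    #!/usr/bin/perl
    use strict;
    use warnings;

    my %words;
    my $text;
    {
        local $/ = undef;    # slurp mode: read the whole input in one go
        $text = <>;
    }

    # Read right to left: split into words, lowercase and strip accents,
    # then count. The accented search list of tr/// was lost in this copy
    # of the post; the one below is an assumed reconstruction.
    $words{$_}++
        foreach map { $_ = lc; tr/àâéèêëîïôùûç/aaeeeeiiouuc/; $_; }
        split /[,.:;"?!'\n ]+/, $text;

    # Print "count<TAB>word", most frequent first.
    print map { $words{$_}, "\t$_\n" }
        sort { $words{$b} <=> $words{$a} || $a cmp $b } keys %words;
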
My sorting logic is slightly different from yours: sort by descending frequency and, if the same frequency, by ascending "asciibetical" order.
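
As a minimal sketch of that comparator on its own (the sample counts below are invented for illustration and have nothing to do with the real data):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample counts, made up for this example only.
my %words = (le => 3, la => 3, de => 5, et => 1);

# Primary key: the count, descending ($b is compared before $a).
# Secondary key: the word itself, ascending. It is only consulted when the
# counts are equal, because <=> then returns 0 and || falls through to cmp.
my @ordered = sort { $words{$b} <=> $words{$a} || $a cmp $b } keys %words;

print "$words{$_}\t$_\n" for @ordered;
```

This prints "de" first, then "la" before "le" (same count, so plain string order decides), and "et" last.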

Running this program on an old (public domain) French translation of the Bible (the full text, both Old and New Testaments, i.e. about 32,000 verses and a bit more than 710,000 words), I obtained the following histogram:

    33093  de
    31980  et
    19813  la
    18170  a
    18132  l
    17535  le
    16774  les
    12391  il
    10103  qui
     9844  des
     9492  d
    [...]

It then occurred to me that all these very short words are not very interesting for a linguistic (or semantic, or theological, or historical) analysis of the text, so I decided to "grep out" words of two characters or fewer by changing the relevant code to:

    # accented search list of tr/// reconstructed (assumption, lost in this copy)
    $words{$_}++
        foreach map { $_ = lc; tr/àâéèêëîïôùûç/aaeeeeiiouuc/; $_; }
        grep { length > 2 }
        split /[,.:;"?!'\n ]+/, $text;

The beginning of the histogram now looks like this:

    16774  les
    10103  qui
     9844  des
     9112  que
     7350  est
     6966  eternel
     6826  dans
     6336  vous
     6284  pour
     5931  ils
     4546  pas
     4272  sur
     4176  dieu
     4161  fils
     4041  lui
     3864  dit
     3808  une
     3510  son
     3349  avec
     3184  nous
     3091  car
     2993  par
     2958  ses
     2924  comme
     2793  leur
     2602  israel
     2590  mais
     2563  roi
     2548  tous
     2418  mon
     2293  point
     2255  ton
     2120  tout
     2069  sont
     2046  elle
     1949  maison
     1910  leurs
     1856  avait
     1846  toi
     1800  homme
     1799  pays
     1784  peuple
     1773  etait
     1736  moi
     1668  ceux
     1642  aux
     1591  tes
     1580  devant
     1517  plus
     1513  celui
     1474  fait

Just in case you wanted to know, the program runs on the full Bible text in less than two seconds on my laptop.
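
The chained list operators are evaluated from right to left: split first, then grep, then map, and finally the counting loop. Here is a rough sketch of the same pipeline unrolled into named steps. The helper name count_words and the sample sentence are hypothetical, and accent stripping is left out to keep the example free of encoding concerns:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_words is a hypothetical helper name used for illustration only.
sub count_words {
    my ($text) = @_;

    # Step 1 (rightmost operator): cut the text into tokens on
    # punctuation, spaces and newlines.
    my @tokens = split /[,.:;"?!'\n ]+/, $text;

    # Step 2: keep only the tokens longer than two characters.
    my @long = grep { length($_) > 2 } @tokens;

    # Step 3: normalise case so that "Roi" and "roi" count as one word.
    my @normalised = map { lc $_ } @long;

    # Step 4 (leftmost statement): tally each word in a hash.
    my %count;
    $count{$_}++ for @normalised;
    return \%count;
}

# Made-up sample sentence, not taken from the text analysed above.
my $count = count_words("Le roi dit: le Roi est le roi.");

print "$count->{$_}\t$_\n"
    for sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
```

On the sample sentence this reports "roi" three times and "dit" and "est" once each; "le" is filtered out by the length test. The one-statement version does exactly the same work without the intermediate arrays.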

Please feel free to ask if you need information on how this works.
