Re: Find most frequently used word in text file. by Laurent_R (Priest)
on Dec 19, 2013 at 22:54 UTC
I am not claiming that my style is any better, but my code is definitely much shorter, and it might give you some ideas for the future. The following is taken from a tutorial I wrote some time ago, in French, on the use of list operators. The code does pretty much exactly what you want. In addition, I decided to remove accents from the French text to be studied. I also changed some variable names into English; I hope I did not introduce a bug in doing so.
My sorting logic is slightly different from yours: sort by descending frequency and, if the same frequency, by ascending "asciibetical" order.
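As a minimal sketch of that sort order (not the tutorial's listing: the helper name `sorted_word_counts`, the `%freq` hash, and splitting on whitespace after lowercasing are all assumptions made for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: count the words of a string and return
# [word, count] pairs sorted by descending frequency and, for equal
# frequencies, by ascending string ("asciibetical") order.
sub sorted_word_counts {
    my ($text) = @_;
    my %freq;
    # Assumption: words are whitespace-separated; case is folded first.
    $freq{$_}++ for split ' ', lc $text;
    my @words = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
    return map { [ $_, $freq{$_} ] } @words;
}

# Print a small histogram for a sample sentence.
printf "%-10s %d\n", @$_
    for sorted_word_counts("le roi et la reine et le roi et");
```

The `||` in the sort block falls through to the string comparison only when the numeric comparison returns 0, so "le" and "roi" (two occurrences each) come out in alphabetical order after "et".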
Running this program on an old (public domain) French translation of the Bible (the full text, both Old and New Testaments, i.e. about 32,000 verses and a bit more than 710,000 words), I obtained the following histogram:
It then occurred to me that all these very short words are not very interesting for a linguistic (or semantic, or theological, or historical) analysis of the text, so I decided to "grep out" words of two characters or fewer by changing the relevant code to:
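A sketch of such a filter, under the assumption that the words are already available as a list (the helper name `long_words` is hypothetical, and this is not necessarily the exact change made in the tutorial code):

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the words longer than two characters.
sub long_words {
    my (@words) = @_;
    return grep { length($_) > 2 } @words;
}

# Example: only "roi" and "reine" survive the filter.
my @kept = long_words(qw(le roi et la reine));
print "@kept\n";
```

In a pipeline of list operators, the same `grep { length($_) > 2 }` step can simply be placed before the counting stage, so short words never reach the frequency hash.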
This now gives me the following start of the histogram:
Just in case you wanted to know, the program runs on the full Bible text in less than two seconds on my laptop.
Please feel free to ask if you need information on how this works.