http://www.perlmonks.org?node_id=475805

ghettofinger has asked for the wisdom of the Perl Monks concerning the following question:

Wise monks

I am trying to put a script together that will go through a text file and show me the top 5 words sorted by occurrence. Here is where I am at:

    while (<>) {
        chop;
        my @words = split;
        foreach $wd (@words) {
            next if length($wd) < 5;
            $count{$wd}++;
        }
    }
    foreach $w (keys %count) {
        print "$count{$w} $w\n";
    }

As you can see, I am only interested in words over 5 characters in length. I am unsure how to go about sorting this by number of occurrence. Also, I am having problems with punctuation showing up in my results. I have an example below:

    2 usurpations,
    2 purpose
    1 people.
    1 obstructed
    1 formidable
    1 obstructing
    1 uncomfortable,

Is this because I am using split? Is there a better way to go about this? I am sure I will start missing words that have apostrophes too. Also, how should I sort this? Then how can I take only the top 5?

Any help or advice is appreciated.

Many thanks,
ghettofinger

Replies are listed 'Best First'.
Re: Top five words by occurrence
by ikegami (Pope) on Jul 18, 2005 at 16:45 UTC
    how should I sort this? Then how can I take only the top 5?
    my $top = 5;
    foreach $w (sort { $count{$b} <=> $count{$a} } keys %count) {
        last if $top--;
        print "$count{$w} $w\n";
    }

    or

    my $top = 5;
    foreach $w ( (sort { $count{$b} <=> $count{$a} } keys %count)[0..$top-1] ) {
        print "$count{$w} $w\n";
    }

    If you need to handle ties, the following lists more than 5 if there are ties for the 5th spot:

    my $top = 5;
    my $last_count = -1;
    foreach $w (sort { $count{$b} <=> $count{$a} } keys %count) {
        last if $top == 1 and $last_count != $count{$w};
        $top--;
        $last_count = $count{$w};
        print "$count{$w} $w\n";
    }

    I'll let someone else help you to extract words.
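A minimal sketch of the tie-aware listing described above: it keeps the first N words and then keeps going only while the next word has the same count as the last one kept. The helper name `top_with_ties`, the alphabetical tie-break, and the sample counts are assumptions made for illustration; they are not from the post.

```perl
use strict;
use warnings;

# Return the top $n keys of %$count by descending count, plus any keys
# tied with the $n-th one. Equal counts are ordered alphabetically so the
# result is stable from run to run.
sub top_with_ties {
    my ($count, $n) = @_;
    return () if $n < 1;
    my @sorted = sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
    my @top;
    for my $w (@sorted) {
        # Stop once we hold $n words and the next one has a different (lower) count.
        last if @top >= $n and $count->{$w} != $count->{ $top[-1] };
        push @top, $w;
    }
    return @top;
}

# Made-up counts, purely for illustration.
my %count = (alpha => 4, bravo => 3, charlie => 3, delta => 2,
             echo  => 2, foxtrot => 2, golf => 1);
print "$count{$_} $_\n" for top_with_ties(\%count, 5);
```

With these sample counts the loop returns six words, because foxtrot shares the count of the fifth word.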

      last if $top--;
      In your first example this would evaluate to 'last if true', thus exiting the loop before the first print statement would get called.
      Shouldn't that be last unless $top--; or am I overlooking something?
        aye, you're right.
Re: Top five words by occurrence
by Tanktalus (Canon) on Jul 18, 2005 at 16:46 UTC

    Two things - first, it'd be handy to have your sample input as well. Second, what you probably want to do is continue using split, but then just clear off all trailing punctuation:

    foreach my $wd (@words) {
        $wd =~ s/[[:punct:]]+$//;
        next if length($wd) < 5;
        $count{$wd}++;
    }
    This will remove trailing apostrophes, such as "wha' zup?", but not ones in the middle of words, such as "ain't".

    Oh, I lied - three things. Use strict and warnings. Just do. I added a "my" to the foreach above for strictness purposes - do the same for your other foreach.

    Once you have all this, you should be able to sort on value, foreach my $w (sort { $count{$a} <=> $count{$b} } keys %count) { ... } to get everything in order. What you do with it after that depends on how you want to deal with multiple words in 5th place.

    PS: best use of split with no parameters I've ever seen. I don't think that's even close to your problem, but is, in fact, the best part of your code. Good job. :-)
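Putting the advice above together, here is a rough sketch under `use strict` and `use warnings`: split on whitespace, strip trailing punctuation, skip short words, then sort by count. The helper name `count_words`, its parameters, and the sample lines are assumptions for illustration. This sketch sorts in descending order so the most frequent words come first.

```perl
use strict;
use warnings;

# Count words of at least $min_len characters in the given lines, after
# stripping trailing punctuation from each whitespace-separated token.
sub count_words {
    my ($min_len, @lines) = @_;
    my %count;
    for my $line (@lines) {
        # split ' ' also discards the trailing newline, so no chomp is needed.
        for my $wd (split ' ', $line) {
            $wd =~ s/[[:punct:]]+$//;      # "people." becomes "people"
            next if length($wd) < $min_len;
            $count{$wd}++;
        }
    }
    return \%count;
}

# Sample lines made up for illustration; a real script would pass <> instead.
my $count = count_words(5, "usurpations, and usurpations\n", "the people. The people\n");
for my $w (sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count) {
    print "$count->{$w} $w\n";
}
```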

      Here is the text I am going through:

      The Declaration of Independence of the Thirteen Colonies
      In CONGRESS, July 4, 1776

      The unanimous Declaration of the thirteen united States of America,

      When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.

      We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. --That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, --That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn, that mankind are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty, to throw off such Government, and to provide new Guards for their future security. --Such has been the patient sufferance of these Colonies; and such is now the necessity which constrains them to alter their former Systems of Government. The history of the present King of Great Britain [George III] is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid world.

      He has refused his Assent to Laws, the most wholesome and necessary for the public good.

      He has forbidden his Governors to pass Laws of immediate and pressing importance, unless suspended in their operation till his Assent should be obtained; and when so suspended, he has utterly neglected to attend to them.

      He has refused to pass other Laws for the accommodation of large districts of people, unless those people would relinquish the right of Representation in the Legislature, a right inestimable to them and formidable to tyrants only.

      He has called together legislative bodies at places unusual, uncomfortable, and distant from the depository of their public Records, for the sole purpose of fatiguing them into compliance with his measures.

      He has dissolved Representative Houses repeatedly, for opposing with manly firmness his invasions on the rights of the people.

      He has refused for a long time, after such dissolutions, to cause others to be elected; whereby the Legislative powers, incapable of Annihilation, have returned to the People at large for their exercise; the State remaining in the mean time exposed to all the dangers of invasion from without, and convulsions within.

      He has endeavoured to prevent the population of these States; for that purpose obstructing the Laws for Naturalization of Foreigners; refusing to pass others to encourage their migrations hither, and raising the conditions of new Appropriations of Lands.

      He has obstructed the Administration of Justice, by refusing his Assent to Laws for establishing Judiciary powers.

      He has made Judges dependent on his Will alone, for the tenure of their offices, and the amount and payment of their salaries.

      He has erected a multitude of New Offices, and sent hither swarms of Officers to harass our p
Re: Top five words by occurrence
by socketdave (Curate) on Jul 18, 2005 at 16:54 UTC
    split is just dicing up your input by whitespace. A '$wd =~ s/\W//g;' before your '$count{$wd}++;' will wipe out anything other than letters, digits, and underscores (probably a bad idea if you need to deal with email addresses or URLs). You may also want to '$count{lc($wd)}++;' to ignore capitalization.

    Update:

    As far as just getting the 5 most common words goes, you can run the output of your script through:

    |sort -n|tail -n 5
Re: Top five words by occurrence
by mda2 (Hermit) on Jul 18, 2005 at 16:50 UTC
    To select only words, you can split on \W+ (see non-word characters in perlre):

    ... my @words = split /\W+/; ...

    To show only the top 5 words, you can use this syntax:

    foreach $w ( ( sort { $count{$b} <=> $count{$a} } keys %count ) [0..4] ) { ... }
    .............

    --
    Marco Antonio
    Rio-PM

      Using \W is not a good suggestion at all. "a1" and "123" would be considered words, but not "isn't".
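To make that objection concrete, here is a small sketch of what splitting on `\W+` does to a few tokens. The helper name `split_nonword` and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Split a string on runs of non-word characters, as suggested above.
# \w matches letters, digits, and underscore, so everything else is a separator.
sub split_nonword {
    my ($text) = @_;
    return split /\W+/, $text;
}

my @words = split_nonword("isn't a1 123 self-evident");
print join('|', @words), "\n";    # prints: isn|t|a1|123|self|evident
```

The apostrophe and the hyphen act as separators, while "a1" and "123" survive as words.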
Re: Top five words by occurrence
by jimbojones (Friar) on Jul 18, 2005 at 16:54 UTC
    Hi,

    You might also want to look at Text::ExtractWords. I haven't used it, but it looks like it does what you want.

    - j

Re: Top five words by occurrence
by halley (Prior) on Jul 18, 2005 at 17:08 UTC
    While the code to do this sort of thing is very simple, you might find it even simpler to try Data::Favorites. The timestamping features might be superfluous for your needs.

    --
    [ e d @ h a l l e y . c c ]

Re: Top five words by occurrence
by jacques (Priest) on Jul 18, 2005 at 17:04 UTC
    Have you heard of the Unix program called uniq?

    It might help you here. (You can also pipe its output to a perl one-liner if you need more flexibility.)

Re: Top five words by occurrence
by Anonymous Monk on Jul 19, 2005 at 11:04 UTC
    Is this because I am using split? Is there a better way to go about this?
    Yes. You're splitting on whitespace, and there's no whitespace between uncomfortable and its following comma. Instead of splitting on whitespace, you might want to extract sequences of word characters - instead of
    my @words = split;
    you'd write:
    my @words = /\w{5,}/g;
    with the added benefit of not having to test for word length anymore: you're extracting only words consisting of at least 5 characters.
    I am sure I will start missing words that have apostrophes too.
    Indeed. Extracting word characters will miss words containing apostrophes. Or hyphens. Extracting words from a random text, where the words can contain punctuation is not a trivial thing to do.

    A'mous-Monk
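One possible way to keep words with internal apostrophes or hyphens is to match letters joined by single apostrophes or hyphens. This is only a sketch: the helper name `extract_words` and the pattern are assumptions, it is ASCII-only, and it will still mishandle plenty of real text (quotations in single quotes, numbers, accented letters).

```perl
use strict;
use warnings;

# Extract "words": runs of ASCII letters, optionally joined by single
# internal apostrophes or hyphens. Leading and trailing punctuation is
# never part of a match.
sub extract_words {
    my ($text) = @_;
    return $text =~ /[A-Za-z]+(?:['-][A-Za-z]+)*/g;
}

# Keep only words of at least 5 characters, as in the original question.
my @words = grep { length($_) >= 5 }
            extract_words("It isn't self-evident, is it? --That people.");
print "$_\n" for @words;
```

Because the apostrophe counts toward the length, "isn't" passes the 5-character filter here; whether that is desirable depends on what you want to count.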