compute the occurrence of words

BigGer has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: compute the occurrence of words by Utilitarian (Vicar) on Feb 13, 2013 at 13:34 UTC
Please explain what the line `$data=<FH>` does and why it is there. Secondly, why would you want to enumerate occurrences of unique things in any other way than with a hash? You're new to Perl, so a hash may seem an odd data type, but once you get used to them, hashes are a phenomenally useful tool. `print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."`
Re^2: compute the occurrence of words by vinoth.ree (Monsignor) on Feb 13, 2013 at 13:57 UTC
Hi Utilitarian, yes, I agree that a hash is the best way to find unique things, but BigGer is not trying to find the unique words; his code tries to count the occurrences of each word.
Re^3: compute the occurrence of words by Utilitarian (Vicar) on Feb 13, 2013 at 14:13 UTC
Perhaps I was unclear. To count the number of occurrences of each word in a given text, one thing you have to do is create a list of the unique words present; the other is to associate a count with each of these words. If you can think of a data structure more suited to this purpose than an associative array, or hash, I'd be very interested.
Re: compute the occurrence of words ("under stand arrays") by ww (Archbishop) on Feb 13, 2013 at 14:32 UTC
First: I second Utilitarian's (implied) advice re hashes.

Second: The shebang line is unnecessary under Windows unless you have trailing modifiers (like `-T` or `-w`), and it definitely doesn't need escaped backslashes.

Third: these are a couple of ways to do PART of what your code tries to do -- count the words.

```perl
#!/usr/bin/perl
use 5.016;
use strict;
use warnings;

#1018535

=head NO ARRAY METHOD (everything between here and the =cut line is commented out)

my $count;
while ( <DATA> ) {
    for ( $_ =~ /\w+/g ) {
        $count++;
    }
}
print "COUNT: $count\n";

=cut END of (POD-syntax code which is tantamount to a) multi-line comment

# you could stuff the individual (not necessarily unique) words into an array, this way:
# ie, this uses an ARRAY. Run the code to see what's happened here.

my $i = 0;
my ( $count, @count);
while ( <DATA> ) {
    for ( $_ =~ /(\w+)/g ) {
        push @count, "$_ $i";
        $i ++;
    }
}

for $count(@count) {
    say "Array element is: $count";
}

__DATA__
This these that the and how who writ this code
how now brown cow
the fox jumped into the hencoop
the lazy brown dog was azleep.
```

Note that the highest numeric value from the array method is 25, even though there are 26 words in the `__DATA__` section. Look at the first value of the output and you'll find a zero, which is the way array elements are numbered: the first element is `$array[0]`. If you wish to see the NON-ARRAY method used, move the `=head` and `=cut` lines down to surround the second (and currently operative) code. The data section must, of course, stay outside this hackish method of doing a multi-line comment in a script. BUT, to identify and count unique words using the commonly recommended approach, use a hash.
And finally, if by any chance you mean you're seeing things like `HASH(0x213F7)` when you mention "numeric values" (though I don't see why), what you're seeing is effectively a pointer to the memory where the actual values are stored. To understand that (and how to get the actual values), you'll need to study referencing and dereferencing in our estimable Tutorials section.

If you didn't program your executable by toggling in binary, it wasn't really programming!
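A minimal sketch of where output like `HASH(0x213F7)` comes from, assuming a hypothetical `%count` hash (the names and values here are made up for illustration and are not from the original question). Printing a reference shows its type and address; dereferencing it gets at the stored values.

```perl
use strict;
use warnings;

# Hypothetical data, not taken from the original question.
my %count = ( the => 4, brown => 2 );

# A reference to the hash: a scalar that points at %count.
my $ref = \%count;

# Printing the reference itself shows the type and a memory address,
# e.g. HASH(0x55d0c8a3f2b8). The address differs from run to run.
print "$ref\n";

# Arrow dereference of a single element: prints 4.
print "$ref->{the}\n";

# Dereferencing the whole hash: prints brown,the.
print join( ',', sort keys %$ref ), "\n";
```

If a script prints `HASH(0x...)` where a count was expected, the usual cause is printing the reference instead of one of the two dereferenced forms above.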
Re^2: compute the occurrence of words ("under stand arrays") by BigGer (Novice) on Feb 13, 2013 at 15:14 UTC
That's brilliant. I have both your examples working, thanks. I will take your advice and read up on using hashes. Thanks again. G
Re: compute the occurrence of words by roboticus (Chancellor) on Feb 13, 2013 at 15:33 UTC
BigGer: You can do it with arrays, but the problem fits a hash much better. Let's take a look at a straightforward array implementation:

```perl
WORD: while (my $word = pop @words) {
    # lower-case it
    $word = lc $word;

    # search for the entry (over the entries collected so far)
    for my $index (0 .. $#counta) {
        if ($counta[$index][0] eq $word) {
            # when found, update the count and
            # go to the next word
            $counta[$index][1]++;
            next WORD;
        }
    }

    # If there's no entry, add a new entry
    push @counta, [ $word, 1 ];
}
```

As you can see, we go through the list of words. For each word, we convert it to lower case, then search for the entry containing the word. If we find the entry, we increment it, otherwise we create a new entry.

So why does the problem fit a hash much better?

Locating the entry is much easier and faster in a hash, because we can locate the entry by name instead of digging through the array.

The data structure is simpler. In an array-based implementation, you have to use arrays in each slot to hold both the word and the count. In the hash implementation, the word is the key, so the value only has to hold the count.

Finally, for the array version, you need to explicitly create a new entry when the one you're looking for doesn't exist. With a hash, the act of incrementing the value creates a new entry for you automatically if it doesn't exist. This is known as autovivification.

The equivalent hash version looks like this:

```perl
WORD: while (my $word = pop @words) {
    # lower-case it
    $word = lc $word;

    # search for the entry. When found, update the
    # count, if not found create new entry.
    $counth{$word}++;
}
```

The speed advantages of the hash version are significant. Here's a comparison of an array version, a hash version and a greparray version. (I was wondering if grep might be a faster way to search the array than a linear search.)
```
$ perl t.pl
*** Comparing a list of 100 words 10000 times ***
            Rate greparray     array      hash
greparray  997/s        --      -55%      -94%
array     2210/s      122%        --      -86%
hash     16026/s     1507%      625%        --
*** Comparing a list of 1000 words 1000 times ***
            Rate greparray     array      hash
greparray 8.80/s        --      -79%     -100%
array     41.5/s      372%        --      -98%
hash      2070/s    23419%     4887%        --
```

As you can see, as the number of words increases, so does the speed advantage of the hash version. (The code for the test is in the readmore tag....)

...roboticus

When your only tool is a hammer, all problems look like your thumb.
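A comparison table in this format can be produced with the core `Benchmark` module's `cmpthese` function. What follows is only a sketch of how such a test might be set up, not roboticus's actual `t.pl`: the vocabulary, the word list and the sub names `count_array` and `count_hash` are assumptions made up for illustration, and the greparray variant is left out.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: 100 words drawn from a small made-up vocabulary.
my @vocab = qw(the fox jumped into hencoop lazy brown dog was asleep);
my @words = map { $vocab[ $_ % @vocab ] } 0 .. 99;

# Array version: linear search through [word, count] pairs.
sub count_array {
    my @counta;
    WORD: for my $word (@_) {
        for my $entry (@counta) {
            if ( $entry->[0] eq $word ) {
                $entry->[1]++;
                next WORD;
            }
        }
        # No entry found, so add one.
        push @counta, [ $word, 1 ];
    }
    return \@counta;
}

# Hash version: the word itself is the key.
sub count_hash {
    my %counth;
    $counth{$_}++ for @_;
    return \%counth;
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese( -1, {
    array => sub { count_array(@words) },
    hash  => sub { count_hash(@words) },
} );
```

The rates depend on the machine and on how many distinct words the list contains, so the numbers from this sketch will not match the table above.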
Re: compute the occurrence of words by tmharish (Friar) on Feb 13, 2013 at 15:10 UTC
Rewriting all of what's mentioned above in a way that I think might help a person totally new to Perl:

```perl
use strict ;
use warnings ;

my %count ;

while( my $line = <DATA> ) {  # Read lines from DATA, you can
                              # replace this with a file handle ( FH ).

    # First break down a single line into words -
    # We assume that words are white space separated.
    # To include others such as '-' you would replace
    # /\s+/ with /[\s-]+/
    my @words_in_this_line = split( /\s+/, $line ) ;

    # Now we flip through the words within a single line.
    foreach my $word ( @words_in_this_line ) {
        # Lowercase it to ensure that repeats in different
        # cases are not recounted.
        $word = lc( $word ) ;

        # Check if there is a number contained in this word,
        # we move to the 'next' iteration if there is
        # Notice that the condition is after the statement
        # that is executed if the condition is True.
        next if( $word =~ /\d+/ ) ;

        if( defined( $count{ $word } ) ) {
            # If I have seen the word before then increment my count.
            $count{ $word } ++ ;
        }
        else {
            # What if I have never seen this word - Then I need to set count as 1;
            $count{ $word } = 1 ;
        }
    } # End of looping through words.
} # End of looping through lines in file.

# Your - sort { $count{$b} <=> $count{$a} || $a cmp $b} keys %count
# Lets break it up:

# We stored it so the key is the word and the value the count
# This ordering was intentional so as to ensure that we can 'quickly'
# figure out if we have seen a word before.
my @uniq_words_in_file = keys %count ;

# We use the brilliant sort function that allows you to tell it what the
# comparison should be.
@uniq_words_in_file = sort( { $count{$b} <=> $count{$a} || $a cmp $b } @uniq_words_in_file ) ;

# This one bit brings out the beauty of Perl:
#   We are passing a Subroutine to the subroutine 'sort'
#   'sort' will use this sub to compare elements during the sort.
#   notice that <=> will return -1, 0 or 1 and when
#   $count{ $b } is equal to $count{ $a }, '<=>' will return 0.
#
#   Now every expression in Perl evaluates to a value and Perl uses
#   lazy (short-circuit) evaluation.
#   What this means is that as it evaluates a boolean 'OR' it will
#   stop evaluating expressions after it finds a true value
#   ( because True OR anything is always True )
#
#   We use this to additionally compare $a and $b as strings this time
#   when the counts are equal.

# And now the printing.
foreach my $word ( @uniq_words_in_file ) {
    print "'$word'\tOccurred\t$count{ $word }\ttimes\n";
}

__DATA__
This these that the and how who writ this code 1
how now brown cow 1asdf 23
the fox jumped into 123 the hencoop
the lazy brown 2134 dog was azleep.
```

And now the code again with no comments:

```perl
use strict ;
use warnings ;

my %count ;

while( my $line = <DATA> ) {
    my @words_in_this_line = split( /\s+/, $line ) ;
    foreach my $word ( @words_in_this_line ) {
        $word = lc( $word ) ;
        next if( $word =~ /\d+/ ) ;
        if( defined( $count{ $word } ) ) {
            $count{ $word } ++ ;
        }
        else {
            $count{ $word } = 1 ;
        }
    } # End of looping through words.
} # End of looping through lines in file.

my @uniq_words_in_file = keys %count ;
@uniq_words_in_file = sort( { $count{$b} <=> $count{$a} || $a cmp $b } @uniq_words_in_file ) ;

foreach my $word ( @uniq_words_in_file ) {
    print "'$word'\tOccurred\t$count{ $word }\ttimes\n";
}

__DATA__
This these that the and how who writ this code 1
how now brown cow 1asdf 23
the fox jumped into 123 the hencoop
the lazy brown 2134 dog was azleep.
```
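The way `<=>`, `||` and `cmp` combine in that sort block can be seen in isolation with a few values. This is a small sketch with a hypothetical `%count` hash, chosen so that two words tie on count.

```perl
use strict;
use warnings;

# Hypothetical counts chosen so that 'brown' and 'how' tie.
my %count = ( the => 4, brown => 2, how => 2, cow => 1 );

# <=> compares numbers: prints 1 because the left side is larger.
print 4 <=> 2, "\n";

# On a tie <=> gives 0, which is false: prints 0.
print 2 <=> 2, "\n";

# Because 0 is false, || moves on to the string comparison: prints -1.
print( ( 2 <=> 2 ) || ( 'brown' cmp 'how' ), "\n" );

# Highest count first, ties broken alphabetically.
my @sorted = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;

# Prints: the brown how cow
print "@sorted\n";
```

When the counts differ, `<=>` returns 1 or -1, both of which are true, so the `cmp` on the right-hand side is never evaluated.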
Re^2: compute the occurrence of words by BigGer (Novice) on Feb 13, 2013 at 15:20 UTC
Thanks for taking the time to comment the code; it's really helpful and makes for a clearer understanding. G
Re: compute the occurrence of words by vinoth.ree (Monsignor) on Feb 13, 2013 at 14:05 UTC
*Also at the moment the code is returning numeric values which I need to exclude.* Then what do you expect from this code? It gives each word and its count.
Re^2: compute the occurrence of words by BigGer (Novice) on Feb 13, 2013 at 14:17 UTC
The line `$data = <FH>;` is an error and I have removed it. I am looking to count the occurrences of each word used in a document, but excluding numbers. Hope that clarifies my question. G
Re^3: compute the occurrence of words by Tux (Canon) on Feb 13, 2013 at 14:27 UTC
In which case you will also have to define "numbers" :) Integers? Floats? E-notation? Roman? Only ASCII digits, or also other Unicode numerals? Let me assume simple integers and floats represented in ASCII (no triad-sep, radix-sep = '`.`'), so valid numbers include `1234` and `0.23`, but not `DCVII`, `2.34e12` or `1,234,567.00`:

```perl
my %count;
while (<FH>) {
    $count{lc $_}++ for grep { !m{^[0-9]+(\.[0-9]+)?$} } m/\w+/g;
}
```

For a complete regular expression to match integers and reals, I'd like to refer to Regexp::Common (see `$RE{num}`).

update: /me just realized that this is overly complex: `\w+` can only match integers without a triad-sep, because `.` is not included in `\w`. That reduces the loop line to

```perl
$count{lc $_}++ for grep { !m{^[0-9]+$} } m/\w+/g;
```

Enjoy, Have FUN! H.Merijn
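To see what that definition of "number" accepts and rejects, here is a small sketch that runs the pattern from the post over the sample strings it mentions. The variable names and the sentence `'pi is 3.14'` are made up for illustration.

```perl
use strict;
use warnings;

# The "simple integer or float in ASCII" pattern from the post above.
my $number = qr{^[0-9]+(\.[0-9]+)?$};

# The sample strings mentioned in the post.
my @samples = ( '1234', '0.23', 'DCVII', '2.34e12', '1,234,567.00' );

for my $s (@samples) {
    printf "%-14s %s\n", $s, $s =~ $number ? 'number' : 'not a number';
}

# \w+ never matches '.', so a float is split into two integer tokens.
my @tokens = 'pi is 3.14' =~ /\w+/g;

# Prints: pi is 3 14
print "@tokens\n";
```

Only `1234` and `0.23` are reported as numbers. The last two lines show why the simplified pattern in the update is enough once the text has been tokenized with `\w+`: each half of a float arrives as its own all-digit token.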
Re^4: compute the occurrence of words by BigGer (Novice) on Feb 13, 2013 at 14:40 UTC
Re^3: compute the occurrence of words by AnomalousMonk (Archbishop) on Feb 13, 2013 at 14:25 UTC
*... count ... but excluding numbers.* This just confuses me. Can you provide a small input list of words and a corresponding output list showing the non-numeric 'count' you desire for the given input?

