Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

Re: compute the occurrence of words

by roboticus (Chancellor)
on Feb 13, 2013 at 15:33 UTC ( #1018564=note: print w/replies, xml ) Need Help??

in reply to compute the occurrence of words


You *can* do it with arrays, but the problem fits a hash much better. Let's take a look at a straightforward array implementation:

WORD: while (my $word = pop @words) { # lower-case it $word = lc $word; # search for the entry for my $index (0 .. $#words) { if ($counta[$index][0] eq $word) { # when found, update the count and # go to the next word $counta[$index][1]++; next WORD; } } # If there's no entry, add a new entry push @counta, [ $word, 1 ]; }

As you can see, we go through the list of words. For each word, we convert it to lower case, then search for the entry containing the word. If we find the entry, we increment it, otherwise we create a new entry.

So why does the problem fit a hash much better?

  • Locating the entry is much easier and faster in a hash, because we can locate the entry by name instead of digging through the array.
  • The data structure is simpler. In an array-based implementation, you have to use arrays in each slot to hold both the word and the count. In the hash implementation, the word is the key so the value only has to hold the count.
  • Finally, for the array version, you need to explicitly create a new entry when the one you're looking for doesn't exist. With a hash, the act of looking up the value creates a new entry for you automatically if it doesn't exist. This is known as autovivification.

The equivalent hash version looks like this:

WORD: while (my $word = pop @words) { # lower-case it $word = lc $word; # search for the entry. When found, update the # count, if not found create new entry. $counth{$word}++; }

The speed advantages of the hash version are significant. Here's a comparison of an array version, hash version and a greparray version. (I was wondering if grep might be a faster way to search the array than a linear search.)

$ perl ***** Comparing a list of 100 words 10000 times ***** Rate greparray array hash greparray 997/s -- -55% -94% array 2210/s 122% -- -86% hash 16026/s 1507% 625% -- ***** Comparing a list of 1000 words 1000 times ***** Rate greparray array hash greparray 8.80/s -- -79% -100% array 41.5/s 372% -- -98% hash 2070/s 23419% 4887% --

As you can see, as the number of words increases, so does the speed advantage of the hash version. (The code for the test is in the readmore tag....)

#!/usr/bin/perl use strict; use warnings; use Benchmark qw(:all); use Data::Dumper; # Dump counts? 1=yes, 0=no my $showall = 1; my $letters = join("", "A" .. "Z", "a" .. "z"); my @words; my %counth; my @counta; my @countg; sub do_hash { %counth = (); for my $word (@words) { $word = lc $word; $counth{$word}++; } } sub do_array { @counta = (); WORD: for my $word (@words) { $word = lc $word; for my $index (0 .. $#counta) { if ($counta[$index][0] eq $word) { $counta[$index][1]++; next WORD; } } push @counta, [ $word, 1 ]; } } sub do_greparray { @countg = (); WORD: for my $word (@words) { $word = lc $word; my @entries = grep { $word eq $$_[1] } @countg; if (@entries) { $entries[0][1]++; } else { push @countg, [ $word, 1 ]; } } } sub make_word { # Half the time, reuse a word if (@words and 0.5 > rand) { return $words[int rand @words]; } my $len = int( rand 3 * rand 3 )+1; my $word; $word .= substr($letters, int(rand length $letters), 1) for 1 .. $ +len; return $word; } for my $num_words (100, 1000, 10000, 100000, 1000000) { push @words, make_word() for 1 .. $num_words; my $num_iterations = 1000000 / $num_words; print "\n***** Comparing a list of $num_words words " . "$num_iterations times *****\n\n"; cmpthese($num_iterations, { 'hash'=>\&do_hash, 'array'=>\&do_array, 'greparray'=>\&do_greparray, }); } =h1 # For debugging, enable this chunk (by removing the =h1 and =cut lines +). # It'll show any differences between the final run of the three method +s. for my $ar (@counta) { my ($word, $cnta) = @$ar; my $cnth = $counth{$word}; my @ag = grep { $word eq $$_[0] } @countg; my $cntg = $ag[0][1] // -1; next if $showall or $cnta != $cnth or $cnta != $cntg; printf "%-10.10s % 6u % 6u % 6u\n", $word, $cnth, $cnta, $cntg; } =cut


When your only tool is a hammer, all problems look like your thumb.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1018564]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (4)
As of 2018-05-27 20:53 GMT
Find Nodes?
    Voting Booth?