http://www.perlmonks.org?node_id=1018564


in reply to compute the occurrence of words

BigGer:

You *can* do it with arrays, but the problem fits a hash much better. Let's take a look at a straightforward array implementation:

WORD: while (my $word = pop @words) {
    # lower-case it
    $word = lc $word;

    # search for the entry
    for my $index (0 .. $#counta) {
        if ($counta[$index][0] eq $word) {
            # when found, update the count and
            # go to the next word
            $counta[$index][1]++;
            next WORD;
        }
    }

    # If there's no entry, add a new entry
    push @counta, [ $word, 1 ];
}

As you can see, we go through the list of words. For each word, we convert it to lower case, then search the array for the entry containing that word. If we find the entry, we increment its count; otherwise we append a new entry.
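
To make that concrete, here's a minimal runnable sketch of the array approach; the sample word list and the result-printing line are my own additions for illustration:

#!/usr/bin/perl
use strict;
use warnings;

my @words = qw(The cat and the dog and the bird);
my @counta;

# Same loop as above: pop consumes @words as we count
WORD: while (my $word = pop @words) {
    $word = lc $word;
    for my $index (0 .. $#counta) {
        if ($counta[$index][0] eq $word) {
            $counta[$index][1]++;
            next WORD;
        }
    }
    push @counta, [ $word, 1 ];
}

# Print each [word, count] pair
print "$$_[0]: $$_[1]\n" for @counta;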

So why does the problem fit a hash much better?

The equivalent hash version looks like this:

WORD: while (my $word = pop @words) {
    # lower-case it
    $word = lc $word;

    # search for the entry. When found, update the
    # count; if not found, create a new entry.
    $counth{$word}++;
}
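
Once %counth is populated, reporting the results is simple, too. A minimal sketch (sorting by descending count is just one choice):

# Report words ordered by descending count; %counth maps
# word => count as built above.
for my $word (sort { $counth{$b} <=> $counth{$a} } keys %counth) {
    printf "%-15s %6u\n", $word, $counth{$word};
}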

The speed advantage of the hash version is significant. Here's a comparison of the array version, the hash version, and a grep-based array version. (I was wondering whether grep might be a faster way to search the array than an explicit linear scan.)

$ perl t.pl

***** Comparing a list of 100 words 10000 times *****

              Rate greparray     array      hash
greparray    997/s        --      -55%      -94%
array       2210/s      122%        --      -86%
hash       16026/s     1507%      625%        --

***** Comparing a list of 1000 words 1000 times *****

              Rate greparray     array      hash
greparray   8.80/s        --      -79%     -100%
array       41.5/s      372%        --      -98%
hash        2070/s    23419%     4887%        --

As you can see, as the number of words increases, so does the speed advantage of the hash version: each array lookup has to scan every entry collected so far, so the array versions do quadratic work overall, while each hash lookup takes roughly constant time. (The code for the test is in the readmore tag....)
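
For reference, the grep-based lookup in do_greparray (in the script below) replaces the inner index loop with a filter over the entire entry list, along these lines:

# Look for an existing entry whose word field matches.
# grep still scans the whole list, so this is a linear
# search with the added cost of building a result list.
my @entries = grep { $word eq $$_[0] } @countg;
if (@entries) {
    $entries[0][1]++;               # found: bump the count
}
else {
    push @countg, [ $word, 1 ];     # not found: add an entry
}

That extra overhead is why greparray comes in slower than even the plain loop in the results above.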

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw(:all);
use Data::Dumper;

# Dump counts? 1=yes, 0=no
my $showall = 1;

my $letters = join("", "A" .. "Z", "a" .. "z");
my @words;
my %counth;
my @counta;
my @countg;

sub do_hash {
    %counth = ();
    for my $word (@words) {
        $word = lc $word;
        $counth{$word}++;
    }
}

sub do_array {
    @counta = ();
    WORD: for my $word (@words) {
        $word = lc $word;
        for my $index (0 .. $#counta) {
            if ($counta[$index][0] eq $word) {
                $counta[$index][1]++;
                next WORD;
            }
        }
        push @counta, [ $word, 1 ];
    }
}

sub do_greparray {
    @countg = ();
    WORD: for my $word (@words) {
        $word = lc $word;
        my @entries = grep { $word eq $$_[0] } @countg;
        if (@entries) {
            $entries[0][1]++;
        }
        else {
            push @countg, [ $word, 1 ];
        }
    }
}

sub make_word {
    # Half the time, reuse a word
    if (@words and 0.5 > rand) {
        return $words[int rand @words];
    }
    my $len = int( rand 3 * rand 3 ) + 1;
    my $word;
    $word .= substr($letters, int(rand length $letters), 1) for 1 .. $len;
    return $word;
}

for my $num_words (100, 1000, 10000, 100000, 1000000) {
    push @words, make_word() for 1 .. $num_words;
    my $num_iterations = 1000000 / $num_words;
    print "\n***** Comparing a list of $num_words words "
        . "$num_iterations times *****\n\n";
    cmpthese($num_iterations, {
        'hash'      => \&do_hash,
        'array'     => \&do_array,
        'greparray' => \&do_greparray,
    });
}

=h1

# For debugging, enable this chunk (by removing the =h1 and =cut lines).
# It'll show any differences between the final run of the three methods.
for my $ar (@counta) {
    my ($word, $cnta) = @$ar;
    my $cnth = $counth{$word};
    my @ag = grep { $word eq $$_[0] } @countg;
    my $cntg = $ag[0][1] // -1;
    next unless $showall or $cnta != $cnth or $cnta != $cntg;
    printf "%-10.10s % 6u % 6u % 6u\n", $word, $cnth, $cnta, $cntg;
}

=cut

...roboticus

When your only tool is a hammer, all problems look like your thumb.