Keys composed out of numerical indexes would save space ...
Not so! Using the text from your post as a representative abstract, and building the the 3 hashes using both the original words and integers representing those words I get the following hash sizes:
c:\test>774421-2
Using words: 44501
Using numbers: 40801
Giving a total saving just under 10%. But... If you add in the requirements of the array required to hold the ordered numbers--ordering is a requirement--plus a hash used to unique the words:
Numbers array: 12776
Lookup hash: 8860
The memory requirement is nearly 50% higher.
As for "and possibly search time.". As the hash keys are the concatenation of the string representations of the integers assigned, and hash keys are hashed byte-wise; and the average (English) words length is a little under 6-chars; and the average length of the integers required to represent them a little under 5-chars; the difference in the time taken to actually hash the keys will be very minimal. And as the number of pairs, triples & quads will be the same, the hashes will contain the same number of keys, so the lookup times will be very similar. These are the timings I got using the above sample:
Lookup words: 0.000440120697021484 seconds
Lookup numbers: 0.000699043273925781 seconds
Lookups words: 515; numbers:515
Lookup words: 0.000420093536376953 seconds
Lookup numbers: 0.000321149826049805 seconds
Lookups words: 515; numbers:515
Lookup words: 0.000285863876342773 seconds
Lookup numbers: 0.000767946243286133 seconds
Lookups words: 515; numbers:515
Too wildly varying to consider making an assessment.
The code used to produce the figures:
#! perl -sw
use 5.010;
use strict;
use Time::HiRes qw[ time ];
use Data::Dump qw[ pp ];
use Devel::Size qw[ total_size ];
my $abstract = do{ local $/; <DATA> };
my @words = split ' ', $abstract;
my %words;
$words{ doubles }{ join ' ', @words[ $_ .. $_ +1 ] } = undef
for 0 .. $#words - 1;
$words{ triples }{ join ' ', @words[ $_ .. $_ +2 ] } = undef
for 0 .. $#words - 2;
$words{ quads }{ join ' ', @words[ $_ .. $_ +3 ] } = undef
for 0 .. $#words - 3;
say "Using words: ", total_size \%words;
my( $n, %lookup ) = 0;
my @numbers = map $lookup{ $_ } //= ++$n, @words;
my %numbers;
$numbers{ doubles }{ join ' ', @numbers[ $_ .. $_ +1 ] } = undef
for 0 .. $#numbers - 1;
$numbers{ triples }{ join ' ', @numbers[ $_ .. $_ +2 ] } = undef
for 0 .. $#numbers - 2;
$numbers{ quads }{ join ' ', @numbers[ $_ .. $_ +3 ] } = undef
for 0 .. $#numbers - 3;
say "Using numbers: ", total_size \%numbers;
say "Numbers array: ", total_size \@numbers;
say "Lookup hash: ", total_size \%lookup;
my $start = time;
my $words = 0;
for my $size ( keys %words ) {
exists $words{ $size }{ $_ } and $words++ for keys %{ $words{ $siz
+e } };
}
say "Lookup words: ", time() - $start, ' seconds';
$start = time;
my $numbers = 0;
for my $size ( keys %words ) {
exists $numbers{ $size }{ $_ } and $numbers++
for keys %{ $numbers{ $size } };
}
say "Lookup numbers: ", time() - $start, ' seconds';
say "Lookups words: $words; numbers:$numbers";
__DATA__
There are as I see it two scalability issues in the above code.
The first of these has been addressed well by the various
suggestions to use a sliding window ($i...$i+$n on the array of words)
+.
The second is the keys themselves. You are currently using the actual
+words
in each phrase as a key. This means that searching for all sequences o
+f N-M
words in a text with K word characters (e.g. alpha-numerics) could
concievably take up (N+N+1+...+M)*K characters for its keys alone.
The actual amount depends on the frequency of each particular word
runs: "a a a ..." will obviously use less space than "a b c d e..."
or "a..z b..za c..zab ...".
If you intend to make either N, the range from N...M or K large,
you might want to consider assigning each word in the text a integer i
+d
and composing your key out of the integers rather than the word string
+s
as you are doing now. Keys composed out of numerical indexes
would save space and possibly search time.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.