CGI applications load code and data into memory, do something with it,
and quit. The time needed to do a hash lookup or binary search is
far less than the time required to read in a large file - so what if
we could do binary searches without ever reading the file into
memory at all?
Sorry if this has already been posted - I'm not a regular
reader. Flames graciously accepted.
This implementation is pointless for small files, but for large and
very large files contained sorted data, only a tiny portion of the
file is ever accessed.
my $word = shift @ARGV;
open my $fh, 'WikiWikiList';
sub flup {
my $start = shift;
my $stop = shift;
my $mid = int(($start + $stop) / 2);
return 0 if $start == $mid or $stop == $mid;
seek $fh, $mid, 0;
<$fh>;
my $line = <$fh>; chomp $line;
print "debug: $start -> $mid -> $stop cmp: ", ($word cmp $line),
+" $word vs $line\n";
return flup($start, $mid) if(($word cmp $line) == -1);
return flup($mid, $stop) if(($word cmp $line) == 1);
return 1;
}
if(flup(0, -s $fh)) {
print qq{\xb7 Other Wikis: <a href="http://c2.com/cgi/wiki?$word">
+$word on WikiWiki</a><br><br>\n};
}
Binary searches quickly find items, like a hash. Hashes and binary
searches have long competed in databases and like applications.
Each has strengths and weaknesses. To implement a binary search,
you start repeatedly (iteratively or recursively) narrow the
focus of where you're looking for something. You start in middle.
If the thing you're looking for comes before the middle, you
look between the beginning and the middle. If it comes after the middle,
you look between the middle and the end. When you recurse or iterate,
the definition of either "end" or "start" changes to be what was the
middle - thus narrowing the scope.
This isn't quite O(log N), it's actually O(2*log N) - it takes
approximately twice as many searches, depending on the length of
strings being lookedup, because we aren't looking up exact record
boundaries, only byte boundaries. The last few tries will return the
same word repeatedly, or will bounce between two words before realizing
that the sought after word isn't in there. This only happens in case
of failure - it runs at full speed in case of success.
This could easily be adapted to do lookups on only the key portion of key-value
pairs to find the value part.
Change:
my $line = <$fh>; chomp $line;
To:
my $line = <$fh>; chomp $line; (my $line, my $value) = split / /, $l
+ine;
And change:
return 1;
To:
return $value;
This was written to implement http://www.c2.com/cgi/wiki?SisterSites on
http://wiki.slowass.net/?TinyWiki .
Edited by merlyn to fix code tags (make them lowercase, not uppercase!)