Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Binary Searches on Sorted Text Files

by scrottie (Scribe)
on Mar 14, 2003 at 03:51 UTC ( #242942=snippet: print w/ replies, xml ) Need Help??

Description: CGI applications load code and data into memory, do something with it, and quit. The time needed to do a hash lookup or binary search is far less than the time required to read in a large file - so what if we could do binary searches without ever reading the file into memory at all?
Sorry if this has already been posted - I'm not a regular reader. Flames graciously accepted.

This implementation is pointless for small files, but for large and very large files contained sorted data, only a tiny portion of the file is ever accessed.


  my $word = shift @ARGV;

  open my $fh, 'WikiWikiList';

  sub flup {
    my $start = shift;
    my $stop = shift;
    my $mid = int(($start + $stop) / 2);
    return 0 if $start == $mid or $stop == $mid;
    seek $fh, $mid, 0;
    <$fh>;
    my $line = <$fh>; chomp $line;
    print "debug: $start -> $mid -> $stop  cmp: ", ($word cmp $line), 
+"   $word vs $line\n";
    return flup($start, $mid) if(($word cmp $line) == -1);
    return flup($mid, $stop) if(($word cmp $line) == 1);
    return 1;
  }

  if(flup(0, -s $fh)) {
    print qq{\xb7 Other Wikis: <a href="http://c2.com/cgi/wiki?$word">
+$word on WikiWiki</a><br><br>\n};
  }


Binary searches quickly find items, like a hash. Hashes and binary searches have long competed in databases and like applications. Each has strengths and weaknesses. To implement a binary search, you start repeatedly (iteratively or recursively) narrow the focus of where you're looking for something. You start in middle. If the thing you're looking for comes before the middle, you look between the beginning and the middle. If it comes after the middle, you look between the middle and the end. When you recurse or iterate, the definition of either "end" or "start" changes to be what was the middle - thus narrowing the scope. This isn't quite O(log N), it's actually O(2*log N) - it takes approximately twice as many searches, depending on the length of strings being lookedup, because we aren't looking up exact record boundaries, only byte boundaries. The last few tries will return the same word repeatedly, or will bounce between two words before realizing that the sought after word isn't in there. This only happens in case of failure - it runs at full speed in case of success. This could easily be adapted to do lookups on only the key portion of key-value pairs to find the value part.

Change:

  my $line = <$fh>; chomp $line;

To:

  my $line = <$fh>; chomp $line; (my $line, my $value) = split / /, $l
+ine;

And change:

  return 1;

To:

  return $value;

This was written to implement http://www.c2.com/cgi/wiki?SisterSites on http://wiki.slowass.net/?TinyWiki .
Edited by merlyn to fix code tags (make them lowercase, not uppercase!)
Comment on Binary Searches on Sorted Text Files
Select or Download Code
Re: Binary Searches on Sorted Text Files
by scrottie (Scribe) on Mar 14, 2003 at 03:53 UTC
    Okey, lower case code tags don't work. Noted. Lets try that again:
    my $word = shift @ARGV; open my $fh, 'WikiWikiList'; sub flup { my $start = shift; my $stop = shift; my $mid = int(($start + $stop) / 2); return 0 if $start == $mid or $stop == $mid; seek $fh, $mid, 0; <$fh>; my $line = <$fh>; chomp $line; print "debug: $start -> $mid -> $stop cmp: ", ($word cmp $line), +" $word vs $line\n"; return flup($start, $mid) if(($word cmp $line) == -1); return flup($mid, $stop) if(($word cmp $line) == 1); return 1; } if(flup(0, -s $fh)) { print qq{\xb7 Other Wikis: <a href="http://c2.com/cgi/wiki?$word"> +$word on WikiWiki</a><br><br>\n}; }
      Okey, I'm a complete idiot. Lower case code tags do work, but only the first code tag is honoured. I hate other peoples forum software =P
Re: Binary Searches on Sorted Text Files
by runrig (Abbot) on Mar 14, 2003 at 06:38 UTC
Re: Binary Searches on Sorted Text Files
by pg (Canon) on Mar 14, 2003 at 07:18 UTC
    Tried your code, there is one thing you may want to fix. I tried with this test file:
    a
    b
    c
    d
    e
    f
    g
    h
    j
    k
    l
    
    try to search for p, your snippet prints out lots of warning for using unitialized variable.

    Tried to run without -w, so I can clearly see your debug info. It seems to me that your script didn't realize quickly enough that p actually does not exist.

    I would like to borrow this snippet, be sure to post your new versions.
      pg, right, its stupid in favor of simplicity - probably not the local style - I wouldn't know. I can't figure out the little message sender box - can you email me at scott@slowass.net and move the discussion there? I'm liable to say something really unpopular, which I don't want to do in public ;) Thanks

Back to Snippets Section

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: snippet [id://242942]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (11)
As of 2014-10-23 20:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (129 votes), past polls