http://www.perlmonks.org?node_id=404110


in reply to Hash lookups, Database lookups, and Scalability

I'd expect most database lookups to scale like O(log(n)) for a lookup. For scalability I'd also mainly be interested in the case where you had enough data that it doesn't all get stored in RAM, which would require a far larger dataset than you have.
  • Comment on Re: Hash lookups, Database lookups, and Scalability

Replies are listed 'Best First'.
Re^2: Hash lookups, Database lookups, and Scalability
by pg (Canon) on Oct 31, 2004 at 17:59 UTC
    "I'd expect most database lookups to scale like O(log(n)) for a lookup"

    Why? Don't simplify the situation. Database search could range from indexed search to a full table scan. As for indexed search, lots of RDBMS system's look up is indeed hash look up, and index is the way you tell database what hashes to create. Those numbers the OP given is not purely for look up, instead it is a mixture of everything including IO and network communication, thus no way they can be used as it is, to measure the performance of the database search algorithm.

    As for scalability, I want to stress that there is a big SCALABILITY here: to be able to do a tandem look up, you have to come up with extra code and special data structure when you go with hash look up, but not for database look up. What if you need look up ability with 5 different forms of keys? with database, it is simply a different query, at most some new indexes; with hash look up, you have to SCALE UP your coding effort and complicate your code.

      "I'd expect most database lookups to scale like O(log(n)) for a lookup"
      Why? Don't simplify the situation. Database search could range from indexed search to a full table scan. As for indexed search, lots of RDBMS system's look up is indeed hash look up, and index is the way you tell database what hashes to create. Those numbers the OP given is not purely for look up, instead it is a mixture of everything including IO and network communication, thus no way they can be used as it is, to measure the performance of the database search algorithm.
      My statement was entirely based on theory.

      The description given was selecting a single value from a table with indexes on both columns. That means that the lookup is happening on an index. For the big-O estimate you have to look at what happens as the dataset gets large. Network communication is a constant factor and I/O is part of the search time.

      My statement about O(log(n)) therefore presumes that you are using an index with a large dataset. The question then becomes what kind of index. There are many kinds of indexes out there. Yes, you can use a hash and get O(1). But default indexes tend to be a hierarchical datastructure that is O(log(n)), because they cooperate better with caching to avoid I/O, leading to a much better constant. (That is why the search algorithm should be chosen with I/0 in mind.)

      About your coding comments, I have no disagreement with that and have said similar things on many occasions myself.