Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
That MySQL comment bothers me.

With a properly designed BTree, the access time should scale logarithmically with the size of the table. The n*log(n) should be the overall time to build up the whole table (n operations, each of which is log(n)).

Typical data sets range between, say, 1000 and 10**9 rows. Over that range log(n) only changes by a factor of 3, so for all intents and purposes it is just some constant. In real databases the access time is dominated by the time to seek to disk, and so (as my link in my previous post pointed out) the locality of reference of a BTREE can be a good win over hashing because it is more likely to hit pages in cache. Sure, asymptotically that log(n) is infinitely worse than any possible constant, but real life doesn't tend to reach that asymptotic nightmare. (Besides which BTREEs have a good worst-case guaranteed performance. Hashes don't.)

On to other issues.

Using a RAM disk does a lot to remove filesystem overhead. That should, among other things, catch most of the possible speed wins from extracting parallelism in commits (through insert buffers and other tricks). Come to think of it, most of the rest of the filesystem overhead can be removed just by naming your file undef, which tells Berkeley DB that it is dealing with an in memory database, removing the OS's filesystem abstraction layers from the picture.

You do have a good point about doing too much Perl in the processs. Personally I strongly suspect that just changing to a better data structure will win the day. Alternately, if you can, using native hashes is better than pulling in Berkeley DB. But given Perl's memory overhead, I would be concerned about hitting the 32-bit addressing barrier, even if you have enough RAM in the machine to do it. (A wall we will be hearing about more and more often as time goes by...)

In reply to Re (tilly) 2: Performance quandary by tilly
in thread Performance quandary by SwellJoe

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and all is quiet...

    How do I use this? | Other CB clients
    Other Users?
    Others exploiting the Monastery: (6)
    As of 2018-03-20 01:19 GMT
    Find Nodes?
      Voting Booth?
      When I think of a mole I think of:

      Results (247 votes). Check out past polls.