Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
No, no, no. Perl's strategy for growing data structures makes the amortized average insert time O(1). So growing a data structure of size n is O(n), not O(n log(n)).

For example consider a hash in the worst case scenario, where you have just done a hash split. Then every element has been moved once. Half of them were around for the previous hash split, so have been around twice. Half of them were around the previous time, and so have been around 3 times. And so on. So the total amount of moving is

n + n/2 + n/4 + ... = (1 + 1/2 + 1/4 + ...) * n = 2n = O(n)
thanks to the well-known geometric series.

With arrays the logic is similar except that each time we only grow an extra 1/5, and the overhead turns out to be 6n.

The result is that building either the hash or the array is O(n). But sorting the array is O(n log(n)). Therefore the hash scales better.

But (very big but) note that log(n) grows very slowly. If the data structures are big enough that they live on disk, then accessing your hash involves a disk seek, while merge sort involves accessing data on disk sequentially. Disk seeks are painfully slow, while accessing data on disk sequentially is orders of magnitude more efficient. That means that for data that lives on disk, the array approach is faster. Given the necessary numbers, generally by a couple of orders of magnitude.

But (very small but) hashes still scale better than arrays. So for very big data structures, hashing will again be faster than the array approach. However you are extremely unlikely to ever encounter a data set that large, if you did it probably wouldn't fit on your hard drive, and processing it sequentially by either approach would be too slow anyways. So you will want to parallelize your data processing. And when you do that, you will find that merge sort parallelizes better than hash lookups so arrays still win.

Thereby demonstrating that the theoretical big-O is not always the end of the story.

A final random note. Contrary to your final comment, it is the array that benefits more from duplicates. That is because if you're slightly clever in your merge sort, then eliminating lots of duplicates will reduce the size of your large passes, speeding up the sort. Hashing, by contrast, goes at the same exact speed but will take up less memory. (Until you seek to disk...)

Update: Lots of minor edits to improve wording.


In reply to Re^2: mathematical proof by tilly
in thread mathematical proof by donkeykong

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others about the Monastery: (13)
    As of 2014-09-02 17:32 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      My favorite cookbook is:










      Results (25 votes), past polls