in reply to Data structures in Perl. A C programmer's perspective.

Be warned that with modern CPUs, the conventional wisdom that linked lists should be preferred to vectors when performing many insertions and deletions in the middle of the list is almost always wrong. From bulldozer blog:

The vector still kills the linked list in this test. And thatís all because vector stores its data in a contiguous block of memory. The vector has so much better performance (and as you increase the size of your data structure, the difference in performance becomes much more pronounced) because it accesses memory linearly.

The memory is so slow when you access it randomly, that the penalty for cache misses almost always dominates your CPU usage time. If you care at all about performance, the main memory should be considered a sequential-access device. You have been lied to all your life; RAM is not a random-access memory device!

This has been true in the last few years and in the future it will only become more so, barring any true revolutions in hardware technology (think Ansible!) or a fundamental change in computing model (think abandoning the von Neumann model,) of which there is no sign yet (and it would take decades, even if we had an alternative model.) This particular performance killer will get worse and worse with time.


References Added Later

  • Comment on Re: Data structures in Perl. A C programmer's perspective.

Replies are listed 'Best First'.
Re^2: Data structures in Perl. A C programmer's perspective.
by BrowserUk (Pope) on Sep 07, 2013 at 00:33 UTC

    See also The man that knows.

    Watch it 3 times. Then think about it for a week; then watch it again.

    That guy packs so much salient information in each non-committal sentence; he should be cited as much as an accomplished politician as he is a programmer.

    Damn! I would love a 2-hour argument with that guy. He would deflate me completely, but what a learning experience.

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: Data structures in Perl. A C programmer's perspective.
by bulk88 (Priest) on Sep 07, 2013 at 06:35 UTC
    I'll simplify it. RAM is somewhat random access. It is better than Flash memory with Flash's multi-KB minimum erase size. But RAM has a huge delay for the 1st 8 byte random read/write. The next 4 to 8 (depending on RAM design), 8 byte blocks, are free. If after the 1st 8 byte block, you request another block, before you get 1st block of 2nd random request, the time to have delivered blocks 2-4 of the 1st random read will have gone by with the RAM bus being idle. You might as well have listened for blocks 2 thru 4 and put them in CPU cache speculatively, otherwise you just wasted RAM bus time.

    Core 2 CPUs have a cacheline of 64 bytes. That design hints that the CPU will read sequentially all the data that it can (16/32/64 bytes) so the RAM bus doesn't go idle. It also means to read 1 random byte from RAM is to read atleast 16 bytes from ram. You might as well use bytes 2 to 16 for something as a C programmer.
Re^2: Data structures in Perl. A C programmer's perspective.
by code-ninja (Scribe) on Sep 07, 2013 at 03:47 UTC
    I've always been told that linked lists are faster (in algorithms course in college). I never really understood why people say that when arrays give indexed access to its elements.I mean its cool that linked lists gives dynamic memory allocation and all but the trade-off is too high. As davido points out, everything will require O(n) time but arrays can do some stuff in O(1) time (fetch an element at an index). Trees and other DSs maybe useful (Perl provides so many already) but linked lists are just a learning exercise for a coding craved mind.

    PS: RAM is not a random-access memory device! I'll need something in written to digest that! :P

      My post was really about order of growth for various tasks given two different data structures.

      An array stores its data linearly, in contiguous memory. If you want to find the "nth" element, you go straight to it (it's a simple mathematical computation, internally). But if you want to insert between "n" and "n+1" you must shift everything from n+1 through the end of the array out one spot, and hopefully not have to copy the entire array to a larger block of contiguous memory.

      A linked list stores its data all over the place. Each element just points to the location of the next one. Finding the 'nth' element means starting at the top and counting until you get there. But once you're there, inserting between n and n+1 is simply a matter of changing where a couple pointers point.

      So an array has O(1) lookups, and O(n) insertions and deletions. A linked list has O(n) lookups, and O(1) insertions/deletions. But big-O is concerned with orders of growth. There are many other factors that don't get considered when looking at orders of growth of computational complexity.

      But I also mentioned that Perl's arrays, being implemented at the "guts" level, are very fast. And most linked list implementations in Perl are implemented in Perl code, which is comparatively slow.

      Add to that the things that eyepopslikeamosquito mentioned, and it makes it very hard for linked lists to be the data structure of choice. ...even when Perl's arrays have so much behind the scenes overhead going on, they're still quite hard to beat.

      That doesn't mean that we have to throw out the window everything we learned in comp sci. It just means that technology has found ways to bend the rules-of-thumb. A binary tree is still an excellent data structure for certain uses. Linked lists can still make sense for some tasks. A trie is hard to beat for certain types of lookups. I guess what we have to keep in mind is that now more than ever one must fully understand the nature of his data, the usage patterns expected, and the tricks the hardware and compilers can play.

      I don't know too much about the guts-level implementation of Perl's hashes, but it's my understanding that the implementation hashes the key to derive a bucket, and that when there's a collision (more than one item in a bucket), the buckets' contents are stored in linked lists.


      RAM is not a random-access memory device! I'll need something in written to digest that!
      The (still relevant) and classic paper What Every Programmer Should Know About Memory (pdf) by Ulrich Drepper should help you digest how memory really works on modern hardware. Or maybe it will give you indigestion. :) The summary is that cache misses on modern CPUs are mind-bogglingly expensive. Data locality is crucial.