Data structures in Perl. A C programmer's perspective.

by code-ninja (Scribe)
on Sep 06, 2013 at 10:48 UTC

Dear everyone,

Since this post I ended up in a job after all, designated as a Software Engineer/iOS Software Developer. It's cool because I get to work on a Mac, and since I want to pursue my master's after getting some experience (like 1-1.5 years), the skills I learn will be helpful. I have been kept on the "bench" as my Mac ("my Mac" as in the Apple Mac Mini the company has provided me) has not been "configured" yet, so I pass my time in the office trying to get hold of data structures in Perl. I cannot code anything for iOS without a Mac, and I hate sitting idle.

Anyway, being a C programmer, I learned data structures the hard way... i.e. by doing, making mistakes, hating myself for them, and retrying until I got it right (it was college and I had no social interactions whatsoever, so it was just me and my laptop... :D). Applying a similar strategy to Perl, I tried to take the concepts I learned in C and apply them in Perl. I guess, C being the root of most languages, it will be easier to understand "PerlDSC" if we look at it from a C perspective. I'll assume that programmers reading this have at least got their hands wet with pointers and linked lists in C and have some programming experience in Perl.

Doubly Linked Lists

To be frank, Perl's arrays are faster and more efficient (thanks to splice and friends) than hand-rolled linked lists, but there are situations where you might want a more complex data structure, like a tree, to represent your data. The good news is that CPAN has many libraries we can use to make our lives easier, but for those (like me) who like to rip things apart, it is a great exercise to ape the functions of a library just to understand how things work. So I aped the List::DoubleLinked library to make doubly linked lists.

An array is a collection of data of the same type in which the elements sit right next to each other in memory, i.e. it is a linear data structure with adjacent elements: if an element of an array of 2-byte integers is at address 1002, the next element will be at 1004. Got it? Good.

A linked list, on the other hand, drops the adjacency requirement by storing, with each element, a pointer to the next element in the list. Why? It helps with memory management because C arrays are static (NOT IN PERL!), and you only have to pass a reference to the start of the list, not the entire list (as in the case of arrays), to manipulate it. Every element in a linked list is called a node; every node holds data and a pointer to the next node. A doubly linked list extends this concept by also keeping a pointer to the previous node, which makes it possible to traverse the list in both directions.

In C we create a doubly linked list as follows (I give the struct as well as the function to create a node):

struct node {
    int data;
    struct node *next;
    struct node *prev;
};

struct node *start = NULL; // pointer to start of list.
struct node *end = NULL;   // pointer to end of list.

struct node *create(int data)
{
    struct node *temp = (struct node *)malloc(sizeof(struct node));
    temp->data = data;
    temp->next = temp->prev = NULL;
    return temp;
}

struct node *push(struct node *start, struct node *temp)
{
    if (start == NULL) {
        // this node is the first node, assign it as is.
        start = temp;
    } else {
        temp->next = start;
        start->prev = temp;
        start = temp;
    }
    return start;
}
The create function creates a node with its data field set to the user-passed data and its "prev" (pointer to previous node) and "next" pointers set to NULL. The push function inserts the node at the start of the list (assuming the list grows towards the left if you read from left to right).

The code is quite intuitive and very easy to understand (that is to say, if you know about pointers and structures and have played with them for a while). In Perl, focusing on OO Perl, the way to create and add a node is as follows:

package doublyLinkedList;

sub new {
    my ($class, @elements) = @_;
    my $self = bless { start => undef }, $class;
    $self->push(@elements);
    return $self;
}

sub push {
    my ($self, @elements) = @_;
    for my $element (@elements) {
        my $newNode = {
            element => $element,
            next    => undef,
            prev    => undef,
        };  # newNode becomes an anonymous hash ref.
        if (not defined $self->{start}) {
            # the list is empty, assign to start as is.
            $self->{start} = $newNode;
        }
        else {
            # else make the new node point to start and then
            # assign the new node as the new start of the list.
            $newNode->{next} = $self->{start};
            $self->{start}{prev} = $newNode;
            $self->{start} = $newNode;
        }
    }
    return;
}

The new subroutine is a constructor. The bless function ties an anonymous hash to the class name, thereby defining an object of the class (or package, in Perl lingo). The push subroutine creates a new node by taking a reference to an anonymous hash. Contrast this with C, where we explicitly allocate memory (recall malloc): Perl implicitly allocates enough memory to contain the hash, and the variable $newNode points to (read: references) this hash. The rest of the logic remains the same, and those cryptic-looking braces and if conditions are just syntactic sugar.

Extending this to pop operations, C can achieve pop as follows:

void pop(struct node **start)
{
    if (*start == NULL) {
        printf("The list is empty!\n");
        return;
    }
    struct node *temp = *start;
    *start = temp->next;
    if (*start != NULL)
        (*start)->prev = NULL; // guard: popping the last node leaves *start NULL.
    temp->next = NULL;
    free(temp);
    // or maybe we can return the element popped before freeing temp, or whatever.
}
Cool, ain't it? You check if the list is empty; if not, you save the start pointer in a temporary variable and point start at the next node (which becomes the new top node... think stacks!). We then set the new start's "prev" pointer to NULL, indicating that the node start now points to is the new start of the list. (Note that pop takes a struct node ** here: the original by-value version would free the node while the caller's start still pointed at it.)

Perl way:

sub pop {
    my $self = shift;
    my $ret = $self->{start};
    return if not defined $ret;  # return if the list is empty.
    $self->{start} = $ret->{next};
    # guard: after popping the last node, start is undef, and assigning
    # through it would autovivify a spurious hash ref.
    $self->{start}{prev} = undef if defined $self->{start};
    $ret->{next} = undef;
    return $ret->{element};
}
This is cool too: the logic is the same and the only difference is the syntactic sugar. Perl automatically takes care of garbage collection, so there is no need to free explicitly. IMO, if you're comfortable playing with data structures in C, it is then just a matter of learning Perl's syntax.

The basic data structures in Perl (arrays and hashes and their derivatives (AoH, HoA, etc.)) are enough most of the time, but what if we have to implement a red-black tree, or some other complex data structure? C is my standard choice for those because handling pointers in C is blissful, but Perl's OO paradigms and syntactic sugar (like condensed regexes and such) make it a useful language for handling such complexity. Also, as you can see, the C and Perl code are essentially the same if you ignore the syntax.

Just for the sake of practice and lucidity in writing Perl, I implemented the whole doubly linked list library... it is almost a direct rip-off of the aforementioned library, but hey, I'm getting detailed know-how about how things work in Perl.

Re: Data structures in Perl. A C programmer's perspective.
by RichardK (Priest) on Sep 06, 2013 at 15:34 UTC

    Really, I just don't understand what point you are trying to make.

    Perl already has functions to manipulate arrays and will do everything you need without having to write all that code. see push pop shift unshift and splice.

    It's usually a mistake to mechanically translate C structures into perl, as perl has many more features available (hashes for one). It's OK to write a linked list in perl as a learning exercise but I wouldn't use it for anything serious.

      :-)

      Well, I'm not trying to make a point; that's the point. It's just that I was sitting idle, and I've been dabbling with Perl for almost a month now, so I was just trying to get my understandings and learnings commented on :-).

      Yeah, accepted that Perl has many more features, and as you said, this was a learning exercise... I was just trying to relate what I already know with what is new. Kinda like mnemonics, if you will.

Re: Data structures in Perl. A C programmer's perspective.
by davido (Archbishop) on Sep 06, 2013 at 17:08 UTC

    You might enjoy Mastering Algorithms with Perl, from O'Reilly. It's a little old, but good algorithms and data structures are mostly timeless. The book does a nice job of presenting data structures and algorithms in Perl... even if the Perl spoken is from the early 2000s.

    Yes, Perl has arrays that can do many of the things you might do with a doubly linked list: push, pop, shift, unshift, and splice. But there are tradeoffs in everything.

    A linked list can do inserts in O(1) time. Arrays in Perl can grow substantially without needing to be moved around in memory, but focusing on splice, it's obvious that an insert anywhere aside from the top or bottom of the array is going to consume O(n) time. If you insert a new element between $a[5] and $a[6], you have to shift element 6 and everything after it to make room, and that's a linear-time operation whether it's being done in Perl or C.

    On the other hand, lookups in a linked list are O(n). You can't binary search a linked list; you don't have random access. Every search is O(n) unless you maintain some cross-reference index, which is basically just throwing another data structure at the problem.

    The main reason that arrays are usually chosen over linked lists in Perl is because arrays are implemented entirely "in Perl's guts", whereas most linked list implementations are in Perl code. So unless the characteristics of the problem are some edge case where you're inserting in the middle of the list frequently but rarely doing searches, arrays will be faster to use.

    Trees, on the other hand, are really useful, even in Perl.


    Dave

Re: Data structures in Perl. A C programmer's perspective.
by eyepopslikeamosquito (Canon) on Sep 06, 2013 at 22:42 UTC

    Be warned that with modern CPUs, the conventional wisdom that linked lists should be preferred to vectors when performing many insertions and deletions in the middle of the list is almost always wrong. From bulldozer blog:

    The vector still kills the linked list in this test. And that’s all because vector stores its data in a contiguous block of memory. The vector has so much better performance (and as you increase the size of your data structure, the difference in performance becomes much more pronounced) because it accesses memory linearly.

    The memory is so slow when you access it randomly, that the penalty for cache misses almost always dominates your CPU usage time. If you care at all about performance, the main memory should be considered a sequential-access device. You have been lied to all your life; RAM is not a random-access memory device!

    This has been true for the last few years, and in the future it will only become more so, barring any true revolutions in hardware technology (think ansible!) or a fundamental change in computing model (think abandoning the von Neumann model), of which there is no sign yet (and it would take decades, even if we had an alternative model). This particular performance killer will get worse and worse with time.


      See also The man that knows.

      Watch it 3 times. Then think about it for a week; then watch it again.

      That guy packs so much salient information into each non-committal sentence; he should be cited as much as an accomplished politician as he is a programmer.

      Damn! I would love a 2-hour argument with that guy. He would deflate me completely, but what a learning experience.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
      I've always been told that linked lists are faster (in my algorithms course in college). I never really understood why people say that when arrays give indexed access to their elements. I mean, it's cool that linked lists give dynamic memory allocation and all, but the trade-off is too high. As davido points out, everything will require O(n) time, while arrays can do some things in O(1) time (fetch an element at an index). Trees and other data structures may be useful (Perl provides so many already), but linked lists are just a learning exercise for a coding-craved mind.

      PS: RAM is not a random-access memory device! I'll need something in writing to digest that! :P

        RAM is not a random-access memory device! I'll need something in written to digest that!
        The (still relevant) and classic paper What Every Programmer Should Know About Memory (pdf) by Ulrich Drepper should help you digest how memory really works on modern hardware. Or maybe it will give you indigestion. :) The summary is that cache misses on modern CPUs are mind-bogglingly expensive. Data locality is crucial.

        My post was really about order of growth for various tasks given two different data structures.

        An array stores its data linearly, in contiguous memory. If you want to find the "nth" element, you go straight to it (it's a simple mathematical computation, internally). But if you want to insert between "n" and "n+1" you must shift everything from n+1 through the end of the array out one spot, and hopefully not have to copy the entire array to a larger block of contiguous memory.

        A linked list stores its data all over the place. Each element just points to the location of the next one. Finding the 'nth' element means starting at the top and counting until you get there. But once you're there, inserting between n and n+1 is simply a matter of changing where a couple pointers point.

        So an array has O(1) lookups, and O(n) insertions and deletions. A linked list has O(n) lookups, and O(1) insertions/deletions. But big-O is concerned with orders of growth. There are many other factors that don't get considered when looking at orders of growth of computational complexity.

        But I also mentioned that Perl's arrays, being implemented at the "guts" level, are very fast. And most linked list implementations in Perl are implemented in Perl code, which is comparatively slow.

        Add to that the things that eyepopslikeamosquito mentioned, and it makes it very hard for linked lists to be the data structure of choice. ...even when Perl's arrays have so much behind the scenes overhead going on, they're still quite hard to beat.

        That doesn't mean that we have to throw out the window everything we learned in comp sci. It just means that technology has found ways to bend the rules-of-thumb. A binary tree is still an excellent data structure for certain uses. Linked lists can still make sense for some tasks. A trie is hard to beat for certain types of lookups. I guess what we have to keep in mind is that now more than ever one must fully understand the nature of his data, the usage patterns expected, and the tricks the hardware and compilers can play.

        I don't know too much about the guts-level implementation of Perl's hashes, but it's my understanding that the implementation hashes the key to derive a bucket, and that when there's a collision (more than one item in a bucket), the bucket's contents are stored in a linked list.


        Dave

      I'll simplify it. RAM is somewhat random access. It is better than Flash memory, with Flash's multi-KB minimum erase size. But RAM has a huge delay for the first 8-byte random read/write. The next 4 to 8 (depending on RAM design) 8-byte blocks are nearly free. If, after the first 8-byte block, you request another random block, then by the time you get the first block of the second request, the time it would have taken to deliver blocks 2-4 of the first read has gone by with the RAM bus idle. You might as well have listened for blocks 2 through 4 and put them in the CPU cache speculatively; otherwise you just wasted RAM bus time.

      Core 2 CPUs have a cache line of 64 bytes. That design hints that the CPU will read sequentially all the data that it can (16/32/64 bytes) so the RAM bus doesn't go idle. It also means that to read 1 random byte from RAM is to read at least 16 bytes from RAM. You might as well use bytes 2 to 16 for something, as a C programmer.
Re: Data structures in Perl. A C programmer's perspective.
by sundialsvc4 (Abbot) on Sep 09, 2013 at 14:50 UTC

    When people talk about linked lists "storing data all over the place," what they are really concerned about is locality of reference, i.e. minimizing the probability of a virtual-memory page fault... which involves stopping the process in its tracks, doing a physical I/O to retrieve the page from a spinning disk, then restarting the process where it left off. If you only do that "occasionally," as the operating system strives to enable you to do, then all is well. But if the number of page faults becomes, and stays, excessive, then the system is "thrashing." (The shape of the throughput curve in a thrash situation is elbow-shaped: beyond the thrash point, performance for pretty much the entire system degrades exponentially. As they put it in the movie Ghostbusters, the result is "real Wrath of God stuff.")

    I will admit that I really don't know how Perl stores a vector, or for that matter, a list or a hash. Concerns about cache-line misses are beyond my control, since they are wholly dependent upon the particular hardware my software might find itself running on. But locality of reference is something that I can readily keep in mind, as something that will very directly affect the execution of my code far more than the low-level perlguts of Perl's very highly optimized implementation.

    You can build any sort of data structure you may require in Perl, or, more likely, discover that CPAN has already done it for you. Red-black trees? Sure: Tree::RedBlack or Tree::RB, to name two.

    Nearly all of the time, when I stumble upon a failed program, the manifest symptom of the problem is either a timing hole or (more likely) thrashing. The solution is to choose a different algorithm to solve the problem. By the time I first see it, the code has been "diddled" to the point that it is frankly no longer even readable, and nothing has helped.

      When people talk about linked lists “storing data all over the place,” what they are really concerned about is locality of reference, i.e. minimizing the probability of a virtual-memory page fault ... which involves stopping the process in its tracks, doing a physical I/O to retrieve the page from a spinning disk,

      Downvoted: Locality of reference has exactly zero to do with page faults; zero to do with physical IO; and zero to do with spinning disk.

      Your whole post is total misinformation.


        reading the article on Locality of Reference on wiki... I guess sundialsvc4 has a point.

        No, he doesn't.

        In the context of the discussion in this thread -- the effects of locality of reference on the performance of arrays or vectors versus linked-lists -- the salient part of the wiki article is:

        Typical memory hierarchy (access times and cache sizes are approximations of typical values used as of 2013 for the purpose of discussion; actual values and actual numbers of levels in the hierarchy vary):

      • CPU registers (8-256 registers) – immediate access, with the speed of the innermost core of the processor
      • L1 CPU caches (32 KiB to 512 KiB) – fast access, with the speed of the innermost memory bus owned exclusively by each core
      • L2 CPU caches (128 KiB to 24 MiB) – slightly slower access, with the speed of the memory bus shared between twins of cores
      • L3 CPU caches (2 MiB to 32 MiB) – even slower access, with the speed of the memory bus shared between even more cores of the same processor
      • Main physical memory (RAM) (256 MiB to 64 GiB) – slow access, the speed of which is limited by the spatial distances and general hardware interfaces between the processor and the memory modules on the motherboard

      Since people don't appear to have bothered to watch the full video I linked above, here is the salient part of it (7:46). It'd be worth 8 minutes of anyone's time to watch it.



      When people talk about linked lists “storing data all over the place,” what they are really concerned about is locality of reference, i.e. minimizing the probability of a virtual-memory page fault
      No. The linked list performance problem discussed in this thread was not caused by virtual-memory page faults, but by CPU cache misses.
