Data structures in Perl. A C programmer's perspective.

Dear everyone,

Since this post I ended up in a job after all designated as a Software Engineer/iOS Software Developer. Its cool cuz I get to work on a Mac and since I want to pursue my master's after getting some experience (like 1-1.5 years), the skills I learn will be helpful. I have been kept on the "bench" as my Mac ("my Mac" as in the Apple Mini-Mac that the company has provided me) has not been "configured" yet so I pass my time in office trying to get hold of data structures in Perl. I cannot code anything for iOS without a Mac and I hate sitting idle.

Anyways, being a C programmer, I learned data structures the hard way... i.e. by doing-making mistakes-hating myself for them-retrying until I get it correctly (it was college and I had no social interactions whatsoever so it was just me and my laptop... :D). Applying a similar strategy to Perl, I tried to use the concepts I learned in C and apply them in Perl. I guess, C being the root of all languages (mostly all if some think otherwise), it will be easier to understand "PerlDSC" if we look at it from C perspective. I'll assume that programmers reading this have at least got their hands wet with pointers and linked lists in C and have some programming experience in Perl.

Doubly Linked Lists

To be frank enough, Perl's array are faster and more efficient (thanks to splice and friends) than using linked lists but there are situations where you might want a more complex data structure like a tree to represent your data. The good news is, CPAN has many libraries that we can use to make our life easier but for those (like me) who like to rip things apart, it will be a great exercise to ape the functions of a library just to understand how things work. So I aped the List::DoubleLinked library to make doubly linked lists.

An array is collection of similar type of data where each element is abut each other, i.e., it is a linear data structure with each element adjacent to each other, i.e. if an element is at address 1002 and it is an array of 2-byte integer, the next element will be at 1004. Got it? Good.

A linked list on the other hand subtracts the requirement of adjacency by adding a pointer to the next element in the list. Reasons? It enhances memory management because arrays are static (NOT IN PERL!), you just have to pass a reference to the start of the list and not the entire list (as in the case of arrays) to manipulate it etc. Every element in a linked list is called a node and every node can keep data and has a pointer to the next element. A Doubly linked list enhances this concept by adding a pointer to its previous element. This addition makes it easier to traverse and random access the list more easily.

In C we create a doubly linked list as follows (I give the struct as well as the function to create a node):

struct node {
    int data;
    struct node *next;
    struct node *prev;
};

struct node *start = NULL; // pointer to start of list.
struct node *end = NULL; // pointer to end of list

struct node * create (data)
    int data;
{
    struct node *temp = (struct node *)malloc(sizeof(struct node));
    temp->data = data;
    temp->next = temp->prev = NULL;
    return temp;
}

struct node * push(start, temp)
    struct node *start;
    struct node *temp;
{
    if(start == NULL) {
        // this node is the first node, assign it as is.
        start = temp;
    } else {
        temp->next = start;
        start->prev = temp;
        start = temp;
    }
    return start;
}
[download]

create function creates a node that has its data field set to user passed "data" and points the "prev" (pointer to previous node) and "next" pointers to NULL. push function inserts the node at the start of the list (assuming list grows towards left if you read from left to right).

The code is quite intuitive and very easy to understand (that is to say, if you know about pointers and structures and have played with them for a while). In Perl, focusing on OO Perl, the way to create and add a node is as follows:

package doublyLinkedList;

sub new {
    my ($class, @elements) = @_;    

    my $self = bless {
    start => undef,
    }, $class;
    
    $self->push(@elements);
    return $self;
}

sub push {
    my ($self, @elements) = @_;
    for my $element (@elements) {
    my $newNode = {
        element => $element,
        next => undef,
        prev => undef,
    }; # newNode becomes an anonymous hash ref.
        if(not defined $self->{start}) {
            $self->{start} = $newNode; # the list is empty, assign to 
+start as is
        } else {
            $newNode->{next} = $self->{start};
            $self->{start}{prev} = $newNode;
            $self->{start} = $newNode; # else make the new node point 
+to start and then assign new node as the new start of the list.
        }
    }
    return;
}
[download]

The new subroutine is a constructor. The bless function ties an anonymous hash to the class name thereby defining an object of the class (or package in Perl lingo). The push subroutine creates a new node by making a reference to an anonymous hash. Contrasting it with C, where we explicitly allocate memory (recall malloc), Perl implicitly allocates enough memory for containing the hash and the variable $newNode points to (read: references) this hash. The rest of the logic remains same and those cryptic looking braces and if conditions are just syntactic sugar.

Extending this to pop operations, C can achieve pop as follows:

void pop(start)
    struct node *start;
{
    if(start == NULL) {
        printf("The list is empty!\n");
        return;
    }
    struct node *temp = start;
    start = temp->next;
    temp->next = start->prev = NULL;
    free(temp);
    return; # or maybe we can return the element popped before freeing
+ temp or whatever.
}
[download]

Cool, ain't it? You check if the list is empty and if not, you assign the start pointer to some temporary variable and point start to the next node (which will be the next top node... think stacks!). We then point the "prev" pointer to NULL indicating that the new position that start points to is the new start.

Perl way:

sub pop {
    my $self = shift;
    my $ret = $self->{start};
    return if not defined $ret; # return if the list is empty.
    $self->{start} = $ret->{next};
    $self->{start}{prev} = $ret->{next} = undef;
    return $ret->{element};
}
[download]

This is cool too, the logic is same and the only difference is the syntactic sugar. Perl automatically takes care of garbage collection so there is no need to free explicitly. IMO, if you're comfortable playing with DSs in C, it is then just a matter of learning Perl's syntax.

The basic data structures in Perl (arrays and hashes and their derivatives (AoH, HoA etc.)) are enough most of the times but what if we have to implement a RedBlack tree? or some other complex DS? C is my standard choice for them because handling pointers in C is like blissful but Perl's OO paradigms and syntactical sugar (like condensed regexes and stuff) makes it a useful language to handle such complexities. Also, as you can see, the code in C and Perl are exactly same if you ignore the syntax.

Just for the sake of practice and lucidity in writing Perl, I implemented the whole doubly linked list library... though it is almost a direct rip-off of the aforementioned library but hey... I'm getting detailed `know how' about how things work in Perl.

Comment on Data structures in Perl. A C programmer's perspective. Select or Download Code

Replies are listed 'Best First'.
Re: Data structures in Perl. A C programmer's perspective. by davido (Cardinal) on Sep 06, 2013 at 17:08 UTC
You might enjoy Mastering Algorithms with Perl, from OReilly. It's a little old, but good algorithms and datastructures are mostly timeless. The book does a nice job of presenting data structures and algorithms in Perl... even if the Perl spoken is from the early 2000's. Yes, Perl has arrays that can do many of the things you might do with a doubly linked list; push, pop, shift, unshift, and splice. But there are tradeoffs in everthing. A linked list can do inserts in O(1) time. Arrays in Perl can grow substantially without needing to be moved around in memory, but focusing on splice, it's obvious that an insert anywhere aside from the top or bottom of the array is going to consume O(n) time. If you insert a new element between `$a[5]` and `$a[6]`, you're going to have to shift 6 and everything thereafter to make room, and that's a linear-time operation whether it's being done in Perl or C. On the other hand, lookups in a linked list are O(n). You can't binary search a linked list. You don't have random-access. Everything search is O(n) unless you maintain some crossreference index, which is basically just throwing another datastructure at the problem. The main reason that arrays are usually chosen over linked lists in Perl is because arrays are implemented entirely "in Perl's guts", whereas most linked list implementations are in Perl code. So unless the characteristics of the problem are some edge case where you're inserting in the middle of the list frequently but rarely doing searches, arrays will be faster to use. Trees, on the other hand, are really useful, even in Perl. Dave	[reply] [d/l] [select]
Re: Data structures in Perl. A C programmer's perspective. (vector vs linked list performance) by eyepopslikeamosquito (Archbishop) on Sep 06, 2013 at 22:42 UTC
Be warned that with modern CPUs, the conventional wisdom that linked lists should be preferred to vectors when performing many insertions and deletions in the middle of the list is almost always wrong. From bulldozer blog: The vector still kills the linked list in this test. And that’s all because vector stores its data in a contiguous block of memory. The vector has so much better performance (and as you increase the size of your data structure, the difference in performance becomes much more pronounced) because it accesses memory linearly. The memory is so slow when you access it randomly, that the penalty for cache misses almost always dominates your CPU usage time. If you care at all about performance, the main memory should be considered a sequential-access device. You have been lied to all your life; RAM is not a random-access memory device! This has been true in the last few years and in the future it will only become more so, barring any true revolutions in hardware technology (think Ansible!) or a fundamental change in computing model (think abandoning the von Neumann model,) of which there is no sign yet (and it would take decades, even if we had an alternative model.) This particular performance killer will get worse and worse with time. References bulldozer blog: C++ vector vs list performance Stack Overflow question on C++ list vs vector blog: C++ vector vs list performance (broken link) Number crunching: why you should never use linked-list in your code again Bjarne Stroustrup: Why you should avoid Linked Lists (3:30-5:00) (youtube: from GoingNative 2012 Keynote) Herb Sutter talk (youtube, around 48:40 mark) References Added Later What Every Programmer Should Know About Memory (pdf) by Ulrich Drepper The 1021 Problem (Part 3) (see opening quote from Bjarne Stroustrup and the Prefetching section) The 1021 Problem (Part 2) (see "Modern Memory Architecture" section) Re^3: [OT:] Is this Curriculum right? (my provocative response in a thread by the inimitable karlgoethebier) (Nov 2021)	[reply]
Re^2: Data structures in Perl. A C programmer's perspective. by BrowserUk (Patriarch) on Sep 07, 2013 at 00:33 UTC
See also The man that knows. Watch it 3 times. Then think about it for a week; then watch it again. That guy packs so much salient information in each non-committal sentence; he should be cited as much as an accomplished politician as he is a programmer. Damn! I would love a 2-hour argument with that guy. He would deflate me completely, but what a learning experience. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^2: Data structures in Perl. A C programmer's perspective. by bulk88 (Priest) on Sep 07, 2013 at 06:35 UTC
I'll simplify it. RAM is somewhat random access. It is better than Flash memory with Flash's multi-KB minimum erase size. But RAM has a huge delay for the 1st 8 byte random read/write. The next 4 to 8 (depending on RAM design), 8 byte blocks, are free. If after the 1st 8 byte block, you request another block, before you get 1st block of 2nd random request, the time to have delivered blocks 2-4 of the 1st random read will have gone by with the RAM bus being idle. You might as well have listened for blocks 2 thru 4 and put them in CPU cache speculatively, otherwise you just wasted RAM bus time. Core 2 CPUs have a cacheline of 64 bytes. That design hints that the CPU will read sequentially all the data that it can (16/32/64 bytes) so the RAM bus doesn't go idle. It also means to read 1 random byte from RAM is to read atleast 16 bytes from ram. You might as well use bytes 2 to 16 for something as a C programmer.	[reply]
Re^2: Data structures in Perl. A C programmer's perspective. by code-ninja (Scribe) on Sep 07, 2013 at 03:47 UTC
I've always been told that linked lists are faster (in algorithms course in college). I never really understood why people say that when arrays give indexed access to its elements.I mean its cool that linked lists gives dynamic memory allocation and all but the trade-off is too high. As davido points out, everything will require O(n) time but arrays can do some stuff in O(1) time (fetch an element at an index). Trees and other DSs maybe useful (Perl provides so many already) but linked lists are just a learning exercise for a coding craved mind. PS: RAM is not a random-access memory device! I'll need something in written to digest that! :P	[reply]
Re^3: Data structures in Perl. A C programmer's perspective. by davido (Cardinal) on Sep 07, 2013 at 04:56 UTC
My post was really about order of growth for various tasks given two different data structures. An array stores its data linearly, in contiguous memory. If you want to find the "nth" element, you go straight to it (it's a simple mathematical computation, internally). But if you want to insert between "n" and "n+1" you must shift everything from n+1 through the end of the array out one spot, and hopefully not have to copy the entire array to a larger block of contiguous memory. A linked list stores its data all over the place. Each element just points to the location of the next one. Finding the 'nth' element means starting at the top and counting until you get there. But once you're there, inserting between n and n+1 is simply a matter of changing where a couple pointers point. So an array has O(1) lookups, and O(n) insertions and deletions. A linked list has O(n) lookups, and O(1) insertions/deletions. But big-O is concerned with orders of growth. There are many other factors that don't get considered when looking at orders of growth of computational complexity. But I also mentioned that Perl's arrays, being implemented at the "guts" level, are very fast. And most linked list implementations in Perl are implemented in Perl code, which is comparatively slow. Add to that the things that eyepopslikeamosquito mentioned, and it makes it very hard for linked lists to be the data structure of choice. ...even when Perl's arrays have so much behind the scenes overhead going on, they're still quite hard to beat. That doesn't mean that we have to throw out the window everything we learned in comp sci. It just means that technology has found ways to bend the rules-of-thumb. A binary tree is still an excellent data structure for certain uses. Linked lists can still make sense for some tasks. A trie is hard to beat for certain types of lookups. I guess what we have to keep in mind is that now more than ever one must fully understand the nature of his data, the usage patterns expected, and the tricks the hardware and compilers can play. I don't know too much about the guts-level implementation of Perl's hashes, but it's my understanding that the implementation hashes the key to derive a bucket, and that when there's a collision (more than one item in a bucket), the buckets' contents are stored in linked lists. Dave	[reply]
Re^3: Data structures in Perl. A C programmer's perspective. by eyepopslikeamosquito (Archbishop) on Sep 07, 2013 at 04:13 UTC
RAM is not a random-access memory device! I'll need something in written to digest that! The (still relevant) and classic paper What Every Programmer Should Know About Memory (pdf) by Ulrich Drepper should help you digest how memory really works on modern hardware. Or maybe it will give you indigestion. :) The summary is that cache misses on modern CPUs are mind-bogglingly expensive. Data locality is crucial. Update: Prefetching can help - see, for example, the "Prefetching" section at The 10**21 Problem (Part 3).	[reply]
Re: Data structures in Perl. A C programmer's perspective. by RichardK (Parson) on Sep 06, 2013 at 15:34 UTC
Really, I just don't understand what point you are trying to make. Perl already has functions to manipulate arrays and will do everything you need without having to write all that code. see push pop shift unshift and splice. It's usually a mistake to mechanically translate C structures into perl, as perl has many more features available (hashes for one). It's OK to write a linked list in perl as a learning exercise but I wouldn't use it for anything serious.	[reply]
Re^2: Data structures in Perl. A C programmer's perspective. by code-ninja (Scribe) on Sep 06, 2013 at 16:25 UTC
:-) Well, I'm not trying to make a point, thats the point. Its just I was sitting idle and I've been dabbling with Perl for almost a month now, I was just trying to get my understandings and learnings commented on :-). Yeah accepted that Perl has many more features and as you said, this was a learning exercise... I was just trying to relate what I already know with what is new. Kinda umm mnemonics if you may.	[reply]
A reply falls below the community's threshold of quality. You may see it by logging in.

Back to Meditations