Beefy Boxes and Bandwidth Generously Provided by pair Networks RobOMonk
more useful options

Nested data structures... nasty?

by vladb (Vicar)
on May 28, 2002 at 14:31 UTC ( #169779=perlmeditation: print w/ replies, xml ) Need Help??

Hail thee fellow brethren!

The term ‘data structures’ inherently implies nested structures built up from other simpler structures to tie in a certain relational bond. While I have seen a number of articles/books dedicated to the sole subject of proper data structures design principles and practices, I couldn’t full together enough resources related specifically to Perl language. Although, this material was useful in that it gave me a good grounding in the general theory that seems to permeate all languages and cultures, I still feel the need for discussing topics specific to Perl. For the way structures are built in Perl may differ substantially from other languages such as C, Java, or even Pascal.

Perl is a very liberal language when it comes to building complex data structures. It is much simpler to build nested HoH or AoH structures in Perl than C, for example. However, as with the rest of the language, this may be subject to severe abuse especially at the hands of an inapt developer. In a couple perl scripts that I had to either maintain or review, I have seen data structures that exhibited the symptoms of a cancerous tumor whereby they would expand uncontrollably at run-time gobbling all available process memory thereby sending the surrounding system into abyss. On the other hand, I had also come across some very clever data structures that bore much sense and thereby facilitated overall understanding of the script, it’s algorithms used, and data relationships (actually, predominant number of modules approved on CPAN may contain examples of proper data structures handling…).

To summarize my past (solitary) meditations, let me bring up an itemized list of benefits and disadvantages of complex data structures in Perl. Remember, whether these hold true in any given case largely depends on the qualifications of individual ‘hacker’.
The benefits of elaborate data structures:

  • Simplify algorithms operating on chunks of data.
  • Clarify data relationships.
  • Simply subroutine parameter passing (use a hash ref or hash in place of a list of arguments).
  • Increase overall code maintainability.
  • Reduce run-time memory requirements.
  • Dare to add more to this list? :)

Drawbacks of complex data structures:

  • Well, added complexity to deal with on top of already ‘complex’ (pun intended) language.
  • Memory requirements (both run-time and when storing in a static form on the hard drive etc).
  • Tendency of sloppy data structures to obscure program logic.
  • Dare to add more to this list? :)

Note: I would like to also include sample code and more ‘discussion’ on each listed item as time allows. Most likely, I’ll add a few in the follow up posts to keep this node within a reasonable size. By the way, your input will also be greatly appreciated! ;-)

$"=q;grep;;$,=q"grep";for(`find . -name ".saves*~"`){s;$/;;;/(.*-(\d+) +-.*)$/; $_=<a HREF="/"> "," $2 </a>;`@$_ +`?{print"+ $1"}:{print"- $1"}&&`rm $1`; print$\;}

Comment on Nested data structures... nasty?
Download Code
Re: Nested data structures... nasty?
by FoxtrotUniform (Prior) on May 28, 2002 at 15:38 UTC

    The only major thing that irritates me about Perl's way of doing complex data structures is that the specification is implicit: you never sit down and declare a "Hash of Hashes of Arrays", for instance: you put arrayrefs in a hash, then a ref to that hash in another hash. And you never sit down and code up a list of admissible hash keys... if you haven't documented what's expected to be in the hash, and where it's inserted, your code can be pretty impenetrable.

    On the whole, I think this way of doing things is better than the alternative: well-defined, rigid data structures described at compile time. (Yes, folks, I like LISP too -- conses are your friends. :-) What irritates me is that these "implicit" structures imposes an extra documentation burden, and one that's not always easily satisfied in code. Rob Pike claims that it's better to put the complexity in the data than the code, and I agree, but that doesn't help you if the data is undecipherable.

    Update: Another possible disadvantage of Perlish (or LISPy, HaskellicSchemeing, etfc) "implicit, on the fly" data structures is that some compiler optimizations are difficult or even impossible. NOTE: I haven't actually tested these assumptions empirically; for one thing, I'm at work, and don't have the time. I'm going on my understanding of modern computer architecture and compiler optimization, which may be sorely lacking. Thou hast been warned.

    For instance, if I want a 3d vector, I could write the following C:

    typedef struct { float x; float y; float z; } vec3d;

    Now, the compiler knows that a vec3d takes up exactly 12 bytes (assuming 4-byte floats), and can make a bunch of optimizations based on that knowledge and some information about the processor it's compiling for. For instance, it can pad the struct out to 16 bytes if addressing is faster on 16-byte boundaries. It can take a declaration like:

    vec3d vertices[20];

    and allocate 240 (or 320) contiguous bytes for it.

    Contrast that to the implicit equivalent:

    my %vert = ( 'x' => undef, # placeholder 'y' => undef, 'z' => undef, );

    Perl knows nothing about the size of this hash: it has three elements now, but there's nothing stopping you from adding a dozen more in the very next statement. And since the size of the hash isn't guaranteed, the best Perl can do if you put twenty vertices in an array is allocate twenty contiguous references: that's better than nothing, but unless you're improbably lucky careful in your memory management, the hashes those refs point to are going to be scattered all over memory, which means cache misses and subsequent stalls.

    The Perlish alternative is to create data structures in scalars with pack, which is just fugly.

    The "Perl is slower than C!" thesis shouldn't be any surprise to anyone, of course. I just want to point out that the flexibility of hashed (hashish? ;-) structures is sometimes a disadvantage. Again, I want to emphasize that I tend to like this way better, but I'd be a fool to pretend that it's perfect.

    Update 2: Minor grammar corrections.

    Update 3: I'm not picking on Common LISP in particular, just the "add stuff on the fly" way of building complex structures, which you can do in Common LISP (or any other LISP, for that matter), and in my experience is the most common way of building structures.

    A good general rule for finding language features in Common LISP is "it's in there somewhere". :-)

    The hell with paco, vote for Erudil!

      I agree pretty much totaly. ("Never say never".)

      This is going to be one of the nice things about perl6 -- you can nicely mix both styles, and declare a hash with limited namespace, each entry of which has a specific type, and thus get somthing like a C struct with known size and thus more optimizable.

      You can also have limited namespace but not declare the types of the entries strictly. Or pretty well anything else you can think of.

      Because the parrot virtual machine resembles a real machine, we can pull on all the experince with compiler optimization to real (register) machines.

      We are using here a powerful strategy of synthesis: wishful thinking. -- The Wizard Book

      Common Lisp has had those kind of optimizations for ages. You tell it what your variables hold, and how much effort it should spend compiling for speed, and it figures out the rest. The only catch is that it still does garbage collection.

      For a real example of Common Lisp used in a high-performance situation, read this description of how Orbitz works.

      One of the key places this problem is really a problem is in the internal data structure of objects. Of course an object can be a blessed just-about-anything, but most people use a hash, and store the object state in there.

      In Perl 5 you have 'fields' which can help clean up the problem. I also really like the way accessors are handled in the reformed-perl OO syntax module: (my writeup, authors writeup).

Re: Nested data structures... nasty?
by Abigail-II (Bishop) on May 28, 2002 at 16:16 UTC
    It's a myth that Perl has more complex datastructures than other languages. It doesn't. In fact, Perl doesn't even have complex datastructures build in the language. All it has are scalars, arrays, hashes, and references. That's it.

    It's a mistake to consider $foo [$bar] {$baz {$quux}} to be a complex datastructure. It ain't. It's still arrays and hashes. One does the same thing in C, except that one is more likely to use a struct instead of a hash. But in C, that still isn't complex. Now, a five dimensional partition tree using epsilon nets, with associated k-d trees in the nodes, that's a complex datastructure.

    The only thing "complex" about $foo [$bar] {$baz {$quux}} is the long single expression to get an element out of it. But you have your data, and you need to store you data somewhere. Preferably in a way to retrieve your elements. And regardless whether you use arrays, hashes, or objects, you have two options. Either you nest, or you don't. And if you nest, you either use long single expressions to get to the data, or many short expressions. (Note that this is independent of the language you are working in). If you don't nest, I doubt your datastructure becomes any clearer.

    BTW, I don't agree with Rob Pikes claim which FoxtrotUniform brings up. Nor do I agree with Wirth's "Algoritms + Datastructures = Programs". I believe that datastructures imply the algorithms. One cannot choice a datastructure and afterwards come up with the algorithm. The algorithm has to follow the rigids of the datastructure.


      You said:
      Now, a five dimensional partition tree using epsilon nets, with associated k-d trees in the nodes, that's a complex datastructure.
      Seems arbitrary to me. How about 4 dimensional? Or 3 dimensional using red-black trees? Or AVL trees? Where in particular do you draw the line in the data structure sand that says complex on this side, everything else on that side? I'm not saying that complex data structures don't exist, I'm just wondering about how you distinguish complex from not complex...


      "Never try to teach a pig to sing…it wastes your time and it annoys the pig."
        I'm just wondering about how you distinguish complex from not complex.

        Well, I don't think it's possible to draw a sharp border. Just as it's almost impossible to draw the line when you call it "light" or "dark". That doesn't mean that in a lot of cases, it's clear when it's "light" or "dark". ;-)

        Anyway, for me, to call a datastructure complex it usually satisfies these points:

        • It stores elements where the place in the structure implies (or is implied by) a (spacial) relationship between the elements.
        • It's optimized such that queries and/or updates can be done in less time than a full scan, or a rebuild of the entire datastructure.
        • It will often have some kind of associated datastructure.
        But I don't apply the rules very strict.


Re: Nested data structures... nasty?
by perrin (Chancellor) on May 28, 2002 at 17:15 UTC
    I'm not quite sure what point you're trying to make here. I don't see using complex data structures or not as a choice. You should just use the simplest data structure that allows you to work with your data in the way you need to. It sounds like you're saying that inexperienced coders often make incorrect data structures, but that could be said about any number of things and is not specific to Perl.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://169779]
Approved by redsquirrel
Front-paged by Aristotle
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (12)
As of 2014-04-18 14:11 GMT
Find Nodes?
    Voting Booth?

    April first is:

    Results (469 votes), past polls