
How to document complex data structures?

by vrk (Chaplain)
on Oct 21, 2007 at 13:14 UTC ( #646265=perlquestion )
vrk has asked for the wisdom of the Perl Monks concerning the following question:

Honourable monks,

I often have the same problem when using complex data structures (HoH, HoA, generally /([HA]o)+[HA]/): when passing data structures to functions or returning data structures, how do you document the form of the data for the user of the function?

Perhaps the problem is an indication of the ad hoc nature of the data structures, but suppose you have a function that returns

  • an ID number and a numerical value for that ID number as a hash key and value
  • which are hashes under the hash keys 'changes' and 'originals'
  • with the mean and variance for both under hash keys 'means' and 'variances', which are arrays
  • and all these being an item in an array of similar hashes.

Now, one way I am inclined to document this would be to use straight Perl syntax in describing an "abstracted" item in the data structure, meaning that the parts that can vary are variables:

    [
      {
        originals => { $id => $val, ... },
        changes   => { $id => $val, ... },
        means     => [ $originals_mean, $changes_mean ],
        variances => [ $originals_variance, $changes_variance ],
      },
      ...
    ]

Here the ellipsis means that the substructure contains identical entries where the parameterized parts (variables) vary. Literal names or values would be just those: literals. The benefit of this is that you can now refer to particular values in the data structure simply by using the variable name (such as $id) in describing what they actually contain. Another benefit is that you can use this "template" in traversing the data structure or accessing values.
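
As a minimal sketch of that second benefit, the names from the template can be reused as loop variables when walking the structure. The sample values and the name $stats are invented for illustration; only the shape comes from the template above.

```perl
use strict;
use warnings;

# Sample data in the documented shape (all values are made up):
# [ { originals => { $id => $val, ... }, changes => { $id => $val, ... },
#     means     => [ $originals_mean, $changes_mean ],
#     variances => [ $originals_variance, $changes_variance ] }, ... ]
my $stats = [
    {
        originals => { 1 => 10, 2 => 20 },
        changes   => { 1 => 1,  2 => 3 },
        means     => [ 15, 2 ],
        variances => [ 25, 1 ],
    },
];

my @lines;
for my $item (@$stats) {
    # Positional entries are unpacked using the names from the template.
    my ( $originals_mean, $changes_mean ) = @{ $item->{means} };

    for my $id ( sort { $a <=> $b } keys %{ $item->{originals} } ) {
        my $val = $item->{originals}{$id};
        push @lines, "id $id: original $val (mean $originals_mean)";
    }
    push @lines, "mean of changes: $changes_mean";
}
print "$_\n" for @lines;
```

Because the variable names match the documentation, a reader can check the traversal against the template line by line.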

However, given complex enough data structures, this will be unwieldy to both write and read. Besides this, unless the values are identical in the above, $val should in fact be two different variables, signifying that they do not depend on each other (while $id could be the same, since the same values occur in both 'changes' and 'originals' hashes). Larger data structure means that there will be even more points where you must name the varying values.

Surely there must be better ways. How do you do it?

print "Just Another Perl Adept\n";

Replies are listed 'Best First'.
Re: How to document complex data structures?
by wazzuteke (Hermit) on Oct 21, 2007 at 15:31 UTC
    That's always an interesting question when it comes to non-typed languages. There are a couple things I try to keep in mind when commenting on the method signature of most things I'm working on:

    1.) Return more than one value
    In my opinion, data structures are like objects in that they need to have a generic, singular purpose. I try to avoid having to explain why a hash-ref or an AoH has data on three different and otherwise exclusive things just to maintain a single reference in code. That's just as confusing as anything. In your example, you have an array returning hash-refs that are completely different from one another. I can't, for example, take the list and iterate over it expecting a single DDL. I would either have to test for a key, or shift off the stack and hope things are in the right order. There is no real good way to document something like this.

    A recommendation on this would be to simply break up the AoH into singular hashes. For each, either return a defined structure from a single method (i.e. return $href_original, $href_changes, ...), or have many methods that each return the respective piece of data. This way, you can not only comment on what is happening, but the code is a bit more self-documenting as well.
    Such as:

        #
        # Intended to be the caller
        #
        my $originals = $self->get_originals();
        my $changes   = $self->get_changes();

        #
        # Intended to be the 'offending' package
        #
        ...
        sub get_originals { return $_[0]->{'original'}; }
        sub get_changes   { return $_[0]->{'changes'} }

    2.) Don't be afraid to comment with structures and text
    The point being, really, that commenting with a pseudo-structure is sometimes the best way to describe what something should look like. Furthermore, comments aren't usually aimed at yourself (at least I tend to remember code I've written, for the most part). That said, if a cohort doesn't understand the look of a complex data structure, either a) they shouldn't be programming Perl, or b) they will ask someone anyway (me or otherwise). Adding text around the structure to explain what the keys *are*, not just their data type, is always useful.

    This could really go on and on, and I'm sure by the time I've posted this there are already three other good examples. The point I am really trying to make is: make sure your structures make sense in the first place, *then* don't be afraid to use pseudo-code as a descriptive device in the comments of a method signature.

    perl -le '$.=[qw(104 97 124 124 116 97)];*p=sub{[@{$_[0]},(45)x 2]};*d=sub{[(45)x 2,@{$_[0]}]};print map{chr}@{p(d($.))}'
Re: How to document complex data structures?
by sfink (Deacon) on Oct 21, 2007 at 18:57 UTC
    I think this is a very important question, because it's so critical to keeping things straight and Perl's syntax doesn't make it very obvious what's going on. I have many scripts that I can easily dive back into as a result of my data structure comments, and that I wouldn't have a hope of interpreting a month later without the comments.

    I use something sort of similar to what you are proposing, except I don't use $placeholders, I keep it English-like, and I modify some of the syntax.

    In brief, quoted strings are literals, unquoted strings are abstract data type descriptions (possibly defined explicitly later). (parens) mean a list, [brackets] mean an array ref, and {curlies} mean either a hash or hash ref (they're easily distinguished by a human, since it can only be a hash at the outermost level.) <angle brackets> mean an array ref with a fixed number of elements -- in other words, a tuple. If things get complicated, I use a simple word or phrase as a symbol that I expand later (using a colon to denote "is of type").

    So for example:

        # { student name => grade }
        my %grades = ( "Bob" => 3.2, "Alice" => 3.9 );

        # { node name => [ child node ] }
        my %graph = ( A => [ "B", "C" ], B => [ "D" ] );

        # { student name => { "age" => age, "grade" => GPA } }
        my %students = ( "Bob" => { age => 34, grade => 3.2 } );

        # { name => <age, children, dirty flag> }
        #   where children : { child name => 1 }
        my %table = ...;
        print "Bob is $table{Bob}->[0] years old.\n";
        print "His children are ", join(" ", keys %{ $table{Bob}[1] }), "\n";
        print "Is Guido his child? ";
        # Outer parens are needed; otherwise print(...) ends before the "\n".
        print( ($table{Bob}[1]{Guido} ? "yes" : "no"), "\n" );

        # My version of your example
        #
        # Returns: [ stat chunk ]
        #   where stat chunk :
        #     { "originals" => { id => value },
        #       "changes" => { id => value },
        #       "means" => <mean of originals, mean of changes>,
        #       "variances" => <variance of originals, variance of changes>
        #     }
        #
        # or you could give a symbolic name to the { id => value } things.

    This loses some nice properties of what you wrote -- eg, you can no longer use it as a direct template for constructing expressions. But I find that being able to drop into English for anything complicated is a good tradeoff for losing the additional rigor.

    Then again, I do see the use of a rigorous description -- a couple years back, I wrote a tool that took a syntax somewhere between yours and mine and automatically applied it to a value to extract out the desired data. I won't describe the exact syntax, but it allowed things like:

        my %map = ( a => 1, b => 3, c => 2 );

        # Set @keys to ('a', 'b', 'c')
        # and @values to (1, 3, 2)
        match('{ @keys => @values }', \%map);

        my %map2 = ( a => { name => "Bob", children_ids => { Alice => 132, Mabel => 81 } });

        # Set $etty to "Mabel"
        match('{ ? => { children_ids => { $etty => 81 } } }', \%map2);

    and similar tricks. I found it quite handy, except that I really wanted it to declare the variables (@keys, @values, $etty) lexically in addition to initializing them, and I did this before I had heard of source filters. Or perhaps before they existed; I'm not sure.
Re: How to document complex data structures?
by tilly (Archbishop) on Oct 21, 2007 at 21:35 UTC
    It depends on the situation.

    First of all, if I have a regular structure, I can document it with long descriptive variable names. For instance $orders_by_user_by_month is going to be a hash of hashes of arrays. And the keys to use to access it are documented in the name.
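
    A minimal sketch of how such a name reads at the point of use. The users, months and order labels are invented for illustration.

```perl
use strict;
use warnings;

# { user => { month => [ order ] } }
# Users, months and order labels below are made up.
my $orders_by_user_by_month = {
    alice => {
        '2007-09' => ['order-17'],
        '2007-10' => [ 'order-21', 'order-22' ],
    },
    bob => { '2007-10' => ['order-23'] },
};

# The name spells out the access path: first the user, then the month,
# and what comes back is a list of orders.
my @alice_october_orders = @{ $orders_by_user_by_month->{alice}{'2007-10'} };
print scalar(@alice_october_orders), " orders\n";
```

    The key order in the name mirrors the nesting order, so a lookup written the wrong way round tends to look wrong as well.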

    Secondly if a structure gets too complex, don't be afraid to hide details behind objects. For instance in the previous example, an order might well be an object with lots of data about the order. So I really had a hash of hashes of arrays of hashes.

    Thirdly unless I really need it, I get very scared of positional data. Because it is easy to mix it up. Which means that I either don't use it, or I try to limit the use of it to a small portion of code. For instance in your data structure I dislike the positional information in means and variances.

    So in your case I'd suggest something like this:

        [
          {
            original => {
              mean     => $original_mean,
              variance => $original_variance,
              values   => { $id => $value, ... },
            },
            changed => {
              mean     => $changed_mean,
              variance => $changed_variance,
              values   => { $id => $value, ... },
            },
          },
          ...
        ]

    and then (depending on how it is used) I'd hide the details of the original and changed data structures behind an object. After which you could choose to have it calculate the mean and variance on the fly. Also you'd probably come up with some opportunities to clean up the code which produces it.
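
    One way this could look, as a rough sketch: the class name ValueSet and its methods are hypothetical, and the variance shown is the population variance (dividing by n), which may or may not be what the original code computes.

```perl
use strict;
use warnings;

package ValueSet;    # hypothetical class name

# new( { $id => $value, ... } ) copies the hash so callers cannot alter it.
sub new {
    my ( $class, $values ) = @_;
    return bless { values => {%$values} }, $class;
}

sub value {
    my ( $self, $id ) = @_;
    return $self->{values}{$id};
}

# Mean computed on the fly; undef for an empty set.
sub mean {
    my ($self) = @_;
    my @v = values %{ $self->{values} };
    return undef unless @v;
    my $sum = 0;
    $sum += $_ for @v;
    return $sum / @v;
}

# Population variance (divides by n); use n - 1 for a sample variance.
sub variance {
    my ($self) = @_;
    my @v = values %{ $self->{values} };
    return undef unless @v;
    my $mean   = $self->mean;
    my $sum_sq = 0;
    $sum_sq += ( $_ - $mean )**2 for @v;
    return $sum_sq / @v;
}

package main;

# Each item in the returned array now holds two objects instead of raw hashes.
my $item = {
    original => ValueSet->new( { 1 => 10, 2 => 20 } ),
    changed  => ValueSet->new( { 1 => 1,  2 => 3 } ),
};
printf "original: mean %s, variance %s\n",
    $item->{original}->mean, $item->{original}->variance;
```

    With this shape the documentation only has to describe two keys and one small class, and the positional means and variances arrays disappear.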
Re: How to document complex data structures?
by planetscape (Chancellor) on Oct 22, 2007 at 03:49 UTC

    If you need to produce pretty pictures to go along with your documentation, you might want to take a look at the last several ideas here: How can I visualize my complex data structure?

    However, as the data struct gets bigger and more complicated, so do the pretty pictures. But it may help.


Re: How to document complex data structures?
by DrHyde (Prior) on Oct 22, 2007 at 09:32 UTC
    I describe the structure in plain English and then give an example. I might say something like ...
        The foo() function returns an arrayref, each element of which
        represents a bar that I have visited. Each bar is a hashref with
        the following keys:

        =head2 name, address, phone

        The bar's name, address and phone number

        =head2 good_beer

        An optional key, which if present will be a hashref of beers,
        whose keys are the beers' names and the values are the breweries
        that make them.

        =head2 cute_staff

        An optional key, which if present will be an arrayref (one element
        per cute staff member) of hashrefs, with keys 'name', 'sex',
        'preferred_sex' and 'phone'.

        ...

        For example:

          [
            {
              name      => 'The Traditional Pub',
              address   => '13 Wharf St, Workingclasstown',
              phone     => '01234 567890',
              good_beer => {
                'Tanglefoot'   => 'Badger',
                'Paradise Ale' => 'Theakstons',
                'Special'      => 'Youngs'
              },
              cute_staff => [    # array in case there's two with the
                {                # same name
                  name          => 'Jill',
                  sex           => 'F',
                  preferred_sex => 'M',
                  phone         => '01234567899'
                }
                ...
              ]
            },
            ...
          ]

Re: How to document complex data structures?
by Cop on Oct 21, 2007 at 15:24 UTC

    Data-dump it, and put comments alongside the lines.
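
    A minimal sketch of that approach, assuming the core Data::Dumper module is what is meant by "data dump"; the sample values are invented.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # stable key order, so the dump is reproducible
$Data::Dumper::Indent   = 1;    # fixed, compact indentation

# One item in the shape from the question (values are made up).
my $stats = [
    {
        originals => { 1 => 10 },
        changes   => { 1 => 1 },
        means     => [ 10, 1 ],
        variances => [ 0, 0 ],
    },
];

# Capture the dump as a string; paste it into the docs and annotate by hand.
my $dump = Dumper($stats);
print $dump;
```

    The printed dump can then be pasted into a comment or POD block, with a trailing "# ..." note added to each line that needs explaining.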

Re: How to document complex data structures?
by Anonymous Monk on Dec 12, 2013 at 09:40 UTC
    These days you might try
    Data::Sah - Fast and featureful data structure validation
    Sah - Schema for data structures (specification)

Node Type: perlquestion [id://646265]
Approved by Corion
Front-paged by grep