Project Metadata Model

My head is sore from knocking up against a number of project development issues that I now think are related. Some have chided me for my obsessions, but if these issues are trivial, why aren't there good solutions? When I hear many shouts of "Oh, that's done" and each wagging finger points to a different shortest distance between points, I'm in doubt. And there seems to be no greater disagreement than about overall project management, structure, development, and deployment.

Apart from a few contrarians, everyone agrees that it's wise to use strict, wise to use no hard tabs in code, unwise to write spaghetti code. When dealing with databases some use should be made of DBI; when parsing command line arguments, Getopt::Long is still the favorite; when testing, Test::More and Test::Harness are standard. Why then does the picture get so blurry when the camera zooms out?

This promises to be a big project itself with many smaller projects attached to it. So I'm not interested in any solution bound to a limited number of platforms or languages. I run Perl on Linux so that will be the reference implementation. Others will have different needs. Therefore the good solution will be platform and language agnostic.

What Metadata?

I've been avoiding the word metadata; it's too heavily overloaded. I see people using it when they mean description or index. It's also a favorite playground for CS theorists and data professionals. But I must admit that many of my outstanding issues seem to tie into the same thing, which properly is project metadata.

Rather than attempt an abstract definition I'll give examples:

Test scripts can be chosen from, and organized into, a suite on many criteria; any of which might be considered metadata:
- ideal execution sequence
- if the script should ship with the project
- optional or mandatory
- original author and any later contributors
- general type of test: api, unit, etc.
- target module
- unit or feature tested
- pass/fail history
- testing approach used and test modules required
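
As a rough sketch of how such criteria might drive suite selection (written in Python purely for illustration; every key name and file name below is a hypothetical choice, not part of any defined standard), each test script gets a small record and a suite is just a filtered, ordered view of those records:

```python
# Hypothetical metadata records for three test scripts; every key name
# here is an assumption made for illustration, not a defined standard.
TESTS = [
    {"file": "t/20-api.t", "sequence": 20, "ships": True, "mandatory": True,
     "type": "api", "target": "Train::Set"},
    {"file": "t/10-load.t", "sequence": 10, "ships": True, "mandatory": True,
     "type": "unit", "target": "Train::Set"},
    {"file": "t/90-author.t", "sequence": 90, "ships": False, "mandatory": False,
     "type": "unit", "target": "Train::Set"},
]

def select_suite(tests, **criteria):
    """Return the files whose metadata matches every criterion, in run order."""
    chosen = [t for t in tests
              if all(t.get(key) == value for key, value in criteria.items())]
    return [t["file"] for t in sorted(chosen, key=lambda t: t["sequence"])]

release_suite = select_suite(TESTS, ships=True)   # what goes in the tarball
unit_suite = select_suite(TESTS, type="unit")     # what runs in the tight loop
print(release_suite)
print(unit_suite)
```

The same records could answer any of the other questions in the list (author, target module, history) without moving a single file.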

Time to start a new project. Template placeholders must be replaced by literal values. If at creation I say CLIENT => 'Citizen Canine' then that's metadata about the project as a whole.
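
A minimal sketch of that replacement step, in Python for illustration only. CLIENT is taken from the text above; PROJECT_NAME and the template wording are assumptions:

```python
from string import Template

# Project-level metadata captured once at creation. CLIENT comes from the
# text above; PROJECT_NAME and its value are hypothetical.
project = {"CLIENT": "Citizen Canine", "PROJECT_NAME": "train-set"}

# A hypothetical README boilerplate with two placeholders.
readme_template = Template("$PROJECT_NAME\n\nPrepared for $CLIENT.\n")

# safe_substitute leaves unknown placeholders in place instead of raising,
# so a template can be filled in stages as metadata becomes available.
readme = readme_template.safe_substitute(project)
print(readme)
```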

At release, at installation, and in between, when the dist is hosted, browsed, and indexed, Build.PL (or your poison of choice) and much of what it creates contain essentially nothing but metadata.

What are all those dirs doing in your project tree? Are they there to permit you to have multiple files with the same name? Why not put them all together? Sometimes the filesystem hierarchy is the simplest way to tell tools where to look. So prove looks for test scripts in t/ and not in lib/. Is not "I am a test script" an attribute of a file? What exactly does it mean that a script is found in bin/? "I am a user executable." I see all kinds of metadata crammed into filepaths like Devel-Comments-v1.1.4/t/dc-and-ok-for-vanilla/42-simple-for.t
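
To make the point concrete, here is a sketch (Python, for illustration) that pulls the implied attributes back out of that filepath. The naming conventions it assumes (a "Dist-vVERSION" top-level dir, a role-bearing second component, a leading sequence number) are my reading of the example, not rules stated anywhere:

```python
import re
from pathlib import PurePosixPath

# Assumed conventions: the top-level dir is "<Dist>-v<version>", the second
# component names the file's role, and a leading number orders test scripts.
ROLES = {"t": "test script", "bin": "user executable", "lib": "library module"}

def attributes_from_path(path):
    """Recover the metadata that a filepath carries implicitly."""
    parts = PurePosixPath(path).parts
    attrs = {}
    dist = re.fullmatch(r"(.+)-v(\d+(?:\.\d+)*)", parts[0])
    if dist:
        attrs["dist"], attrs["version"] = dist.group(1), dist.group(2)
    if len(parts) > 2:
        attrs["role"] = ROLES.get(parts[1], "unknown")
    seq = re.match(r"(\d+)-", parts[-1])
    if seq:
        attrs["sequence"] = int(seq.group(1))
    return attrs

attrs = attributes_from_path(
    "Devel-Comments-v1.1.4/t/dc-and-ok-for-vanilla/42-simple-for.t")
print(attrs)
```

Four attributes fall out of one path, and every tool that wants them has to reinvent this parsing.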

Project-specific config files may contain all manner of things, some of which I wouldn't call project metadata, some I might. But every project needs to be able to find its config files; more precisely, needs to write and read them. Where a project maintains its data and in what format is metadata.

During development there's a continuous need to create new files and insert them into the current project. The developer should not have to tell the project management tool anything twice; it should already know. While we're on that topic, note that some elements, such as author name and email, chosen license, and boilerplate style, are metadata about the developer, perhaps even the developer team.

Use Case

Given a single, consistent project metadata store, I begin a project by writing some of that metadata first. From this, stub files and a skeleton tree are generated, including a number of working, failing tests. I might write a first approximation of the API and so produce, at one stroke, skeleton routines, API tests, and POD synopsis. Any project description I write is mirrored between POD and README (with appropriate changes in formatting).

As I edit the project code, I'm also generating metadata. When I use a module, that module is added to Build.PL, POD, and README. Ideally the metadata for each such used module is read to flesh out the project code and metadata. Minimum required version can be found by asking when a given feature was implemented... and when it started to work.
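
A sketch of the first half of that idea: scanning Perl source for use lines and recording them in the metadata. It is written in Python for illustration; the "requires" key and the pragma list are assumptions, and a regex is only a rough approximation of real Perl parsing:

```python
import re

# Matches "use Module::Name;" or "use Module::Name 1.23 ...;" at the start of
# a line. "use 5.010" never matches because a module name must start with a
# letter or underscore. This is an approximation, not a Perl parser.
USE_LINE = re.compile(r"^\s*use\s+([A-Za-z_][\w:]*)(?:\s+v?(\d[\d._]*))?", re.M)

# Hypothetical, incomplete list of pragmas that are not dependencies.
PRAGMAS = {"strict", "warnings", "utf8", "lib", "base", "parent", "constant"}

def record_dependencies(source, metadata):
    """Add every used module to metadata['requires'] (an assumed key)."""
    requires = metadata.setdefault("requires", {})
    for module, version in USE_LINE.findall(source):
        if module in PRAGMAS:
            continue
        # Prefer a version named in the source; otherwise keep what is known.
        requires[module] = version or requires.get(module, "0")
    return metadata

perl_source = """\
use strict;
use warnings;
use 5.010;
use Getopt::Long 2.38;
use Test::More tests => 3;
"""
meta = record_dependencies(perl_source, {})
print(meta)
```

From a store like this, the dependency sections of Build.PL, POD, and README could all be regenerated rather than edited by hand.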

During the tight edit-test, edit-code loop all results are stored. In case of a regression I consult this metadata to find out which feature stopped working and when. Some integration with version control is helpful here; I can say on which branch the feature works.
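
As a sketch of what consulting that history could look like (Python, for illustration; the log layout, commit ids, branch names, and feature name are all invented):

```python
# Hypothetical run log, oldest first: (commit, branch, feature, passed).
HISTORY = [
    ("a1f3", "master", "parse_args", True),
    ("b7c9", "master", "parse_args", True),
    ("c2d4", "master", "parse_args", False),
    ("d8e0", "master", "parse_args", False),
    ("e5f1", "experimental", "parse_args", True),
]

def first_regression(history, feature, branch):
    """Return (last_good, first_bad) commits for a feature, or None."""
    last_good = None
    for commit, run_branch, run_feature, passed in history:
        if run_feature != feature or run_branch != branch:
            continue
        if passed:
            last_good = commit
        elif last_good is not None:
            return last_good, commit
    return None

def branches_where_passing(history, feature):
    """Branches whose most recent recorded run of the feature passed."""
    latest = {}
    for commit, branch, run_feature, passed in history:
        if run_feature == feature:
            latest[branch] = passed   # later entries overwrite earlier ones
    return sorted(b for b, ok in latest.items() if ok)

regression = first_regression(HISTORY, "parse_args", "master")
working = branches_where_passing(HISTORY, "parse_args")
print(regression, working)
```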

Some say the project's test suite is the only reliable documentation; but it's not always easy to read. My project is better documented because, as I create tests and implement functions to make them pass, I also create documentation from the generated metadata. It's less work and more reliable than writing docs by hand.

Those pesky config files are created first in the project dir. Perhaps I write a few alternatives. Each is tagged to identify its purpose: foo.conf is for novice users of my project; bar.conf is an example power user config file. Later, at run time, my user consults the metadata to select. He also decides to create a few alternates of his own, which he tags and files away.

I host my project on GitHub; project metadata shows other developers what I'm doing... and what I think I'm doing and will do. If I pull a contributor's patch then his metadata travels with it. Documentation is updated; every place where his name should be mentioned, it is. Far down the road it will be possible to say who did what.

When I release, the metadata needed to instruct the tarballing is already available. This same metadata goes into the tarball so CPAN has a head start on indexing.

During installation (including install testing) and at user run time, project metadata stands ready to inform the user what went wrong where and makes bug report submission a snap. I pull bug reports (together with their metadata) and see quickly where the issue may lie. If a report doesn't supply a failing test, I can generate one from the metadata store.

Finally, I can use the correspondences present in the metadata in reverse to deconstruct project elements into generic boilerplate templates... for use in the next project.

Meta Model

A simple project metadata model needs to be written. This is not it, yet; this is just a stab at it.

I imagine a filesystem in which every file and dir has an associated metadata file. Every application, every tool, would know exactly where to go look for metadata about a given file, subdir, project, user, machine, or team.

The blatantly obvious way to do this would be to store a dot-file (a hidden file on Unix-like systems) in the same dir as the subject; but this might grow ugly quickly -- dot-files and dot-folders are too numerous as it is. Ideally, there would be a less kludgy method of designating metafiles -- a completely new flag. Of course, one might hope that as the standard gained adoption there would be fewer of these dot-files.

Perhaps a reasonable, portable compromise (not requiring every OS to conform to a new filesystem) would be to store all metafiles within a dir in a single subdir:

projects/
    .meta/
        self
    train-set/
        .meta/
            self
        lib/
            .meta/
                self
            Train/
                .meta/
                    self
                    Set.pm
                Set.pm
        t/
            .meta/
                self
            bad/
                .meta/
                    self
                    fubar.t
                fubar.t
            good/
                .meta/
                    self
                    bar.t
                    foo.t
                bar.t
                foo.t

Note that there's no requirement for every metafile to exist; merely the locations where they may be found are defined. The rules are simple: For any file ./foo, its metafile may be found at ./.meta/foo -- if it exists. Every dir bar/'s metafile can be found at bar/.meta/self. Additional metadata may be found by going up the path all the way to /.meta.
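
These lookup rules are small enough to sketch directly (Python, for illustration). One assumption goes beyond the text: when going up the path, the sketch takes each ancestor dir to contribute its own .meta/self.

```python
from pathlib import PurePosixPath

def metafile_for(path, is_dir=False):
    """Where the metafile for `path` may live (it need not exist)."""
    p = PurePosixPath(path)
    # A dir carries its own metadata inside itself; a file's metadata sits
    # in the .meta/ subdir next to it, under the file's own name.
    return p / ".meta" / "self" if is_dir else p.parent / ".meta" / p.name

def lookup_chain(path, is_dir=False):
    """Candidate metafile locations, most specific first, up to /.meta/self."""
    p = PurePosixPath(path)
    chain = [metafile_for(p, is_dir)]
    # Assumption: each ancestor directory contributes its own .meta/self.
    for ancestor in p.parents:
        chain.append(ancestor / ".meta" / "self")
    return [str(c) for c in chain]

for location in lookup_chain("/projects/train-set/t/good/foo.t"):
    print(location)
```

A tool would try each location in order and simply skip the ones that do not exist.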

Strictly, the metafile that's the flip side of foo/bar/ should be found in foo/.meta/bar; and that's viable until you get to /. I elect bar/.meta/self both to eliminate the special root case and on grounds that when tarballed a dir should somehow carry its metadata along with it.

There is a weakness here I won't attempt to obscure: some rogue application or crazy user might create a regular file path/self, which would require a path/.meta/self metafile conflicting with the file intended to store metadata about path/ itself. There are obvious workarounds. A truly clean solution requires tighter integration with the OS (with every OS) and although I sincerely desire such a thing, that amounts to a One Ring fantasy.

Turf Wars

When a single metafile is writable by many different tools, chaos might ensue but for three simple rules:

Any tool may read all metadata.
Any tool may write 'private' metadata to its own section of any metafile.
Any tool may write to the 'public' section.

So a build tool may blithely search the entire metafile of a project looking for some PROJECT_NAME but if it decides to assign a private value to that attribute it must namespace it as build:PROJECT_NAME.
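
A sketch of how an interface might enforce this namespacing (Python, for illustration). The class and method names are hypothetical, and letting a tool's private value shadow the public one on read is my assumption; the rules above only say who may write where.

```python
class Metafile:
    """In-memory stand-in for one metafile with public and private sections."""

    def __init__(self):
        # Keys are "NAME" for public values or "tool:NAME" for private ones.
        self.data = {}

    def read(self, key, tool=None):
        """Any tool may read anything. Assumption: a tool's own private
        value, if present, shadows the public value of the same name."""
        if tool is not None and f"{tool}:{key}" in self.data:
            return self.data[f"{tool}:{key}"]
        return self.data.get(key)

    def write_public(self, key, value):
        if ":" in key:
            raise ValueError("public keys carry no namespace")
        self.data[key] = value

    def write_private(self, tool, key, value):
        # The tool name becomes the namespace, e.g. build:PROJECT_NAME.
        self.data[f"{tool}:{key}"] = value

meta = Metafile()
meta.write_public("PROJECT_NAME", "train-set")
meta.write_private("build", "PROJECT_NAME", "Train-Set")
print(meta.read("PROJECT_NAME"), meta.read("PROJECT_NAME", tool="build"))
```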

No metadata will be 'locked' or 'secret'. These are best taken care of by platform-specific system permissions and other tools.

Metafile Format

YAML is human-readable and implemented across a wide range of platforms and languages; it's popular and well understood, and it supports recursive data structures and explicit typing. XML has the advantage of extremely formal schemas but it's cumbersome. JSON shares with straight Perl code the vulnerability of being directly executable by an interpreter (JavaScript, in JSON's case) if a careless consumer evaluates it instead of parsing it.

YAML's typing is explicit but it's not enforced. Rx provides schemata for YAML and this has a decent Perl interface. This may be a sufficient combination.
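
To show what enforcement means in practice, here is a hand-rolled stand-in (Python, for illustration). It is not Rx and does not use Rx's schema language; the schema layout and attribute names are invented, and the record is assumed to have been loaded from YAML already:

```python
# Hypothetical schema: attribute name -> (required?, expected Python type).
SCHEMA = {
    "PROJECT_NAME": (True, str),
    "CLIENT": (False, str),
    "sequence": (False, int),
}

def validate(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for key, (required, expected) in schema.items():
        if key not in record:
            if required:
                problems.append(f"missing required key: {key}")
        elif not isinstance(record[key], expected):
            problems.append(f"{key}: expected {expected.__name__}")
    return problems

# A record as it might look after loading a metafile; "10" is a string
# where the schema asks for an integer.
loaded = {"PROJECT_NAME": "train-set", "sequence": "10"}
print(validate(loaded, SCHEMA))
```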

Another Way To Do It

An obvious alternative is one or more databases; perhaps SQLite. As databases go, it's lightweight.
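
For comparison, a sketch of what a single-table SQLite store might look like (Python's standard sqlite3 module, for illustration). The table and column names are assumptions, and an empty tool name is used here to mark the public section:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # a real store would be a file on disk
conn.execute("""
    CREATE TABLE metadata (
        subject TEXT NOT NULL,            -- path of the file or dir described
        tool    TEXT NOT NULL DEFAULT '', -- '' marks the public section
        name    TEXT NOT NULL,
        value   TEXT,
        PRIMARY KEY (subject, tool, name)
    )
""")
conn.executemany(
    "INSERT INTO metadata (subject, tool, name, value) VALUES (?, ?, ?, ?)",
    [
        ("train-set", "", "PROJECT_NAME", "train-set"),
        ("train-set", "build", "PROJECT_NAME", "Train-Set"),
        ("train-set/t/good/foo.t", "", "type", "unit"),
    ],
)
rows = conn.execute(
    "SELECT tool, value FROM metadata WHERE subject = ? AND name = ? ORDER BY tool",
    ("train-set", "PROJECT_NAME"),
).fetchall()
print(rows)
```

Querying is easy, but every participating tool now needs a database driver, and the metadata no longer travels with a file when it is copied or tarballed.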

My feeling is that this is not the right way to go, though I could be swayed. My thought is that to offer a metadata lingua franca, an unsophisticated marketplace accessible to all who choose to participate, the lowest possible barrier to entry is best.

Implementation

I have now spent the bulk of my Perl time just trying to upgrade my "workbench" to what I consider a usable standard; and I'm not there yet. I realize that others have simply bitten the bullet; they cobble together solutions out of existing, inadequate tools; do many tasks manually; and keep a lot of metadata in their heads. Personally, I just can't do that.

I envision one unified interface and many small tools to assist other, standard tools to plug into a project metadata system. At minimum, an interface will be provided to allow writing metadata correctly and reading it on demand. Cascading will be taken care of internally so when multiple values of the same element are available, a tool can demand the entire set or the most specific value.
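
The cascading behavior could be sketched like this (Python, for illustration). The function name, the layer ordering, and the sample attributes are assumptions; the layers stand for metafiles already loaded, ordered from most specific to most general:

```python
def cascade(layers, key, all_values=False):
    """Resolve `key` across layers ordered most specific first.

    With all_values=True return every value found, in that order;
    otherwise return only the most specific one (or None).
    """
    found = [layer[key] for layer in layers if key in layer]
    if all_values:
        return found
    return found[0] if found else None

# Hypothetical layers: file, containing dir, project, developer.
layers = [
    {"author": "contributor"},
    {},
    {"author": "maintainer", "license": "perl"},
    {"author": "developer", "email": "dev@example.com"},
]
print(cascade(layers, "author"))
print(cascade(layers, "author", all_values=True))
```

A tool that only wants one answer never needs to know how many levels were consulted.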

My interest (and my competency) begins and, perhaps, ends with Perl on Linux; but project metadata should be open to all languages and all platforms. I plan to write a standard specification and, in Perl, a reference implementation. Gradually, I'll tie in my favorite tools.

I'm well aware that I'll need to make significant progress before anyone else shows much interest. Please know that all comments will be taken very seriously.

What's in a Name?

Some say nothing, and if you're in that camp, feel free to skip. For me, a good name is everything and "standard interchangeable project metadata model" doesn't cut it. So if this sounds like an interesting concept then please, by all means, go ahead and try for a better name. You're welcome to slip me suggestions privately or anonymously.

Summary

The bulk of my efforts over the past few years have run aground on what seem to me to be issues of metadata. Now I believe I will not be content to move ahead with any project until I have a reliable method for interchanging metadata among the various stages of a project's life.

Thanks

moritz for shoving me out of the CB on this topic.
bigcheese for a few concise words at the right time.

Changes

Suggestions for improvement are welcome and will be incorporated.

2012-01-28:: - new

I'm not the guy you kill, I'm the guy you buy. —Michael Clayton
