http://www.perlmonks.org?node_id=616725

bsb has asked for the wisdom of the Perl Monks concerning the following question:

I'll outline the situation, what I'd like to achieve, how I think it could be attacked. I'm writing in the hope that this will help me find code out there that I can use, or at least to learn from people who have experience with related problems.

Existing Framework:

We provide a validation tool for a few simple fixed width data formats, it checks basic structure, domains and inter-record relationships. There's also a HTTP::Server::Simple utility to view the validation results, from a high-level or line by line. These are bundled together with PAR and deployed on Windows (although I develop on Linux). The database used is SQLite with Class::DBI::Loader generated classes and CDBI::Sweet for joining (yes, these were cutting edge at the time and no I don't want to change unless it's going to buy me a lot of time).

The formats are defined in custom YAML data dictionaries that specify the names, labels, types, lengths, domains, record layouts and relationships. The rest of the utilities configure themselves from this metadata.

See for an example ERD-thingamibob

Extension Goals:

Another stage of checking is to be added, which will be run after the existing checks. It will check logical consistency within and between records (not correcting just marking errors for display), checking against pre-supplied historical information, checking aggregates, counts and calculations against thresholds that are either predetermined or are themselves based on the file's data.

These rules will have be of varying levels of complexity and will probably have reflect the ad-hoc knowledge of the current distributed grey matter implementations. I envision the automatic validation nibbling its way up the complexity chain as more rules can be coded within it.

Some example rules:

Iff there Full-time Equivalent data has been provided for Xs there should also be Salary information for Xs

The average Salary (derived from other data) should be within a certain range.

There's plenty more.

Ideal:

I'd like a set of declarative rules, that have a machine understandable format which can be translated to SQL to checks that may find and log invalid records, but may also simply log the violation (eg. if the aggregate information is unusual). The same declarative format (maybe something like an AST) could be used to document the rules as text, images or useful error message (templated, filling in line numbers, field names, values and information links).

The rules should be able to calculate and name derived values that can be shared (and hopefully not recalculated, note that the processing stages are one-way and one at a time).

Better yet, the mini-language declaring the constraints would be intelligible by the current domain experts (non-programmers) and writable by them.

In my dreams...

Semi-Realistic Goal:

A framework that with rules that comprise of a name, documentation, a priority, maybe an error template and the constraint. The rules will probably be written in the highest level of abstraction that can express them and that exists. Until the high levels are written I'll be condemned to SQL, perhaps some rules will remain as SQL snippets due to there complexity, or to avoid performance issues.

The abstraction hierarchy may look like:

 SQL => SQL snippets => AST => Domain Specific Language

For the moment, this is all wishful thinking.

Over the next month I'll work on the lowest levels but would like to have a sketch of the higher-levels to guide me.

Ideas:

I'm looking at Tangram::Expr, QDRDBMS, SQL::Routine and Rosetta for hints There seems to be some alignment between what I'm after and the SQL AST field, but the alignment is probably not perfect.

I see a SQL AST as a good target for my to-be-invented-or-stolen DSL. The SQL AST would effectively define formulas for derived values, criteria for flagging violations, and constraints which need to be checked (different from db constraints).

The rules -> ast -> sql conversion could potentially be done at build time, therefore even Pugs and other languages can be considered.

Other ideas still floating around are Prolog, SQL::Abstract, S-expressions, DSL->SQL done using Perl or non-Perl tool chains, an Object::Declare-like Perl 5 DSL hack, Lua as a SQLite procedural language, Meta-Lua + Lua as the DSL, human-kind lives in peace and harmony forever, etc, etc, etc.

My current plans is write a few examples of varying complexity in SQL, which I'll probably have to do any way, and see what can be refactored. I'll post that when done to solicit further guidance.

Other Considerations:

The Windows deployment with PAR can be tricky to wrangle so I'm wary of big frameworks, platform dependency and C/XS, especially since I use pseudo-forking. I still have flashbacks...

Also, I should make clear that the database isn't being updated or more data inserted, so caching of intermediates is ok.

I hope that someone will be able to point me to useful code or warn me when there be dragons. Advice welcome,

Brad

Replies are listed 'Best First'.
Re: Fishing for constraint => SQL designs
by mattr (Curate) on May 22, 2007 at 14:02 UTC
    Hmmm nobody biting yet? This comes up often like in Rules-based Perl? in 2001. I even wanted something declarative a number of times myself. Kind of didn't want to answer hoping someone would have some info but okay let's get the dialog flowing.. left field: You could use AI::Prolog which is pure perl, which is a good thing. And you could use Draxler's PrologSQL compiler of sqi-prolog to sql or Ciao Prolog to SQL bridges there's the pl2sql tool.(Ciao Prolog).

    Not that I've used or recommend these. I wish there was something elegant and natural languageish. Where's theDamian when you need him.. Well okay there is something intriguing over here, CAPE which mixed CLIPS with perl. But the links are dead... ah here we are.

    As its name suggests, CAPE is a combined programming environment which allows programs to be written in Clips, a forward-chaining rule-based system which was originally developed by NASA, and Perl. Clips rules can call Perl code from either side (i.e. from either the "pattern" part, or the "action" part). Perl can add information to the Clips working memory, and can interact with objects defined using the Clips Object Oriented Language (COOL). CAPE provides mechanisms for switching information between the two programming systems, allowing the programmer to exploit the strengths of both

      Excellent. More confusing possibilities :) It's reassuring to know that others have encountered this problem. Thanks for the links.

      I also asked on the QDRDBMS list and got a reply from it's designer. The thread

Re: Fishing for constraint => SQL designs
by doom (Deacon) on May 22, 2007 at 19:13 UTC
    Can I suggest you explain a little more about what exactly it is you hope to achieve with this? It sounds like you're diving into the ORM problem, or some aspect of it, and I have to say I've never really understood the fascination with it (is the relational model so much harder to understand than the object model? Why not just do it relationally?).

    ORM: The Vietnam of Computer Science?

    (It appears that I'm an advocate of the "Table Data Gateway" design pattern.)

Re: Fishing for constraint => SQL designs
by dk (Chaplain) on May 22, 2007 at 21:24 UTC
    Did you look at DBIx::Perlish? The DSL it introduces is the perl itself, similar to Tangram's perl expressions, but implemented by parsing of perl optree (rather than operator overloading), and therefore much more flexible. I don't think it covers everything you named as the ideal solution, but still.
Re: Fishing for constraint => SQL designs
by nferraz (Monk) on May 23, 2007 at 10:32 UTC

    You could describe the rules in XML:

    <table name="foo"> <field name="a" required="1"/> <field name="b" validate="\d+\.\d+\.\d+"/> <field name="c" required="if a"/> </table>

    These could be easily translated to perl:

    package Model::Validate::Foo; sub on_insert { my %value_of = @_; if ( !$value_of{a} ) { die "a is required"; } if ( $value_of{b} !~ /\d+\.\d+\.\d+/ ) { die "invalid b"; } if ( !$value_of{c} ) { die "c is required if a" if $value_of{a}; } }

    Or, perhaps you could have constraints declared like this:

    <table name="foo"> ... <constraint name="avg_a" value="average(a) > 100"/> </table>

    Which would generate a sub like:

    sub avg_a { # called on insert, update of table foo # check if average(a) > 100 }