Giant proxy DBD for multiple databases
by baku (Scribe) on Mar 08, 2002 at 17:36 UTC
baku has asked for the
wisdom of the Perl Monks concerning the following question:
Long have I travailed in the world of MVS, and I return with stupid questions, seeking enlightenment.
This is a very long question about SQL::Statement and doing a JOIN across multiple databases, so I'm tacking a READMORE here...
Here's the trouble: we have disparate (geographically, logically, by operating system, by file structure) sets of files stored on ~50 systems, almost exclusively MVS. (We're looking at a lot of VSAM sets, which are similar to Berkeley DB, i.e. if you know the key, getting the value is easy, but if you want "column 4," you're going to have to scan the entire table; also some M204 stuff, which the NSA developed and nobody uses...) Our job is to put a query interface on top that can traverse multiple sources as if they were all in one big, happy database.
Think of it like:
(Yes, not real SQL, but just imagine.) In fact, we can expect just table.column_name syntax coming in, with an explicit mapping of table => source/(table|file).
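To make the table => source/(table|file) mapping concrete, here is a minimal sketch of a catalog lookup that takes table.column names and groups them by the system that would have to answer for them. It is written in Python only to keep it short; nothing in it is Perl-specific. Every host name, dataset name, and function name below is made up for illustration.

```python
# Hypothetical catalog: logical table name -> (source system, physical table or file).
# None of these names come from a real installation.
CATALOG = {
    "customers": ("mvs01", "PROD.CUST.VSAM"),
    "orders": ("mvs17", "ORDERS"),
}


def resolve(qualified_column):
    """Split 'table.column' and look up where that table physically lives."""
    table, sep, column = qualified_column.partition(".")
    if not sep or not table or not column:
        raise ValueError(f"expected table.column, got {qualified_column!r}")
    try:
        source, physical = CATALOG[table]
    except KeyError:
        raise LookupError(f"no source mapped for table {table!r}") from None
    return {"source": source, "physical": physical, "column": column}


def group_by_source(qualified_columns):
    """Bucket the requested columns by source so each host gets one sub-query."""
    plan = {}
    for name in qualified_columns:
        r = resolve(name)
        plan.setdefault((r["source"], r["physical"]), []).append(r["column"])
    return plan
```

Calling `group_by_source(["customers.id", "orders.cust_id", "customers.name"])` yields one entry per (source, physical) pair, which is roughly the unit of work a per-backend driver would receive.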
I told you there was a question coming, right? Disclaimer: I am not married to the idea of using Perl exclusively here... in fact, being able to present an alternative solution alongside Perl would probably bolster the case for using Perl to the bosses and customer both. (Both parties are pretty sure this Unix thing is going to fizzle and we'll all be learning SAS and COBOL in a few years. There's a profound distrust of anything that doesn't require a cleanroom. Thus, the only Unix they see is the AIX tape library attached to the 'frame that takes up 30' of wall space.) However:
It seems to me that this should be possible with SQL::Statement or DBD::AnyData or something similar, and a lot of work creating DBD::AnyData drivers for each of the back-end types (VSAM, flatfile dataset, M204, PDS, ...) that submit a simplified version of the query to the COBOL driver routines on the various hosts. (For reference: COBOL's memory allocation is essentially all static, fixed-length; as a result, it's easiest to parse things out into COBOLese fixed-length records, convert them to EBCDIC, and shove them down a TCP/IP pipe to the very exciting world of CICS+TCP/IP, which is basically a fixed-formatting version of ye olde Berkeley Sockets.) We'll have to create these query interfaces, so I'm not worried about that bit, but DBD::Proxy gives me the willies. Heavy magic.
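As a rough illustration of the "fixed-length record, converted to EBCDIC" step, here is a small sketch that pads each field to a fixed width and encodes the result with Python's built-in `cp037` codec. The record layout and field names are invented, and `cp037` (US/Canada EBCDIC) is an assumption; the real hosts may well use a different code page. Fields are space-padded like COBOL `PIC X` text; zero-padded numeric fields are not handled here.

```python
# Hypothetical record layout: (field name, width in characters).
# This is not taken from any real copybook.
LAYOUT = [("table", 8), ("key", 12), ("max_rows", 6)]

# Assumption: the hosts use code page 037; swap in the real one if it differs.
EBCDIC = "cp037"


def pack_record(values):
    """Left-justify each field to its fixed width, then encode as EBCDIC bytes."""
    parts = []
    for name, width in LAYOUT:
        text = str(values[name])
        if len(text) > width:
            raise ValueError(f"{name!r} is longer than {width} characters")
        parts.append(text.ljust(width))
    return "".join(parts).encode(EBCDIC)


def unpack_record(raw):
    """Decode EBCDIC bytes and slice them back into fields by position."""
    text = raw.decode(EBCDIC)
    fields = {}
    pos = 0
    for name, width in LAYOUT:
        fields[name] = text[pos:pos + width].rstrip()
        pos += width
    return fields
```

The bytes returned by `pack_record` are what would go down the socket; the socket handling itself is left out because it depends entirely on what the CICS side expects.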
Also, we have to keep in mind that one query could very easily return many, many gigabytes of data to the middle-level DBD:: to do the JOIN across. That means no in-core processing is allowed; we have to stick to disk for this. Parallelizing the query is also an absolute necessity.
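One standard way to join without holding either side in memory is a sort-merge join: each source's result is sorted by the join key (on disk, if need be) and the two sorted streams are then walked in step. The sketch below shows only the merge step, assuming both inputs are already sorted and that rows are tuples with the join key in the first position. It is not a claim about how SQL::Statement or DBD::AnyData do joins, and the function name is made up.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right, key=itemgetter(0)):
    """Inner-join two iterables of tuple rows that are already sorted by `key`.

    Both inputs are consumed lazily, so they can be file readers or network
    streams. Only the right-hand rows sharing the current key are buffered.
    """
    done = (None, None)  # sentinel: this side has no more key groups
    left_groups = groupby(left, key)
    right_groups = groupby(right, key)
    lk, lrows = next(left_groups, done)
    rk, rrows = next(right_groups, done)
    while lrows is not None and rrows is not None:
        if lk < rk:
            lk, lrows = next(left_groups, done)
        elif lk > rk:
            rk, rrows = next(right_groups, done)
        else:
            # Same key on both sides: emit every left/right combination.
            right_buf = list(rrows)
            for lrow in lrows:
                for rrow in right_buf:
                    yield lrow + rrow
            lk, lrows = next(left_groups, done)
            rk, rrows = next(right_groups, done)
```

Memory use is bounded by the largest group of right-hand rows with the same key, not by the size of either input. Because the merge only needs sorted input, the data could in principle be split by key range and each range merged by a separate worker, which is one possible route to the parallelism mentioned above.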
Am I even looking in the right place? Is there a commercial widget out there that can do the hard bits of this (performing the "join")?
As an aside, I'd love to hear about it if anyone's ever tried accessing MVS datasets from Perl on the 'frame itself (preferably, running in batch-perl under JES/MVS, but even running in a Linux LPAR would be useful.)
For reference, our "small" test dataset is about 100 billion rows in DB2 on the 'frame.
Missing the *n*x universe more every day I have to reallocate a dataset to allow it to grow...
Baku Kurae