http://www.perlmonks.org?node_id=169207


in reply to XML for databases?!?! Is it just me or is the rest of the world nutz?

I'd like to agree with and extend the last comment by Starky. In the biological sciences we have a lot of different types of data to describe: DNA sequence data, transcribed messenger RNAs, proteins, etc. Then we have a load of different types of experiments that we do, such as measuring levels of transcription, levels of proteins in differing kinds of cells, and so on. We want to be able to compare all these differing types of data in a flexible way. Scientists need to mix and match data as we see fit, to support our differing ideas and hypotheses. Finally there are a bunch of different tools that we use. Some of these have reasonably common output formats; others have formats all their own. Some tools have been around for a very long time; some will be released tomorrow.

As you can probably appreciate, the ability to mix and match data and tools in a very flexible way is paramount in research. So XML and all its ilk are pretty useful to us.

So what I'm mainly seeing is the use of databases that each store one kind of data, e.g. a sequence database, a genome database, a transcription database, etc. Then there might be a series of annotation-based databases holding comments on or analyses of the primary data. Rather than creating one big database, folk use DTDs to describe how the data in the different databases relate to each other, and generate XML output that can in turn be parsed and fed into differing combinations of analysis tools to support new and changing ideas. This approach allows greater flexibility in querying data, reduces the need to tinker with database schemas so much and generally makes life easier.
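To make that pattern concrete, here is a minimal sketch in Python. It is an illustration only, not code from any real pipeline: the two dictionaries stand in for a separate sequence database and annotation database, and the element names (records, record, sequence, annotation) are invented for the example. The standard-library parser used here does not validate against a DTD, so that step is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for two separate databases that share an accession ID.
SEQUENCE_DB = {"SEQ001": "ATGGCC", "SEQ002": "ATGTTTAAA"}
ANNOTATION_DB = {"SEQ001": ["kinase domain"], "SEQ002": []}


def export_records(accessions):
    """Combine rows from both sources into one XML document (as a string)."""
    root = ET.Element("records")
    for acc in accessions:
        rec = ET.SubElement(root, "record", accession=acc)
        ET.SubElement(rec, "sequence").text = SEQUENCE_DB[acc]
        # Annotations are optional; a record may have none.
        for note in ANNOTATION_DB.get(acc, []):
            ET.SubElement(rec, "annotation").text = note
    return ET.tostring(root, encoding="unicode")


def sequence_lengths(xml_text):
    """A toy downstream 'analysis tool' that only needs to understand the XML."""
    root = ET.fromstring(xml_text)
    return {
        rec.get("accession"): len(rec.findtext("sequence"))
        for rec in root.iter("record")
    }


if __name__ == "__main__":
    document = export_records(["SEQ001", "SEQ002"])
    print(document)
    print(sequence_lengths(document))
```

The point of the sketch is that the downstream tool never touches either database or its schema; it depends only on the agreed XML layout, so tools and data sources can be swapped independently.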

So concerning your post, I would think that if I were working with a fairly simple system, I would be a lot less inclined to put in the effort to develop an XML-based data exchange system. If I were going to be working on something that I would like to be widely used by other groups, I would consider going to XML. If I were going to be working on a large project involving several databases, some of which were off site, and using a combination of local and remote tools, I would be using XML.

Having written all this, what I'm curious about is whether anyone has experience with trying to use XML in very large projects. Much of this work has been done on a relatively small scale so far. If you were going to be working with gigabyte or terabyte amounts of data, would XML scale well as a distribution method to pass data between different programs? For instance, a mass spectroscopy center would be generating several million data points daily, each data point having 10 to 20 keys and values. An expression center might generate similar amounts of data. You would need to store these data in databases and then schlep some or all of them to downstream programs for analysis. Would an XML-based data exchange mechanism cope well in this type of situation? What would be the drawbacks apart from bandwidth?
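For anyone who wants to experiment with the scenario, here is a rough Python sketch of the kind of exchange I have in mind. Everything in it is assumed for illustration: the run/point/k0...k9 layout is made up, and an in-memory buffer stands in for a file or network stream. It writes data points with 10 keys each and then reads them back with an incremental parser, so the whole document never has to be loaded as one tree. It is not a benchmark and says nothing about how this behaves at terabyte scale.

```python
import io
import xml.etree.ElementTree as ET


def write_points(stream, n_points, n_keys=10):
    """Write n_points synthetic data points, each with n_keys child elements."""
    stream.write(b"<run>")
    for i in range(n_points):
        fields = "".join(f"<k{j}>{i * j}</k{j}>" for j in range(n_keys))
        stream.write(f"<point id='{i}'>{fields}</point>".encode("ascii"))
    stream.write(b"</run>")


def count_points(stream):
    """Stream through the document, handling one <point> at a time."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "point":
            count += 1
            # Drop this point's children and text once it has been handled.
            # The emptied <point> element itself still hangs off the root.
            elem.clear()
    return count


if __name__ == "__main__":
    buffer = io.BytesIO()
    write_points(buffer, 100_000)
    print("bytes written:", buffer.tell())
    buffer.seek(0)
    print("points read:", count_points(buffer))
```

Comparing the byte count against the same values written as plain delimited text would give a first feel for the markup overhead on a given data set.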

MadraghRua
yet another biologist hacking perl....
