PerlMonks  

XML for databases?!?! Is it just me or is the rest of the world nutz?

by S_Shrum (Pilgrim)
on May 24, 2002 at 04:51 UTC ( #168977=perlquestion )
S_Shrum has asked for the wisdom of the Perl Monks concerning the following question:

This isn't a Perl question so much as a question about XML and its usage; however, it applies here...read on:

Maybe it's me...I'm reading a lot these days about XML and the great things that you can do with it FOR WEB CONTENT...but I'm seeing a large trend (here at perlmonks and abroad) toward generating XML databases and I keep thinking to myself, "Why?". I've played around with an XML database I created and thought, "Heeey, thaat's greeaat...but damn that's a lot of work to generate".

Look, every time you start a new 'record' you need to wrap each 'record' and 'field' in a 'header' and then properly close it. This is a great deal of additional data that could just be omitted if the data was placed into a standard database table.
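To put a rough number on that overhead, here is a minimal sketch (Python, standard library only) that writes the same three records once as CSV and once as XML and compares the sizes. The records and field names are made up for illustration, and the ratio will vary with real data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample records; the field names are invented for this sketch.
records = [
    {"id": "1", "name": "Widget", "price": "9.99"},
    {"id": "2", "name": "Gadget", "price": "19.50"},
    {"id": "3", "name": "Doohickey", "price": "4.25"},
]
fields = ["id", "name", "price"]


def to_csv(rows):
    """Serialize the rows as CSV with a single header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(rows):
    """Serialize the rows as XML, one element per record and per field."""
    root = ET.Element("records")
    for row in rows:
        rec = ET.SubElement(root, "record")
        for key in fields:
            ET.SubElement(rec, key).text = row[key]
    return ET.tostring(root, encoding="unicode")


csv_text = to_csv(records)
xml_text = to_xml(records)
print(len(csv_text), "characters as CSV")
print(len(xml_text), "characters as XML")
```

The CSV names each field once in the header; the XML names it twice per record, which is where the extra bulk comes from.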

So my question is...well..."Why do it?". I really haven't seen anything out there that really justifies using XML as a database (...just because you can do it doesn't necessarily mean you should...).

Please enlighten me if I'm wrong or jump on my soapbox (or both..it's a big box).

======================
Sean Shrum
http://www.shrum.net

Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by mojotoad (Monsignor) on May 24, 2002 at 05:32 UTC
    2-second moral of the story: archive, archive, archive with natural language decipherability if you think it's better to err on the side of preservability. And anything else you can think of that ends in "ility".

    PRIORITY ONE MESSAGE FROM STARFLEET COMMAND
    Attn: James T. Kirk, Cpt. USS Enterprise

    Greetings, James.

    Once again we are shocked at your continued violations of time protocols and your frequent flouting of edict 45899.68.45 regarding deliberate forking of the time/space continuum despite substantial evidence that these apparent paradoxes do indeed iron themselves out in the different facets of the multiverse. Until we know more about these potential effects, we severely condemn this latest breach of protocol.

    We do understand that once you found the miraculously preserved digital archives of Dr. Sean Shrum (circa 21st century Earth United States of America) there was some problem deciphering their contents, despite the best efforts of S.O. Spock and your shipboard computer. Normally, had their contents been deliberately encrypted we understand you could have used the latest quantum computing techniques to crack the code. But, since Dr. Shrum chose to use a data format that had been lost in the winds of time, decipherment proved impossible -- especially since we had no idea what sort of information was stored in his archives. Only his reputation based on the surviving seventeen cults spawned from various interpretations of his purported discoveries survived well enough to even enable us to recall his name.

    Usually in these cases, especially beginning in the 21st century, data formats were specified in "Unicode" plain text markup formats, one of the prevalent extant examples being so-called "XML". We can reliably infer from the works of his contemporaries who chose to save their compendia in XML that he was the exception in this regard. Nevertheless, the apparent loss of his archives serves as no excuse to travel back in time, seduce the man's wife, and pummel the data format out of him under duress.

    Our top researchers are still classifying the various continued bifurcations in the multiverse -- we can only assume that our colleagues in the other realities, where they still survive, are doing likewise. In the meantime, despite your methodology, we find his varied works to be extremely interesting -- algorithmically useful at best, anthropologically fascinating at worst. Please port his works to your nearest Perl 5678.6.8 Planetary cache.

    Regards,
    Fleet Commander Larry Wall XXIX

    STARFLEET COMMAND OUT
      Normally, had their contents been deliberately encrypted we understand you could have used the latest quantum computing techniques to crack the code. But, since Dr. Shrum chose to use a data format that had been lost in the winds of time, decipherment proved impossible

      Hmm... Let's see...

      # cd /var/lib/postgres/data/pg_xlog
      # strings 0000000000000011 | more
      Serial 026324 Serial 5699K11353 cn 29L02 Roof fans Mfg Greenheck Roof fan serial OOC23697 ...
      Looks like postgres (and, in my experience, most other databases) stores its data internally as plain text unless deliberately encrypted or compressed. Yeah, I'll give you that strings isn't likely to tell you the structure of the data, but getting at the content without going through the database engine is trivial.
        Yes, you squarely nailed the weak spot in my parable. I winced when I wrote that bit, but then shrugged in favor of artistic license. The idea that a lost data format could somehow be more incomprehensible than deliberately encrypted data is of course ludicrous. Matt
      Yeah, a nice story. I don't see the point though. XML isn't a magic bullet that somehow makes your data preserve into eternity. If you have data in which the content is mostly what there is, you might as well store it in a flat file without all the XML verbosity around it *. But if you require the relations, then your data is gone when the description of what the elements of your DTD mean is gone.

      And that's the same problem as losing the data format.

      * Why on earth does XML have to be so verbose? It's just LISP, it only needs a lot more characters.

      Abigail

Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by Aristotle (Chancellor) on May 24, 2002 at 05:36 UTC

    I shan’t say much here, because personally I don’t understand the point in using XML for plain ole tabular data either. I do see where XML is great when your data is deeply structured, and you’re dealing with arbitrary element trees. Then XML is great. But for tabular data? Storage in a relational database and transport via CSV with column headers fill my needs quite nicely. All else is just buzzword hype, I feel. “Ooooh look ma’, XML!!”

    Makeshifts last the longest.

      I respectfully disagree with your opinion about there not being a point in using XML over plain CSV data because you are ignoring the fact that XML messages can be checked for well-formedness and for syntax and content before the data is accepted into your database. Some of us don't like to crap up our production databases (something to do with job security). CSV on the other hand has no built-in means of validating data so you have to hardwire all edits and validations into your code.

      I think you will find that you can't ever completely trust the person who is sending you the flat file to structure the file according to the proper business rules.

      ~~~~~~~~~~~~~~~
      I like chicken.
        XML doesn't validate itself either. It requires the use of external software to validate it. So, if you're concerned about the integrity of your database and you receive CSV data, you could just as easily write external software to validate that as well.

        (If you reply with something along the lines of "But it's easier to write a DTD and feed it through a generic validator than it is to roll your own CSV validator", please provide concrete evidence to back this up. Based on the few examples I've seen, DTDs seem to be a programming language unto themselves, no less complex (and far more verbose) than the level of perl code that would be required for this task.)
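For a sense of scale on the roll-your-own side of this argument, here is a rough sketch of a hand-written structural check for CSV next to a generic well-formedness check for XML. It is written in Python rather than Perl only so that it needs nothing beyond the standard library. The function names and sample inputs are hypothetical. Note that the standard-library parser checks well-formedness only and does no DTD validation, so content rules would still need separate software in either format.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_problems(text, expected_columns):
    """Return a list of problems found by a hand-written structural CSV check."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != expected_columns:
        problems.append(f"unexpected header: {header}")
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(expected_columns):
            problems.append(
                f"line {line_no}: expected {len(expected_columns)} fields, got {len(row)}"
            )
    return problems


def xml_is_well_formed(text):
    """Well-formedness only: a generic parser can check this without knowing the data."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(csv_problems("id,name\n1,Widget,extra\n", ["id", "name"]))
print(xml_is_well_formed("<r><id>1</id></r>"), xml_is_well_formed("<r><id>1</r>"))
```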

(jeffa) Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by jeffa (Chancellor) on May 24, 2002 at 06:42 UTC
    I might not be on the same page here - XML for a database? Maybe XML instead of a database ...

    Consider making a FAQ: there is a very generic layout for a FAQ. My idea was to use an HTML::Template template file to describe the layout, and use an XML file to contain the content. Sure i could have used a database, but then i would have to write some sort of a front end to enter in the data - why not use XML to wrap my data instead? It seemed easier to me.

    Here is a stripped down version - it is not perfect and could probably use a revision, but it should be enough to demonstrate. As a matter of fact, i hear that you can even bypass having to use Perl to translate the XML directly to HTML - i'll hopefully learn more about that when my copy of Perl and XML arrives in the mail. (update - doh! i already knew about that - XSLT ;) - and so far that book is some very good reading)

    There are a total of three files - save them with the suggested names and run the .pl file to generate the final HTML. You can redirect the output to an .html file, or even modify the script to be a CGI script instead.
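The three files are not reproduced in this copy of the thread. What follows is not jeffa's code, only a minimal sketch of the same idea (content in XML, layout in a template) using Python's standard library instead of HTML::Template. The FAQ content, element names and template text are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape
from string import Template

# Hypothetical FAQ content; the element names are invented for this sketch.
FAQ_XML = """\
<faq title="Example FAQ">
  <entry>
    <question>What is XML?</question>
    <answer>A text-based markup format.</answer>
  </entry>
  <entry>
    <question>Do I need a database?</question>
    <answer>Not for a small FAQ.</answer>
  </entry>
</faq>
"""

# The layout lives in templates, separate from the content above.
PAGE = Template(
    "<html><head><title>$title</title></head><body>\n"
    "<h1>$title</h1>\n<dl>\n$entries</dl>\n</body></html>"
)
ENTRY = Template("<dt>$question</dt><dd>$answer</dd>\n")


def render_faq(xml_text):
    """Read the FAQ entries from XML and pour them into the HTML templates."""
    root = ET.fromstring(xml_text)
    entries = "".join(
        ENTRY.substitute(
            question=escape(entry.findtext("question", default="")),
            answer=escape(entry.findtext("answer", default="")),
        )
        for entry in root.findall("entry")
    )
    return PAGE.substitute(title=escape(root.get("title", "FAQ")), entries=entries)


html_page = render_faq(FAQ_XML)
print(html_page)
```

Adding a question means editing the XML in any text editor; no data-entry front end is needed, which is the convenience described above.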

Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by inblosam (Monk) on May 24, 2002 at 07:48 UTC
    If you are looking for something small and upscalable it may be the right answer (minus all the work it might take you to build and use), kind of like people using form mail to make a csv from their website. Later they can turn around and draw that into a database as they grow. XML is great for export and converting to other things because it is rather simply organized and parsed.

    But that's just my opinion.

    Michael Jensen
    michael at inshift.com
    http://www.inshift.com

      ===This part to inblosam===

      But then again isn't a tabled database "simply organized and parsed" not to mention smaller in size due to the omission of the redundant record and field markup? I could take that a step further and say that tabled data loads faster on large data sources (disk to memory) as a result vs. XML data sources.

      ===This part for everyone else===

      Everyone so far has been "portability-this" and "parsable-that". This is not a question of how one can deal with XML data sources but rather WHY one would choose such a format over a tabular format (so far I really haven't seen a reason that I couldn't apply to tabled data sources).

      Portability, converting, and parsing are not ADVANTAGES over tabular databases as the same can be said for tabled data.

      Once again: Just because you can doesn't mean you should.

      Give me some XML database PROs that CANNOT be applied to tabular databases (flat-files, etc.)

      ======================
      Sean Shrum
      http://www.shrum.net

Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by Chmrr (Vicar) on May 24, 2002 at 07:59 UTC

    One of the neater properties that XML has is the aforementioned portability. That's pretty cool, but most databases allow one to export data from them without tremendous difficulty. To me, the real reason is when you have data that doesn't fit the standard table format.

    A recent project of mine was to migrate an existing set of static HTML pages into a template-driven, dynamic set of pages. The interesting property that the existing information had was that it had content at varying levels; that is, section 1.1 had some information, while in other areas it nested as deep as 4.1.1.1.1. In addition, any section could have a quiz and/or workbook associated with it. This lent itself, overall, to a structure which would be hard to implement efficiently with standard tables. It proved much easier, conceptually, to plonk all of the data in one XML file, read it at startup, and just grab the needed data out of the data structure thus created.

    Generally, though, I would agree with you -- most databases are optimized for getting data in and out, and accessing data fast when need be. When the data structure gets hairy, though, I may ask you to pass the XML, please.
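As a rough illustration of "read it at startup and grab the needed data", here is a small sketch in Python (standard library only). The course content and element names are hypothetical, not the actual project's format. It walks sections nested to any depth and indexes them by section number.

```python
import xml.etree.ElementTree as ET

# Hypothetical content with sections nested to different depths.
COURSE_XML = """\
<course>
  <section title="Basics">
    <text>Intro text</text>
    <section title="Setup">
      <text>Setup text</text>
      <quiz>Setup quiz</quiz>
    </section>
  </section>
  <section title="Advanced">
    <section title="Details">
      <section title="Fine print">
        <text>Deeply nested text</text>
        <workbook>Exercises</workbook>
      </section>
    </section>
  </section>
</course>
"""


def index_sections(parent, prefix=""):
    """Walk nested <section> elements and map numbers like '1.1' to their content."""
    index = {}
    for position, section in enumerate(parent.findall("section"), start=1):
        number = f"{prefix}{position}"
        index[number] = {
            "title": section.get("title"),
            "text": section.findtext("text"),          # None if the section has no text
            "quiz": section.findtext("quiz"),          # optional
            "workbook": section.findtext("workbook"),  # optional
        }
        # Recurse into child sections, extending the number: 2 -> 2.1 -> 2.1.1
        index.update(index_sections(section, prefix=number + "."))
    return index


sections = index_sections(ET.fromstring(COURSE_XML))
print(sorted(sections))
print(sections["2.1.1"]["workbook"])
```

The same shape in relational tables would need a self-referencing section table plus optional quiz and workbook tables, which is the awkwardness described above.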

    perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'

Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by ajt (Prior) on May 24, 2002 at 08:22 UTC
    For my sins I used to work for Inso (the bits left are at RBI Inc), and amongst putting the squiggly line under incorrectly spelt words in MS Word, we had a family of SGML and later XML products. Some of the Monastery will remember DynaTag, DynaWeb and finally DynaBase, our SGML/XML based solutions.

    I used to work with DynaBase, and it stored XML in an eXcelon ObjectStore DB, indexed to the tag level. Basically we invented DOM, and held every document in one massive DOM tree in the DB. We could store any kind of structured data you could imagine in a huge XML tree, and find it using a version of search that was XPath aware. If you wanted to find the word foo, in every bar tag with attributes of baz, it could trawl through 2Gb of data faster than you could say Oracle.
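To make the "foo in every bar tag with a baz attribute" query concrete, here is a small sketch using the limited XPath subset in Python's standard library. The sample document is made up. This is a plain linear scan of an in-memory tree, so it shows the shape of the query only, not the tag-level indexing that made DynaBase fast.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the tag and attribute names come from the example above.
DOC = """\
<doc>
  <chapter>
    <bar baz="yes">some foo here</bar>
    <bar>foo without the attribute</bar>
    <para><bar baz="no">nothing relevant</bar></para>
  </chapter>
  <bar baz="x">more foo</bar>
</doc>
"""


def find_word(root, tag, attribute, word):
    """Return every <tag> element, at any depth, that has the attribute and contains the word."""
    return [
        elem
        for elem in root.iterfind(f".//{tag}[@{attribute}]")
        if word in "".join(elem.itertext())
    ]


hits = find_word(ET.fromstring(DOC), "bar", "baz", "foo")
print([hit.get("baz") for hit in hits])
```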

    HOWEVER, if you used it to hold tabular data it would grind to a shuddering halt. It was slow to index, slow to search and slow to extract data from.

    If you are building a web content management system using human generated, arbitrarily structured content, and you need good searching tools, then an XML database is the best way to do it. If you want to hold anything with a predictable structure then a relational DB is the way to go.

    The right tools for the right job!

    My humble 2p.

Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by mirod (Canon) on May 24, 2002 at 08:40 UTC

    XML and databases includes at least 2 different aspects:

    • XML as an exchange format for data, which might come from a DB or not. You just use it as a neutral, standard format, that takes care of annoying things like character encodings. CSV is great if the data source is "close" to the recipient: if they belong to the same organization, if their changes are synchronized, if the quality of the data can be checked and if the recipient can refuse to accept it if it is bad. Otherwise it is much easier to work with XML: the fields are tagged, which means that you can tell easily if one is missing, it is generally easier to deal with changes in the format (don't process extra tags, default non-existing ones) and to work with data coming from different sources (add a simple layer to normalize the data to a single DTD/Schema). You can do this with CSV but generally XML tools make it easier for the programmer. Hence "XML for DB: use it as an exchange format". In buzzword-speak I guess this is described as: "XML is very well suited to loosely coupled applications".
    • then there are some emerging XML Data Base systems (Berkeley DB XML, XML DB, see Ron Bourret's site for more products). These are used to store XML documents natively. The problem with XML documents is that their model is a tree, which does not fit nicely in relational tables: either you define a schema for a specific DTD, and it becomes difficult to change it, and you lose one of the great benefits of XML, or you design a generic schema, but then, unless you build a whole layer on top of the RDBMS, in fact developing a dedicated XML DB, performance is really bad: your queries need too many joins to be fast. XML DBs are either built on top of OODBMS (Object Store was used a lot for SGML DBs) or on top of RDBMS, but offer a tree model of the data. This can be very convenient for things like technical documentation or even product data (complex products are usually designed in a top-down way, which leads naturally to a tree model, relational databases are used because they are stable and work well, but a lot of CAD manufacturers are moving towards OO systems to store design data). Hence DBs for XML: "Store XML data in an XML DB".
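The "don't process extra tags, default non-existing ones" point from the first item can be sketched in a few lines. This is only an illustration in Python's standard library; the field list, the defaults and the record layout are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical field list; None marks a required field, anything else is the default.
EXPECTED = {"name": None, "price": None, "currency": "USD"}


def read_record(xml_text):
    """Parse one <record>, defaulting missing optional fields and ignoring unknown tags."""
    elem = ET.fromstring(xml_text)
    record = {}
    for field, default in EXPECTED.items():
        value = elem.findtext(field)
        if value is None:
            if default is None:
                # Fields are tagged, so a missing one is easy to detect.
                raise ValueError(f"missing required field: {field}")
            value = default
        record[field] = value
    return record


# An older sender omits <currency>; a newer one adds <colour>, which is simply not read.
print(read_record("<record><name>Widget</name><price>9.99</price></record>"))
print(read_record(
    "<record><name>Gadget</name><colour>red</colour>"
    "<price>5</price><currency>EUR</currency></record>"
))
```

With positional CSV, the added column in the second record would shift every later field unless sender and recipient changed in step.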

    Does this answer your question?

Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by arunhorne (Pilgrim) on May 24, 2002 at 09:43 UTC

    My feeling is that the big point of XML is its homogeneous nature. By storing data in an XML database you instantly ensure that it will be orders of magnitude easier to access it from any device (small PDA to big blue) ... and so in this age that is a big issue.

    By using an XML database one ensures that, with a single XML library (e.g. Xerces), any device can retrieve data from a database. Let's face it... Oracle bindings for PDAs!? In addition to this, XML is plain text and therefore transportable over traditional protocols such as HTTP without further drivers.

    Some might argue that using XML for databases is overkill and a waste of space... However I need only point to the falling cost of storage space and processing power.

    Granted, for large databases indexing over an XML file could become prohibitive - particularly due to the ability of the user to arbitrarily modify data. As such it may be the case that an XML interface should be provided to a database backed by an Object-Oriented database (the Relational Model finds it increasingly hard to capture deeply structured XML). It's just such a shame Google opted for a SOAP API rather than pure XML for remote access :(

    ____________
    Arun
Re: XML for databases?!?! YES!!! With XML, XSL, and SAXON!
by Mission (Hermit) on May 24, 2002 at 13:10 UTC
    For the longest time, I had a problem with the bloat that XML had in comparison to other text data formats (CSV, etc.) so I had a difficult time understanding how XML could be a benefit. More reading uncovered some really cool stuff that XML can do that is a benefit to the web world.

    The ability to template web content (content management systems) is becoming a huge business. The concept is to separate your content (in this case XML) from the style (CSS) and from your design (a template.) Now the template concept can be with HTML::Template using XML::Parser or XML::Simple to extract, but I found a quicker way, and it was built into XML.

    The template that you create is utilizing XSL (Extensible Stylesheet Language) which is a natural parser for XML and applying HTML to its content based upon your XSL template. For any of you who have created TMPL files with HTML::Template, the XSL is almost identical to it, but you don't have to parse the XML and THEN walk the data through the template!!!

    Although that discovery was neat, it didn't help much, since you still just viewed the XML in a browser window that supported XML, and then it automatically applied the XSL to it for the display. If by chance you didn't have a browser to view the XML, then you were out of luck. It was at this time that I thought about going back to HTML::Template, but then I discovered another of XML's tools... SAX (Simple API for XML). Actually if you do a search for SAXON, it is a small program that is an interface to the SAX. Essentially you:
    saxon -o myhtml.html myxml.xml myxsl.xsl
    Which can be run from a system command from Perl, so there is no issue. You can throw an output (-o filename.html) to make a html file then hand the program the XML and XSL files.
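The "run it as a system command" step looks much the same in any scripting language. Here is a hedged sketch in Python rather than Perl; it assumes an executable named saxon is on the PATH and takes its arguments exactly as in the command shown above. The function names are made up for the example.

```python
import shutil
import subprocess


def saxon_command(xml_file, xsl_file, html_file):
    """Build the same command line as shown above, as an argument list."""
    return ["saxon", "-o", html_file, xml_file, xsl_file]


def transform(xml_file, xsl_file, html_file):
    """Run saxon if an executable of that name is on the PATH; return True on success."""
    if shutil.which("saxon") is None:
        return False  # assumption: saxon is installed under exactly this name
    result = subprocess.run(saxon_command(xml_file, xsl_file, html_file))
    return result.returncode == 0


print(" ".join(saxon_command("myxml.xml", "myxsl.xsl", "myhtml.html")))
```

Passing the arguments as a list rather than one string avoids shell quoting problems with file names that contain spaces.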

    The benefit is that now everything is separate and I've preserved my original data. I no longer have to walk back through my HTML trying to find my XML, and I didn't have to parse the XML myself. The XSL simply is a faster process than parsing and doing the HTML::Template.

    For more information on the basics of XSL go here: http://www.w3schools.com/xsl/default.asp.
    For more information on SAXON go here: http://saxon.sourceforge.net/

    BTW: XSL is markup, but you will see mention of XSLT as well... it's simply XSL Transformations, which is essentially the part that processes the files. (Just to clear up any confusion.)

    - Mission
      Separation of content (XML files in an XML database) from templates (XSL-T files in an XML database) from presentation styles (CSS, JavaScript & DHTML) is one of the most powerful and useful things a content management and application server combination can do - see also Content management system recommendations? and XSLT vs Templating?.

      However, from a Perl programmer's perspective, you don't need to use an external standalone XSL-T engine such as Saxon or Xalan (good though they both are). You can get your own application to do it itself, directly or via a library; this is how Cocoon and AxKit and many other commercial systems do it.

      From the perspective of a Perl user you can use Matt's excellent AxKit framework, or his XML::LibXSLT module directly from within your Perl code. I use XML::LibXML to manipulate XML files, template them with XSL-T, and save the output as HTML files! See Mega XSLT Batch job - best approach?, (in answer to Tilly's question, in testing on a 1Ghz Linux box, from one 1Mb XML file I was able to create over 2000 HTML pages, and associated folders in under 30 seconds!)

      If used right XML is a very good tool, just remember it's not right for everything, no matter what some people say!

      Another humble 2p

      Don't get me wrong...but all of what you mentioned can be done from flat-files...heck, even I wrote a script that allows me to multi-template & table my data from flat files (without the bulk of additional record and field tags). This is really only a concern when dealing with large data sources. The XML markup is a p.i.t.a. and incredibly redundant.

      I'll check the page on Saxon out...no guarantees that I'll convert though.

      ======================
      Sean Shrum
      http://www.shrum.net

        You can take tabulated data, and using one of many templating system you can generate something else from it. This works perfectly well, for many applications.

        If your source file is an XML file, you can have any kind of XML file you want, you can even verify it against an XML Schema or DTD (take your pick) should you want to.

        You can make a range of XSL-Templates to convert the XML file you have into another XML file. The Transformation always gives you a well-formed new XML (or XHTML) file. You can have a range of XSL-Templates for a range of devices, web browser, WAP phone, another server. If you want you can also use XSL-FO and generate a PDF file should you wish.

        Should you want you can pipeline one XML document into one round of templating after another, as each step always generated well-formed XML.

        XML, XSL-T, XSL-FO, DTD, XML-Schemas are all public standards, you can use a range of tools, on different operating systems, in different languages, and most of the time things behave the way you expect them to. Using one XML file and one XSL-template I can generate one HTML file on my Linux box and NT box, using Saxon, Xalan, LibXSLT or even MS-XML, it's pretty predictable.

        One of the key strengths of XML is that the coding and the content/templates are kept apart. By using XML, you can use any code you want, and any content and it works! As your team gets bigger it means that the coders code and the mark-up people mark-up.

        There is a good thread on templating versus XSL-T here: XSLT vs Templating? where people put many sound arguments for both sides. Matt in particular, as a strong advocate of XML, says many things that I could add here.

        I'm not saying that XML is the best solution, just that it's one solution, and for larger more complex applications, its focus on flexible structures, and its wide scale use, does make it the tool of choice.

        Update: Typos fixed, and XSLT link added

Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by webfiend (Vicar) on May 24, 2002 at 14:14 UTC

    Using XML for broad, general databases doesn't make much sense, no. It's just another wheel being reinvented - in a fairly awkward way, too.

    I can see it helping in narrow contexts, however. If your application only needs to store a particular kind of data, then a full-tilt relational database might be a little bit of overkill. The FAQ generation that was mentioned before is a perfect example. On my own, I've used XML to store configuration details for a meta-search tool, news items for a small site, and even a simple guestbook CGI.

    I prefer XML over CSV because of structure issues. I am a geek of Very Small Brain, and I like it when my data storage is self-documenting. Of course I document my own CSV files every time - it would be wrong not to :-) But there have been too many times where I examine a client's data files and find:

    • No documentation ("The 3rd column says 'yes'. 'yes' to what?")
    • Profoundly arcane documentation (Ah, the third column is for 'DO4'. Uh ... what?)

    You can still find bad or no documentation in an XML file, but generally I've had good luck coming across clearly named tags. Even in the worst cases, I've been able to figure out quite a bit from the structure of the markup. (I'm not sure what 'DO4' is, but it doesn't have anything to do with the address, because that's way over in this other element.)

    Of course, if your project has a lot of data with intricate relations that needs to scale way waay up, then you're back to relational databases. XML is convenient for data storage, yes, but it is not the best tool in all cases.

    So yeah - XML as a database for small, very specific sets makes sense to me. And yes - the world is nutz.


    "All you need is ignorance and confidence; then success is sure."-- Mark Twain
Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by tmiklas (Hermit) on May 24, 2002 at 16:35 UTC
    IMHO XML is not the best and most brilliant solution for everything around. That is true - you can describe anything you want - even create a really powerful database, but first you have to convince me :-) to do that :-)
    IMHO if I wanted to have an XML database (in my meaning of database) - with unique keys, indexes, etc. I would have a lot of code to write. Sure - I can do that, but is it worth my time?! I don't think so... I'll use some SQL database then.
    Anyway - if we are talking about the simple database (how about DB?!), then you can always use plain-text ASCII files with fields of some format.
    Which one is faster - read whole database checking for specified conditions or load everything into memory and then check/select/whatever? I don't know, but I know, that XML is the best commonly used glue to exchange data of any format! It's simple, it's universal, but writing an XML database larger than a few records is a mistake (IMvHO).
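The two approaches in that question can at least be written down side by side. This is a sketch in Python's standard library over generated sample data; the record layout is invented. It makes no claim about which is faster, only that both give the same answer, with the streaming version discarding each record once it has been checked.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical data: 1000 small records generated for the example.
XML_DATA = b"<db>" + b"".join(
    b"<rec><id>%d</id><qty>%d</qty></rec>" % (i, i % 5) for i in range(1000)
) + b"</db>"


def select_in_memory(data, wanted_qty):
    """Load the whole document into memory, then select matching records."""
    root = ET.fromstring(data)
    return [
        rec.findtext("id")
        for rec in root.findall("rec")
        if rec.findtext("qty") == wanted_qty
    ]


def select_streaming(data, wanted_qty):
    """Check each record as soon as it has been parsed, then discard its contents."""
    matches = []
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "rec":
            if elem.findtext("qty") == wanted_qty:
                matches.append(elem.findtext("id"))
            elem.clear()  # drop the record's children once it has been checked
    return matches


print(len(select_in_memory(XML_DATA, "3")), len(select_streaming(XML_DATA, "3")))
```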

    Greetz, Tom.
Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by rucker (Scribe) on May 24, 2002 at 16:39 UTC
    I imagine a lot of other people share your sentiment because I haven't seen an XML database worth using (yet), but I hope that there will be one in the future. If you don't have a reason to use an XML database, don't. That said, a few reasons come to mind why you might want to use one.

    1. You get the information in XML (perhaps via SOAP), and you use the XML (perhaps with XSLT) for your application, and you need to store it. In this case, you save the trouble of converting it twice (XML->RDBMS->XML).

    2. You have no control over the data you are receiving. Today, you might get data for a, b, and c, but tomorrow it might have a, c, x, y, and z. Also, you need to store that additional information in a way you can "easily" use.

    3. Your application is simple, but the data is complex (yet easily fits into XML). Why spend a lot of time creating a complex RDBMS scheme when an XML database could handle the job?

    Also note that (in my experience), reduced development effort is usually worth additional overhead. Even though this is obvious, I'll say it anyway: you have to weigh the trade-off between system overhead and development effort in light of specific project requirements. If we were strictly concerned with system overhead, we wouldn't be using perl at all... or XML :)

    Rucker

Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by mpeppler (Vicar) on May 24, 2002 at 17:01 UTC
    Using an XML database (or using a storage/retrieval mechanism that understands XML) is really useful if the domain of the data you need to handle is ill-defined.

    For example I just read an interesting article on FpML - a specification for data exchange for Over The Counter financial instruments between banks (swaps, FRAs, etc).

    The problem here is that because these instruments aren't traded on an exchange the instruments aren't standardized, and have different behavior and characteristics depending on who the participants in the deal are.

    With an XML-based system you can store the information regarding these instruments without having to re-define your database every other day because some smart trader has found a new way of doing a particular trade...

    This doesn't solve all the problems of course - you still have to interpret the data correctly to perform appropriate accounting/trade reporting/confirmation/etc., but it at least enables the database to store the information without recoding it.

    Michael

Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by Starky (Chaplain) on May 24, 2002 at 18:09 UTC
    Should you use XML or not?

    The answer to this question, as is the answer to so many technical questions, is, "It depends."

    The main advantages I've found to XML in practice are:

    • There are a bevy of tools available to parse, manipulate, and validate XML data. This is part of its appeal as a standard data format. Other standards would have this advantage if folks rallied behind them as they have XML, but they haven't so they don't. You don't need to do much work to parse, search, or validate an XML document. If you are trading data between applications / environments / languages, this can be _hugely_ advantageous. Its strength by simply being a widely adopted standard can't be overstated.
    • It represents hierarchical data very well. If your data is not hierarchical, then you really need to think twice before jumping on the XML bandwagon.
    • The format is (rather easily) human-readable. I realize this is not the most important aspect of the specification, but it can be very nice in practice.

    There you have it. To me, it's not much more complicated than that. If your data only needs treatment by in-house developers who understand the schema and know a few things about SQL, then it's far more trouble than it's worth. If the data is hierarchical and needs to be represented in HTML and other kinds of documents in a variety of ways or if diverse tools need to exchange data, then it is a very nice tool.

    Those are the business / technical considerations.

    The personal considerations for me have been that, like SQL, XML is something that I know I will use time and time again in a variety of situations. So when I had an opportunity to learn to use it, I jumped on it. So while you may or may not decide that it fits your business needs, your human capital will be that much more valuable if you become comfortable with it sooner rather than later.

    Hope this helps :-)

Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by MadraghRua (Vicar) on May 24, 2002 at 21:45 UTC
    I'd like to agree and extend the last comment by Starky. In the biological sciences we have a lot of different types of data to describe, DNA sequence data, transcribed messenger RNAs, proteins, etc. Then we have a load of different types of experiments that we do, such as measuring levels of transcription, levels of proteins in differing kinds of cells, and so on. We want to be able to compare all these differing types of data in a flexible way. Scientists need to mix and match data as we want to, to support our differing ideas and hypotheses. Finally there are a bunch of different tools that we use. Some of these have reasonably common output formats, some tools have very unique formats. Some tools have been around for a very long time, some will be released tomorrow.

    As you can probably appreciate, the ability to mix and match data and tools in a very flexible way is pretty paramount in research. So XML and all its ilk are pretty useful to us.

    So what I'm mainly seeing is the use of databases that store one kind of data, e.g. a sequence database, a genome database, a transcription database, etc. Then there might be a series of annotation-based databases: comments on or analyses of the primary data. Rather than creating one big database, folk use DTDs to describe the relationships of the data in the different databases to each other, to create an XML output that can in turn be parsed and fed into differing combinations of analysis tools to support new and changing ideas. This approach allows greater flexibility in querying data, reduces the need to tinker with database schemas so much and generally makes life easier.

    So concerning your post, I would think that if I were working with a fairly simple system, I would be a lot less inclined to put in the effort to develop an XML-based data exchange system. If I were going to be working on something that I would like to be widely used by other groups, I would consider going to XML. If I were going to be working on a large project involving several databases, some of which were off site, and using a combination of local and remote tools, I would be using XML.

    Having written all this, what I'm curious about is whether anyone has experience with trying to use XML in very large projects. Much of this work has been done on a relatively small scale so far. If you were going to be working with gigabyte or terabyte amounts of data, would XML scale well as a distribution method to pass data between different programs? For instance, a mass spectroscopy center would be generating several million data points daily, each data point having 10 to 20 keys and values. An expression center might generate similar amounts of data. You would need to store these data in databases and then schlep some or all of it to downstream programs for analysis. Would an XML-based data exchange mechanism cope well in this type of situation? What would be the drawbacks apart from bandwidth?

    MadraghRua
    yet another biologist hacking perl....

Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by Dr. Mu (Hermit) on May 25, 2002 at 17:10 UTC
    My own timid excursion into XML was made for two reasons:

    1. Tied hashes between two systems (e.g. my hosted website and my local mirror) can be incompatible, depending on the db engines employed.
    2. Tied hashes normally encompass only one level. Hashes containing arrays and other hashes can be cumbersome to implement.

    What I was hoping to achieve was a portable, readable file capable of containing a complex Perl data structure, that I could slurp into memory, manipulate, and write back out.

    XML::Simple is the putative answer to these goals. It was pretty simple, but not brain-dead simple. Some of the arrays embedded in my hash have only one element. Apparently XML, by itself, is unable to distinguish between a single-element array and a scalar, so there is no a priori one-to-one correspondence between plain XML and a Perl data structure. XML::Simple gives you a way to force certain elements to be arrays when the XML file is read, but this amount of finagling was contrary to my objectives.
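
    The single-element ambiguity described above can be sketched as follows. This is only an illustration, assuming XML::Simple is installed; the element names (site, host, alias) are made up for the example, and KeyAttr is emptied so that no folding by key attribute gets in the way.

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical document: <alias> occurs exactly once.
my $xml = '<site><host>mirror</host><alias>www</alias></site>';

# Default behaviour: a lone child element comes back as a plain scalar.
my $plain = XMLin($xml, KeyAttr => []);

# ForceArray names the elements that must always be read as arrays.
my $forced = XMLin($xml, KeyAttr => [], ForceArray => ['alias']);

print 'plain:  ', (ref($plain->{alias})  || 'scalar'), "\n";
print 'forced: ', (ref($forced->{alias}) || 'scalar'), "\n";
```

    The catch is that the reader has to know in advance which elements are really lists; nothing in the XML file itself carries that information.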

    Would I use XML again? Probably not as a general-purpose embodiment of a Perl data structure. For that, I would look around for another format, another module -- or write my own. But my mind is still open for other applications. With this much smoke and heat, there's gotta be a fire somewhere!

      I can't really say much about XML, since I don't know very much about it, but I *will* say that Data::Dumper has served me well for what you are talking about, Dr. Mu.
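
      A minimal sketch of that Data::Dumper round trip, assuming the file is one you wrote yourself (reloading uses eval, which runs whatever code the file contains, so it is not suitable for untrusted input). The structure and the scratch file name are invented for the example.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Indent   = 1;   # compact but still readable layout
$Data::Dumper::Sortkeys = 1;   # stable key order between runs

# Hypothetical nested structure; note the single-element array.
my $data = {
    owner   => 'mu',
    mirrors => ['local'],
    limits  => { depth => 2 },
};

my $file = "dumper_demo.$$";   # hypothetical scratch file name

# Write the structure out as Perl source: $VAR1 = { ... };
open my $out, '>', $file or die "Cannot write $file: $!";
print {$out} Dumper($data);
close $out or die "Cannot close $file: $!";

# Slurp the file back in.
open my $in, '<', $file or die "Cannot read $file: $!";
my $text = do { local $/; <$in> };
close $in;

# Dumper's default output assigns to $VAR1, so declare it before eval.
my $copy = do {
    my $VAR1;
    eval $text;
    die "Reload failed: $@" if $@;
    $VAR1;
};

print ref($copy->{mirrors}), "\n";   # the one-element array is still an array
unlink $file;
```

      Because the dump is plain Perl, nested hashes and one-element arrays come back with their types intact, which is the part XML::Simple needed help with.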

      Yes, of course, updating the "database" can be a heck of a time, especially via a web server (i.e. a CGI script). Only one process can update the file at a time; otherwise you get scrambled files, race conditions appear, etc. etc.

      This means that you must lock a "lock file" before dealing with the data file, so that only one process can access the data file at a time. This works great for low-traffic sites, but have more than one access to the script every second, and serving time slows down big time, as each new request has to "get in line" to have access to the database.
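
      The lock-file pattern might look roughly like this. It is a sketch only: the file names and the update_counter helper are hypothetical, and the "database" is reduced to a single counter so the example stays short.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

my $lock_file = "demo_lock.$$";   # hypothetical lock file name
my $data_file = "demo_data.$$";   # hypothetical data file name

# Read, modify and rewrite the data file while holding an exclusive lock.
sub update_counter {
    open my $lock, '>>', $lock_file or die "Cannot open $lock_file: $!";
    flock($lock, LOCK_EX) or die "Cannot lock $lock_file: $!";   # blocks until free

    my $count = 0;
    if (open my $in, '<', $data_file) {
        $count = <$in> // 0;
        chomp $count;
        close $in;
    }
    $count++;

    open my $out, '>', $data_file or die "Cannot write $data_file: $!";
    print {$out} "$count\n";
    close $out or die "Cannot close $data_file: $!";

    close $lock;   # closing the handle releases the lock
    return $count;
}

my $final;
$final = update_counter() for 1 .. 3;
print "$final\n";

unlink $lock_file, $data_file;
```

      Every caller waits inside flock until the previous one has closed the lock handle, which is exactly the "get in line" behaviour that hurts on a busy site.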

Re: XML for databases?!?! Is it just me or is the rest of the world nutz?
by mattr (Curate) on May 26, 2002 at 11:26 UTC
    Thank you for expressing the doubt lots of people (me anyway) have been scared to voice. Yes, I think a lot of the world is totally nuts about XML this and that; it is just not the be-all and end-all. Here are some good times to think about using XML that I've come up with, though.

    • to separate design from code from style in a big web project.
    • to let you (maybe) easily do a mobile interface in the future for a smaller project too.
    • to handle changing hierarchical data structures.
    • to quickly search tree-based documents with a node-aware search paradigm (xpath), which does rock.
    • to maximize interoperability if you are sending lots of data to another party, i.e. data glue. E-commerce transactions made this kind of interchange format a holy grail some years ago.
    • to process ML-based data handed to you, including programming with a strong tree metaphor.
    • to drop tabular data into an XML db you're stuck with.
    • to work with cognitive science relational/hierarchical semantic data like grammar trees (thinking of the hypernym tree in Lingua::WordNet).
    • ditto, to work with data from cognitive science that can only be meaningfully represented or accessed in a tree-based paradigm, for example statements in predicate calculus in the OpenCyc AI project. The huge knowledge base is a morass of interrelated assertions which themselves are nested logical statements. Horrible, wonderfully neat stuff. See the java xml api for it.

    Yggdrasil, for example, is a neat-looking XML-based database; that is, it is supposed to represent data internally as tree-structured data, which would make it very good for certain applications and bad for others. I wish I had a good problem that needed me to use it. Actually I do have some hierarchical data, but it is shallow enough to use serialized objects in an ordinary object store.

    As for data interoperability, consider genome processing, which seems to be the new benchmark for large projects with changing definitions of data that would otherwise drive you insane. A poster above mentioned the use of XML in that case, at least for medium-sized projects. A different paradigm (BoulderIO, see bio.perl.org) also seems to be popular; it allows differently defined structured data sets to be processed in a pipeline system.

    It would seem that implementing too much XML too deep in your system could be real bad unless everything is XML-based. But used as a way to share schemas, could be fantastic.

    One thing I can say for sure is that I have seen some very slow XML processing systems. So display speed is a big issue for me. In particular I know of one server which uses XML to reformat HTML files for different browsers, which the developers are considering redeveloping in C++ since Java was too slow (or maybe incompetently developed, haven't seen the code myself). So you may need to make a tradeoff. My guess is that initiatives like Sleepycat's will make those kinds of products easier to develop.

    The other thing is that you may have to spend a lot of time on interface and manuals if you are going to be handing XML tools to end-users, since what they get out of it will be directly proportional to their understanding of it and its usability. I've written an introduction to xpath for end-users, which was not easy to do, and have also seen the user interface and xpath search capabilities be major competitive points in the software.
