updating big XML files

by dHarry (Abbot)
on Jul 18, 2008 at 15:02 UTC (#698641=perlquestion)

dHarry has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

A tool used to validate scientific data sets (many GB) spits out an XML file full of stuff. The XML file can get rather big (hundreds of MB). Typically several sessions are needed to validate a data set, and everything is recorded in the XML file. When, for example, errors are fixed in the data set and the tool is rerun, the XML file gets updated; at least that was the idea.

Most (if not all?) solutions use the DOM approach: slurp everything into memory as some data structure, manipulate the data structure, and write it back to disk. But with big files this is not workable.

Some of the options mentioned/thought-up:

  • put it in a relational database and let the DBMS do the work
  • put it in a native XML database
  • write custom software
  • use an XML Query implementation

Long ago, in the distant past, I created a Java-based solution, parsing large files with SAX and generating DOM trees on the fly, which were then manipulated. I must be getting senile because it seems to have vanished from my memory.

Does anybody know of a more memory-friendly (read: non-DOM), preferably XML-like, solution? I would like to use an event-based parser and update the XML file when needed. Maybe I am asking for too much?

Saludos,
dHarry

Replies are listed 'Best First'.
Re: updating big XML files
by pc88mxer (Vicar) on Jul 18, 2008 at 15:21 UTC
    Have a look at this recent thread: XML parsing - huge file strategy?.

    Personally, I'd try the DBMS approach first, especially if you need to update data. With the XML approach you can't make any "in-place edits": every time you make a change you'll have to write a new copy of the XML file.
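The DBMS route can be sketched roughly as follows. This is a minimal illustration in Python's standard library rather than Perl, and the element name `check`, its `id` and `status` attributes, and the `checks` table are all hypothetical; the real report format is not shown in this thread. The file is streamed once into SQLite, after which a fix becomes a single-row `UPDATE` instead of a rewrite of the whole file.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def load_checks(src, conn):
    """Stream hypothetical <check id=".." status="..">msg</check> records into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checks "
        "(id TEXT PRIMARY KEY, status TEXT, message TEXT)"
    )
    for _, elem in ET.iterparse(src, events=("end",)):
        if elem.tag == "check":
            conn.execute(
                "INSERT OR REPLACE INTO checks VALUES (?, ?, ?)",
                (elem.get("id"), elem.get("status"), elem.text),
            )
            # Drop the record's attributes, text and children once stored.
            # (The emptied element stays attached to its parent.)
            elem.clear()
    conn.commit()


def mark_fixed(conn, check_id):
    """Update one record in place; no other row is touched."""
    conn.execute("UPDATE checks SET status = 'fixed' WHERE id = ?", (check_id,))
    conn.commit()
```

A rerun of the validation tool would then only issue updates for the records it re-checked.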

Re: updating big XML files
by pjotrik (Friar) on Jul 18, 2008 at 15:30 UTC
    I think a DB (relational or XML) is the right approach - I don't see another way to randomly access a large XML file.

    Still, if what you have is a list of changes and you want to apply them to the large XML (I'm not sure I understand how your validation works), you may use XML::Twig to parse the file piece by piece, detect the elements you want to change, make the changes, and flush the processed pieces.
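The parse-a-piece, change it, flush it pattern looks roughly like this. The sketch uses Python's standard `xml.etree.ElementTree.iterparse` as a stand-in, not XML::Twig itself, and it assumes a layout the thread does not confirm: records are direct children of the root, the root has no text of its own, and no namespaces are used. The function name `update_stream` and its parameters are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def update_stream(src, dst, record_tag, update):
    """Copy XML from src to dst, passing each record_tag child of the root
    through update(). Only a small window of records is in memory at once."""
    context = ET.iterparse(src, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    root_tag = root.tag
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in root.attrib.items())
    dst.write(f"<{root_tag}{attrs}>\n")

    depth = 0  # nesting depth below the root
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 0:  # a direct child of the root has just closed
            if elem.tag == record_tag:
                update(elem)
            elem.tail = "\n"
            dst.write(ET.tostring(elem, encoding="unicode"))
            root.clear()  # "flush": release the children written so far
    dst.write(f"</{root_tag}>\n")
```

Note that this still writes a complete new copy of the file; it only keeps memory use flat while doing so.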

      AnyData::Format::XML

      This module allows you to create, search, modify and/or convert XML data and files by treating them as databases without having to actually create separate database files. The data can be accessed via a multi-dimensional tied hash using AnyData.pm or via DBI and SQL commands using DBD::AnyData.pm. See those modules for complete details of usage.

      The module is built on top of Michel Rodriguez's excellent XML::Twig which means that the AnyData interfaces can now include information from DTDs, be smarter about inferring data structure, reduce memory consumption on huge files, and provide access to many powerful features of XML::Twig and XML::Parser on which it is based.

      Importing options allow you to import/access/modify XML of almost any length or complexity. This includes the ability to access different subtrees as separate or joined databases.

        Thanks! I had heard of XML::Twig but was unfamiliar with the AnyData module. I will definitely give it a try.

      Thanks for the comments! An RDBMS it will be. I will definitely give XML::Twig a try. It has been on my "to do" list for some time.

      Wrt the validation: it ranges from simple (e.g. a number or date in a certain range) to complex (the Right Ascension/Declination of a satellite orbiting some celestial body at a specific time is “correct”). It cannot be captured by XML Schema or any other schema language. We have lots of IDL code and third-party libraries to validate the data.

Re: updating big XML files
by pajout (Curate) on Jul 19, 2008 at 11:32 UTC
    I propose some DB too. Here are my notes:

    - eXist, open source XML database (http://exist.sourceforge.net/).
    Used for huge documents sometimes, XQuery implemented, but, afaik, it is not transactional.

    - proprietary solution (which I am implementing for some reasons), XML data in relational database.
    Transactions + concurrency are big plus. Need for implementing DOM-like editing functions. Difficult (I have decided that it is not required for my project) implementation of XPath or XQuery. Slow parsing, inefficient data representation, but fast updating.

    - stream processing.
    It needs neither a DB nor huge RAM. You can write your own handlers to process the XML doc as a stream of events, or you can use something like STX (http://stx.sourceforge.net/).
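A handler-based stream rewrite could look something like the following. It is a sketch using Python's standard `xml.sax` module as a stand-in for any SAX-style API, not STX; the class `StatusRewriter`, the function `rewrite_status`, and the element and attribute names passed to them are hypothetical. Every event is forwarded unchanged to a writer except the start tags that need patching, so memory use does not depend on file size.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class StatusRewriter(XMLFilterBase):
    """Pass all SAX events through, replacing one attribute value on one tag."""

    def __init__(self, parent, tag, attr, old, new):
        super().__init__(parent)
        self.tag, self.attr, self.old, self.new = tag, attr, old, new

    def startElement(self, name, attrs):
        if name == self.tag and attrs.get(self.attr) == self.old:
            patched = dict(attrs)  # copy, then change the one value
            patched[self.attr] = self.new
            attrs = AttributesImpl(patched)
        super().startElement(name, attrs)


def rewrite_status(src, dst, tag, attr, old, new):
    """Stream src to dst (a text stream), patching matching start tags."""
    reader = StatusRewriter(xml.sax.make_parser(), tag, attr, old, new)
    reader.setContentHandler(XMLGenerator(dst, encoding="utf-8"))
    reader.parse(src)
```

For example, `rewrite_status(infile, outfile, "check", "status", "error", "fixed")` would copy the report while flipping every failed check to fixed; a real handler would decide per record.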

    Update: This is very interesting to me. If you want, give me more information and I will propose some solutions, or I can write some code too.

      Thanks for the comments! I know eXist; I worked with it some years ago (in a Java context). I have done mapping of XML to relational and back myself, and it can be tricky! FYI: I have a good document on the topic.

      I think the best way forward is an RDBMS. It was probably a mistake to do it in XML in the first place, especially since the files get relatively big and updates are necessary. Transactions are not an issue; the application is standalone.

        Another point of view:

        If the data structure is well defined and will not change (or will change only when a new release occurs), an SQL database can fit the data structure very well.

        But if the nature of the data structure is, for instance, "we know that the tree has the following base structure, and some nodes could be enhanced with some data from time to time", some layer keeping a generic XML structure could be better than the never-ending story of changing SQL table definitions. In this case, you can choose eXist or implement an XML layer in the RDBMS.
