ftumsh has asked for the wisdom of the Perl Monks concerning the following question:

Lo all, I will be creating 50+ MB XML files. I don't want the whole document held in memory though, for obv reasons. Are there any modules I could use to help me with this? Something a bit like XML::Twig, only in reverse? Indeed, would XML::Twig do the job, or will I have to roll my own? Thx, John

Re: creating large xml files
by holli (Abbot) on Jun 22, 2006 at 10:57 UTC
    I am creating large xml files via the Template Toolkit and the output directive hack. Works like a charm.

    holli, /regexed monk/
      can't you just output to a filehandle? (note that I'm not a TT expert, I'm just wondering)
        Hey tinita! Nice to hear from you. I'm gonna be in Berlin for the World Cup finals. I'd like to see you again. But back to the technical stuff:

        You can specify an output file when calling process(), but that doesn't solve the memory problem, because the whole output will still be concatenated into a single string. Atm the only way to make TT print directly is to use the "hack" mentioned above. However, if you need to print to an output file you can use this:
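        use warnings;
        use strict;
        use Template;
        use Template::Directive;

        # The "hack": generated template code prints each chunk to $main::OUT
        # instead of appending it to one big output string.
        $Template::Directive::OUTPUT = 'print $main::OUT ';

        our $OUT;
        open $OUT, ">", "output.txt" or die "Cannot open output.txt: $!";

        my $tmpl = Template->new();
        my $text = "<test>[%data%]</test>";
        $tmpl->process(\$text, {data=>'xxx'}) or die $tmpl->error();

        close $OUT;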

        holli, /regexed monk/
        hi holli =)

        I didn't know that the string isn't printed immediately in TT. In HTML::Template::Compiled, for example, it is.

        p.s.: message me when you're in Berlin.

Re: creating large xml files
by samtregar (Abbot) on Jun 22, 2006 at 15:40 UTC
    I've used XML::Writer to write large files - it has a simple interface and makes it hard to write a file that won't parse. If you've got more complex requirements you can use XML::SAX with XML::SAX::Writer to write large documents without holding them in memory. That would let you validate as you write with XML::Validator::Schema, for example.
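
    A minimal sketch of the XML::Writer approach, assuming a flat list of records. The file name, the element names (records, record, name) and the record count are made up for illustration; the point is only that each element goes to the filehandle as it is generated, so the document is never assembled in memory.

```perl
use strict;
use warnings;
use XML::Writer;

# Hypothetical output file; plain ASCII data keeps the example simple.
my $path = 'large.xml';
open my $fh, '>', $path or die "Cannot open $path: $!";

my $writer = XML::Writer->new(
    OUTPUT      => $fh,      # write straight to the filehandle
    DATA_MODE   => 1,        # put each element on its own line
    DATA_INDENT => 2,        # indent nested elements by two spaces
    ENCODING    => 'utf-8',
);

$writer->xmlDecl('UTF-8');
$writer->startTag('records');

# In a real program this loop would pull rows from a database cursor,
# a log file, or similar, one at a time.
for my $id (1 .. 1000) {
    $writer->startTag('record', id => $id);
    $writer->dataElement(name => "item $id");   # escapes text for us
    $writer->endTag('record');
}

$writer->endTag('records');
$writer->end();              # checks that every start tag was closed
close $fh or die "Cannot close $path: $!";
```

    Memory use stays roughly constant regardless of the number of records, since only the stack of currently open tags is tracked.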


      SAX is almost certainly the answer you want for writing big XML files. Even better, you can generate them directly from your model.

      That said, SAX generators can be fairly wordy to write, so XML::Generator might be a nice alternative.
Re: creating large xml files
by Tanktalus (Canon) on Jun 22, 2006 at 13:41 UTC

    "Obv reasons"? I'm not sure I follow. In a day and age where Java seems all too popular (I currently have one Java app running that is using 1116m virtual and 657m resident), I don't exactly follow why ~80-100m in memory should be a concern.

    Especially since the guys who wrote your OS, whether that's Windows, Linux, or BSD (Mac), or pretty much any other modern OS, have already solved the problem of using a hard disk as if it were RAM. So if you really do run out of memory and start swapping, it actually can often be faster than if you try to be sneaky. Usually, the OS will swap out some other process first while yours runs, which means that you'll get to stay all in memory.

    I suppose my suggestion is to start with what works, and worry about the optimisations later. You may not really actually need them. Do it all in memory since that's probably way simpler. Optimise it later.

      XML in Perl (using XML::Parser for instance) tends to get blown up significantly when it is stored as a hash of hashes or some other not-so-memory-efficient way.
      My guess is that ftumsh is worried about that.
      "Optimise it later." Preoptimization is evil, but there's no reason not to set a technical requirement to try to have /some/ sanity. If his machine has a gig of RAM and he needs to run in parallel under load, it's a fine technical requirement. What if he's running on a low-memory device?
        You could be right. He could need to keep memory requirements down for any of the reasons you've suggested. But doesn't the simple fact that we've started playing "guess why he needs to keep the memory requirements down" mean that it's not "for obv reasons"?

        (Don't mind me... Morning came too early today and I'm in a weird mood...)

      Optimise it later

      That's the academic answer, not the practical one. It assumes your time is worth nothing, or that optimization is easy. Neither is true.

      Ever try to get management buy-in for a total re-write of an app that wasn't written with performance designed in from the start? It's painful.

      Performance, like security, needs to be built in from the start. If it isn't, you can pray for the so-called "80%-20% Co-incidence" to save you, or you can re-write it from scratch using tighter algorithms and faster data structures. Total re-writes cost a lot of time and money; partial fixes tend to end up as cheap hacks.

      Unless you have no clue as to what you're writing, just do it right the first time, so you don't have to do it over later. Remember, if your code gets too slow, (and yes, I've seen this happen) it may actually become too slow to properly refactor. If a comparison run takes several days to run, small, incremental changes become very expensive.

      If the app is fast and tight, making it better is cheaper, because the cost of testing is cheaper; and the cost of refactoring is cheaper. For a one off script, this doesn't matter; but for a large scale project, performance is more critical than stuffy academics realize. In business, time is money.

        On the contrary. It's immensely practical. You're assuming your time is worth less than an extra GB of RAM. Assuming that ignoring all the optimisations saves you 4 hours of time in development, and about 50% (another 2 hours) in debugging, and 1GB of RAM is worth $120, you need to be paid $20/hour or less in order to justify wasting time on such an optimisation.
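
        The break-even figure above can be checked with a few lines of Perl. This is only the arithmetic from the paragraph, using its own numbers; the variable names are made up.

```perl
use strict;
use warnings;

my $ram_cost_usd = 120;   # 1 GB of RAM, as quoted above
my $dev_hours    = 4;     # development time saved by skipping the optimisation
my $debug_hours  = 2;     # roughly 50% more, saved in debugging

my $hours_saved     = $dev_hours + $debug_hours;
my $break_even_rate = $ram_cost_usd / $hours_saved;

# Above this hourly rate, buying the RAM is cheaper than the programmer's time.
printf "Break-even rate: \$%.2f/hour\n", $break_even_rate;
# prints "Break-even rate: $20.00/hour"
```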

        In actuality, most programs will take much longer than that to write optimised - even from the ground up - especially in areas where you're unsure of the optimisation required. And RAM, CPU speed, and disk speed are all getting cheaper, not more expensive.

        As I've said before, it's not performance that matters, but responsiveness. If you get it responsive without wasting time on unneeded optimisations, why spend time/money on it to get it "faster"? You're right that time is money - you gotta take into account the programmer's time/money, too. I don't know about you, but I haven't made under $20/hour since I left university. It's cheaper to buy the stick of RAM and move on to the next business problem to be solved.

      The OS is Linux. By obv reasons I mean that I have only a certain amount of RAM. To cut down on its usage for parsing XML I use XML::Twig. To parse a 50 MB XML file takes > 1 GB of memory. It's a multitasking environment, so if 100 files land at the same time it won't be long before the machine grinds to a halt. So when writing 50 MB of XML I would want to do it a chunk at a time. I could roll my own but then I have to handle all the encoding and what have you. I'll have a look at SAX.
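
      On the "XML::Twig in reverse" idea, here is a rough sketch of writing a chunk at a time, assuming each record can be built as a small standalone XML::Twig::Elt and printed before the next one is created. The file name, element names and sample text are hypothetical. The root element and XML declaration are printed by hand here, which is the part a dedicated writer module would otherwise take care of.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical output file; the sample data is plain ASCII.
my $path = 'chunks.xml';
open my $fh, '>:encoding(UTF-8)', $path or die "Cannot open $path: $!";

# The wrapper is written manually; only the records are built as elements.
print {$fh} qq{<?xml version="1.0" encoding="UTF-8"?>\n<records>\n};

for my $id (1 .. 3) {
    # Build one small element, print it, then let it go out of scope.
    my $elt = XML::Twig::Elt->new(
        record => { id => $id },      # tag name and attributes
        "Fish & chips $id",           # text content, escaped on output
    );
    $elt->print($fh);
    print {$fh} "\n";
}

print {$fh} "</records>\n";
close $fh or die "Cannot close $path: $!";
```

      Only one record element exists at any time, so memory use does not grow with the size of the output file.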