XML and file size

silent11 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: XML and file size
by roundboy (Sexton) on Jan 07, 2003 at 02:12 UTC

Keep in mind that appending to a text file, and "appending" to an XML file, are not exactly the same, because the XML file will (should?) have some internal structure. For example, suppose you choose one file per month. Then a journal file might look something like this:

<entries month="2003-01">
  <entry date="2003-01-01">
    <p>Hung over.</p>
  </entry>
  <entry date="2003-01-02">
    <p>Still hung over.</p>
  </entry>
  <entry date="2003-01-03">
    <p>Better today. Phew.</p>
    <p>Wish I didn't have to go to work, though.</p>
  </entry>
</entries>
[download]

When you create your next entry element, you'll be sticking it inside the entries element, rather than at the end of the file. So instead of opening the file for appending, you can either:

Do some creative file scanning and rewriting (bad!); or
Slurp the XML into a DOM-like structure, modify that tree, and write it all back out.

There's nothing inherently wrong with choice 2, but it'll get slower and more expensive as you add more entries. Choice 1 will get increasing more difficult and crufty as soon as you try to do something other than append entries.

As chromatic (I believe it was) mentioned, you also need to think about what you want to do with these entries. Search them? Display arbitrary sets based on date/subject/keywords/etc? If you ever want groupings other than the one you're thinking about for storage, then you may prefer to store each entry separately.

HTH!

--roundboy

[reply]
[d/l]
[select]

Re: Re: XML and file size

by gjb (Vicar) on Jan 07, 2003 at 02:23 UTC

Rather than slurping the whole XML into a DOM for appending information, a SAX approach can be used. Simply pass through everything but the closing root tag. On encountering it, emit the new node and then the closing root tag.
This is much faster and much more memory friendly.

But yes, appending to XML is expensive.

Just my 2 cents, -gjb-

[reply]

Re: Re: Re: XML and file size

by roundboy (Sexton) on Jan 07, 2003 at 18:38 UTC

Thanks, a very good point. A SAX parser is the robust way to implement "creative file scanning", and I just didn't think of it. But the point regarding this alternative remains true: namely, that as you start doing additional tasks beyond reading and appending, it'll get progressively harder to get it done.

Regardless, since the goal of the project is to learn new technologies, maybe the best approach would be this: do a little reading, and a lot of thinking, about how XML document types can be used to represent various structures, and then consider what kinds of structural relationships will exist within the journal data. Then choose a data representation, and write a schema or DTD (even if no validation is needed, it's good practice). Finally, play with the various tools including both kinds of parsers. I'd even suggest poking around with a q&d "parser" that builds on something like

my ($tag, $attrs, $body) = /<(\w+)\s+(.*?)>(.*?)</\1>/s;
[download]

--roundboy

[reply]
[d/l]

Re: XML and file size
by chromatic (Archbishop) on Jan 07, 2003 at 01:35 UTC

It depends on what kind of data you want to store. What is a journal entry? What kind of operations do you need to perform on your entries? It's hard to tell you what your best option is when you haven't yet decided what you want to do.

[reply]

Re: XML and file size
by PodMaster (Abbot) on Jan 07, 2003 at 01:51 UTC

chromatic

I'd use a date/time string to name the files, and maybe stuff a month's worth into it's own subdirectory.

As for XSL, check out http://axkit.org/ ;)

MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
** The Third rule of perl club is a statement of fact: pod is sexy.

[reply]

Re: XML and file size
by dws (Chancellor) on Jan 07, 2003 at 06:37 UTC

I'd like to implement some other technologies for the sake of gaining experience and practice. ... I ask you, my monk friends, what is the most efficent way to store theses journal entries in XML files?

If your primary goal is to gain experience, don't worry about efficiency just yet. Try things. Measure. Get a feeling for the strengths and weaknesses of various approaches. Play around.

[reply]

Re: XML and file size
by Matts (Deacon) on Jan 07, 2003 at 08:36 UTC

Generally I tend to go with the database route, because at the end of the day they're great for storing lots of bits of similar data, and I always turn stuff into XML anyway for output generation via axkit. That way I get the best of both worlds.

[reply]

Re: XML and file size
by osama (Scribe) on Jan 07, 2003 at 05:18 UTC

if your aim is only to practice XML/XSL then by all means go ahead and use either "one file per entry" or "one file"... I don't see any advantages for the rest.

Remember: You don't have to use XML just because it's a HOT Buzzword. I see many people are using new technologies just for the sake of using it.- Appending XML is expensive as a poster said, but searching in XML is much more expensive!!!!

[reply]

Re: Re: XML and file size

by grantm (Parson) on Jan 08, 2003 at 02:52 UTC

searching in XML is much more expensive!!!

I'm not sure that I agree with that. If "searching" means checking whether a word or phrase occurs in the file then the time required to search would be almost identical for XML versus plain text - assuming you use the same code for each (save for the fact that the XML file will be a bit more verbose so extra I/O might be required in some cases).

On the other hand if you want to do semantic searches (eg: does this word or phrase occur within <title> ... </title> tags?) then sure that will take more CPU cycles than a plain text match but that is merely extra cost for extra power.

[reply]

Re: Re: Re: XML and file size

by osama (Scribe) on Jan 08, 2003 at 03:29 UTC

I have nothing against XML, and it can be used to store your data in some cases, but I think it's better suited for data interchange/SOAP/Having different formats for the same data.I'm actually comparing XML files to a database, to which they are frequently offered as an alternative, storing XML in a database is another thing.

I never heard of anybody saying "I'll use XML files instead of text files"... It's mostly "Use XML and you don't need a database", I just cannot Imagine a search in 200,000 XML files looking for text in <title> tags. but imaginig "select body from pages where title like '%text%'" is easy.

I think storing you data in any type of files XML/text/CSV... is a waste of time if you have lots of data (>1000 records? le ss? more?)

[reply]

Re^4: XML and file size

by grantm (Parson) on Jan 08, 2003 at 07:22 UTC


Perl-Sensitive Sunglasses
	PerlMonks