in reply to Memory errors while processing 2GB XML file with XML::Twig on Windows 2000
Without seeing some code it is impossible to give any concrete advice. How have you taken advantage of XML::Twig to process only chunks of the XML tree? Have you considered using a purely event-driven parser like XML::Parser?
Re^2: Memory errors while processing 2GB XML file with XML::Twig on Windows 2000
by nan (Novice) on May 16, 2005 at 10:56 UTC
Hi,
Thank you for the advice. Actually, as the file is huge, some nice people suggested that I try XML::Twig, since it is more efficient. My XML case is a little bit unusual; if you don't mind, please have a look at the details below.
My XML sample file is shown below:
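(This is simplified; the r:id and about attribute names stand in for whatever the real file uses, but the shape is the same.)

    <Topic r:id="Top/Arts/Movies">
      <link r:resource="http://www.example.com/"/>
    </Topic>
    <ExternalPage about="http://www.example.com/">
      <d:Title>Example Site</d:Title>
      <d:Description>A short description of the page.</d:Description>
    </ExternalPage>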
Basically, the XML file has two kinds of parallel nodes: <Topic/> and <ExternalPage/>. If a <Topic/> has a <link/> child, a corresponding <ExternalPage/> node exists to give more detailed information about the content of that <link/>, such as <d:Title/> and <d:Description/>.
However, not every <Topic/> node has one or more <link/> children, so I need to write a loop to find out whether <link/> is a child of each <Topic/> node. If some <link/> nodes exist, I check each <ExternalPage/> to output more information.
My code is quite straightforward; in outline it looks like the following.
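(A simplified sketch: the file name is a placeholder and the <ExternalPage/> comparison is condensed.)

    use XML::Twig;

    # Build the whole document as a tree in memory ("tree mode").
    my $twig = XML::Twig->new;
    $twig->parsefile('content.rdf.u8');    # placeholder file name

    my $root = $twig->root;
    for my $topic ( $root->children('Topic') ) {
        my @links = $topic->children('link')
            or next;    # skip Topics that have no <link/> children
        for my $page ( $root->children('ExternalPage') ) {
            # match each <ExternalPage/> against @links and print
            # its <d:Title/> and <d:Description/>
        }
    }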
Thanks again,
It looks like you are using XML::Twig in "tree mode" as opposed to "stream mode", which is what I suspected. It means that your code tries to read the entire tree into memory, instead of processing it one chunk at a time, which is what stream mode is good for.
As I understand it, there are two basic approaches to parsing a tree like this. You can first build a tree object that your program can then traverse up and down as it pleases, and manipulate like any other data structure. Alternatively, you can define handlers (a.k.a. callbacks) that the parser will invoke whenever it encounters a particular condition (e.g. a particular tag) as it parses the tree. The latter ("event-driven") approach has the advantage that the parser never needs to hold the whole tree in memory; the parsing and whatever you want to do with the parsed text go hand in hand. The downside is that your program cannot backtrack to examine parts of the tree that have already been parsed.

I'm not very familiar with XML::Twig, but it appears to be a hybrid of these two approaches: it lets you install handlers that are triggered by parsing events, but it also lets your program access subtrees of the partially parsed tree. These subtrees can be manipulated as a whole and then purged from memory. This makes it possible to keep only a small part of the tree in memory, just as with any other event-driven parser such as XML::Parser, yet manipulate entire chunks of the tree (possibly the whole tree) as you could with a tree-building parser.
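To make the purge idea concrete, the basic shape of stream-style XML::Twig code is something like this (a minimal sketch; the record tag and file name are placeholders):

    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # fires each time the parser has read a complete <record> element
            record => sub {
                my ( $twig, $elt ) = @_;
                # ... work with $elt while it is still in memory ...
                $twig->purge;    # then discard everything parsed so far
            },
        },
    );
    $twig->parsefile('big_file.xml');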
Anyway, be that as it may, below is my attempt to recast your code in terms of XML::Twig handlers; see the XML::Twig docs for more details. I could not understand the point of changing directories, so that part of the code may be messed up; I commented it out so the code would run. The program installs two handlers at the time of invoking the parser's constructor, one for Topic elements and one for ExternalPage elements. The handlers communicate via a shared variable, the %links hash.
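Here is the heart of it (the print statements, and the about attribute I use to match an ExternalPage back to a Topic's link, are the parts most likely to need adjusting for your real data):

    use strict;
    use warnings;
    use XML::Twig;

    my %links;    # shared between the two handlers

    my $twig = XML::Twig->new(
        twig_handlers => {
            Topic        => \&topic,
            ExternalPage => \&extpage,
        },
    );
    $twig->parsefile( $ARGV[0] );

    sub topic {
        my ( $twig, $topic ) = @_;
        %links = ();    # forget the links of the previous Topic
        $links{ $_->att('r:resource') } = $_ for $topic->children('link');
        $twig->purge;   # free everything parsed so far
    }

    sub extpage {
        my ( $twig, $page ) = @_;
        my $url = $page->att('about');    # guess at the linking attribute
        if ( defined $url && $links{$url} ) {
            print $page->first_child_text('d:Title'),       "\n";
            print $page->first_child_text('d:Description'), "\n";
        }
        $twig->purge;
    }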
Let me know how it goes.
Hi tlm,
Thank you so much for your informative and helpful advice. So far your code runs very well, but if you don't mind, I'd like to ask you some questions about it:
1) You set twig_handlers for the two key elements and pointed them at the corresponding subroutines. What I couldn't understand is: since you never pass parameters to &topic and &extpage yourself, what is the meaning of:
my ( $twig, $topic ) = @_;
2) About %links, the hash table you created:
Does the following code mean you add each <link/> child to the hash, keyed by its att('r:resource')?
$links{ $_->att('r:resource') } = $_ for $topic->children('link');
3) About your two subroutines: I understand that you walk through the whole XML document to find the first <Topic/> node; if it has a <link/> child, you save the link information to the hash table and then examine the <ExternalPage/> elements that follow it. If it doesn't have a <link/> child, you reset the hash to empty. Here is my question: if the hash is empty, will you examine <ExternalPage/> as well? I really don't understand how these two subroutines connect with each other, or what their run order is. Does it run one <Topic/> and then all <ExternalPage/> elements, or all <Topic/> nodes first and then all <ExternalPage/> nodes?
Thanks again for your time!