http://www.perlmonks.org?node_id=457407


in reply to Re: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000
in thread Memory errors while processing 2GB XML file with XML:Twig on Windows 2000

Hi,

Thank you for the advice. Since the file is huge, some kind people suggested that I try XML::Twig because it is more efficient. My XML case is a little unusual; if you don't mind, please have a look at the details below:

My XML sample file is shown below:

Basically, the XML file has two key sibling elements: <Topic/> and <ExternalPage/>. If a <Topic/> has a <link/> child, there is a corresponding <ExternalPage/> element that gives more detailed information about that <link/>, such as <d:Title/> and <d:Description/>.

However, not every <Topic/> has <link/> children, so I need a loop that checks whether each <Topic/> has any. If it does, I check each <ExternalPage/> to output the additional information.

<RDF>
  <Topic r:id="Top">
    <catid>1</catid>
  </Topic>

  <ExternalPage about="">
    <topic>Top/</topic>
  </ExternalPage>

  <Topic r:id="Top/Arts">
    <catid>2</catid>
  </Topic>

  <Topic r:id="Top/Arts/Movies/Titles/1/10_Rillington_Place">
    <catid>205108</catid>
    <link r:resource="http://www.britishhorrorfilms.co.uk/rillington.shtml"/>
    <link r:resource="http://www.shoestring.org/mmi_revs/10-rillington-place.html"/>
  </Topic>

  <ExternalPage about="http://www.britishhorrorfilms.co.uk/rillington.shtml">
    <d:Title>British Horror Films: 10 Rillington Place</d:Title>
    <d:Description>Review which looks at plot especially the shocking features of it.</d:Description>
    <topic>Top/Arts/Movies/Titles/1/10_Rillington_Place</topic>
  </ExternalPage>

  <ExternalPage about="http://www.shoestring.org/mmi_revs/10-rillington-place.html">
    <d:Title>MMI Movie Review: 10 Rillington Place</d:Title>
    <d:Description>Review includes plot, real life story behind the film and realism in the film.</d:Description>
    <topic>Top/Arts/Movies/Titles/1/10_Rillington_Place</topic>
  </ExternalPage>
</RDF>

My code, shown below, is quite straightforward:

#!/usr/bin/perl
use warnings;
use strict;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parsefile("./content.example.txt");
my $root = $twig->root;
chdir "F:/httpserv";    # set initial directory
foreach my $topic ( $root->children('Topic') ) {
    # if <Topic/> has a <link/> child, change directory for index writing
    if ( $topic->children('link') ) {
        chdir $topic->att('r:id');
        foreach my $link ( $topic->children('link') ) {
            foreach my $extpage ( $root->children('ExternalPage') ) {
                if ( $link->att('r:resource') eq $extpage->att('about') ) {
                    print $extpage->first_child_text('d:Title'), "\n";
                    print $extpage->first_child_text('d:Description'), "\n";
                    $twig->purge;    # I'm not sure if I need to purge in each loop.
                }
            }
            $twig->purge;
        }
        $twig->purge;
        chdir "F:/httpserv";    # reset directory pointer to local root directory
    }
}
Thanks again,

Re^3: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000
by tlm (Prior) on May 17, 2005 at 04:38 UTC

    It looks like you are using XML::Twig in "tree mode" as opposed to "stream mode", which is what I suspected. It means that your code tries to read the entire tree into memory, instead of processing it one chunk at a time, which is what stream mode is good for.

    As I understand it, there are two basic approaches to parsing a tree like this. You can first build a tree object that your program can then traverse up and down as it pleases, and manipulate like any other data structure. Alternatively, you can define handlers (aka callbacks) that the parser will invoke whenever it encounters a particular condition (e.g. when it finds a particular tag) as it parses the tree. The latter ("event-driven") approach has the advantage that the parser does not need to read the whole tree into memory; the parsing and whatever you want to do with the parsed text go hand in hand. The downside is that your program cannot backtrack to examine parts of the tree that have already been parsed.

    I'm not very familiar with XML::Twig, but it appears to be a hybrid of these two basic approaches: it lets you install handlers that are triggered by parsing events, but it also lets your program access subtrees of the partially parsed tree. These subtrees can be manipulated as a whole, and then purged from memory. This makes it possible to keep only a small part of the tree in memory, as with any other event-driven parser such as XML::Parser, while still manipulating entire chunks of the tree (possibly the whole tree) as you could with a tree-building parser.

    Anyway, be that as it may, below is my attempt to re-cast your code in terms of XML::Twig handlers. See the docs for XML::Twig for more details. I could not understand the point of changing directories, so that part of the code may be messed up; I commented it out for the purpose of running the code. The program installs two handlers at the time of invoking the constructor for the parser, one for Topic elements and one for ExternalPage elements. The handlers communicate via a shared variable, the %links hash.
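A minimal sketch of the two-handler arrangement just described, not necessarily identical to the code originally posted. It assumes the sample XML from the question, parses it from an in-memory string rather than a file, leaves out the directory changes, and collects results in a hypothetical @report array in addition to printing them. The values stored in %links are simply 1, since only the keys are needed here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Shared state: r:resource URLs of the <link/> children of the most
# recently parsed <Topic/>. Only the keys matter in this sketch.
my %links;

# Hypothetical collector so the result can be inspected after parsing.
my @report;

# Called each time the parser finishes a <Topic/> element.
sub topic {
    my ( $twig, $topic ) = @_;
    %links = ();    # forget the links of the previous <Topic/>
    $links{ $_->att('r:resource') } = 1 for $topic->children('link');
    $twig->purge;   # free everything parsed so far
}

# Called each time the parser finishes an <ExternalPage/> element.
sub extpage {
    my ( $twig, $extpage ) = @_;
    my $about = $extpage->att('about');
    if ( defined $about && exists $links{$about} ) {
        my $title = $extpage->first_child_text('d:Title');
        my $desc  = $extpage->first_child_text('d:Description');
        push @report, [ $title, $desc ];
        print "$title\n$desc\n";
    }
    $twig->purge;
}

my $twig = XML::Twig->new(
    twig_handlers => {
        Topic        => \&topic,
        ExternalPage => \&extpage,
    },
);

# For a real 2GB file this would be $twig->parsefile($path) instead.
$twig->parse(<<'XML');
<RDF>
  <Topic r:id="Top"><catid>1</catid></Topic>
  <ExternalPage about=""><topic>Top/</topic></ExternalPage>
  <Topic r:id="Top/Arts/Movies/Titles/1/10_Rillington_Place">
    <catid>205108</catid>
    <link r:resource="http://www.britishhorrorfilms.co.uk/rillington.shtml"/>
    <link r:resource="http://www.shoestring.org/mmi_revs/10-rillington-place.html"/>
  </Topic>
  <ExternalPage about="http://www.britishhorrorfilms.co.uk/rillington.shtml">
    <d:Title>British Horror Films: 10 Rillington Place</d:Title>
    <d:Description>Review which looks at plot especially the shocking features of it.</d:Description>
  </ExternalPage>
  <ExternalPage about="http://www.shoestring.org/mmi_revs/10-rillington-place.html">
    <d:Title>MMI Movie Review: 10 Rillington Place</d:Title>
    <d:Description>Review includes plot, real life story behind the film and realism in the film.</d:Description>
  </ExternalPage>
</RDF>
XML
```

Because each handler purges once it is done, only the element currently being parsed (plus the small %links hash) needs to stay in memory. This relies on each ExternalPage following the Topic that links to it, as in the sample.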

    Let me know how it goes.

    the lowliest monk

      Hi tlm,

      Thank you so much for your informative and helpful advice. So far your code runs very well, but if you don't mind, I'd like to ask you some questions about it:

      1) You set twig_handlers for two key elements, each calling a corresponding subroutine. What I couldn't understand is this: since you didn't pass any parameters to &topic and &extpage, what is the meaning of:
      my ( $twig, $topic ) = @_;

      2) About %links, the hash you created:
      Does the following code mean you add an entry, keyed by att('r:resource'), for each <link/> child?
      $links{ $_->att('r:resource') } = $_ for $topic->children('link');

      3) About your two subroutines: I understand that you first walk through the XML document to find the first <Topic/> node, and if it has a <link/> child, you save the link information to the hash and then examine the <ExternalPage/> elements that follow it. If it doesn't have a <link/> child, you reset the hash to empty. Here is my question: if the hash is empty, will you still examine <ExternalPage/>? I ask because I really don't know how these two subroutines connect with each other, or in what order they run. Is it one <Topic/> and then all <ExternalPage/>, or all <Topic/> first and then all <ExternalPage/>?

      Thanks again for your time!

        What you're missing is a grasp of "event-driven" programming. It's a distinct style of programming, just as OOP is (though they are not mutually exclusive; XML::Twig is both OO and event-driven). An event-driven parser is a good example of this programming model. (It is also the norm in GUI programming.) Such a parser has a core functionality (namely, parsing text according to some syntax), but the programmer can customize it by "registering" subroutines with the parser, to be associated with specific parsing events (e.g. finding a closing tag). The parser will then invoke these pre-registered subroutines, with a pre-specified set of arguments, at the appropriate times during the parsing. These subroutines one "registers" with the parser are called "callbacks" or "handlers" [1].
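The registration idea can be illustrated without any XML at all. The following is a toy sketch, not part of XML::Twig: the names %handlers, register, and run are made up for illustration, and the "parser" merely splits a string into tokens and fires a 'number' or 'word' event for each one.

```perl
use strict;
use warnings;

# Hypothetical registry mapping an event name to a callback.
my %handlers;

# Associate a subroutine reference with an event name.
sub register {
    my ( $event, $callback ) = @_;
    $handlers{$event} = $callback;
}

# Toy "parser": classifies each whitespace-separated token and invokes
# the callback registered for that event, with a fixed argument list.
sub run {
    my ($text) = @_;
    for my $token ( split ' ', $text ) {
        my $event = $token =~ /^\d+$/ ? 'number' : 'word';
        $handlers{$event}->( $event, $token ) if $handlers{$event};
    }
}

my @seen;
register( number => sub { my ( $event, $token ) = @_; push @seen, "number:$token" } );
register( word   => sub { my ( $event, $token ) = @_; push @seen, "word:$token" } );

run('apples 3 pears 12');
print "$_\n" for @seen;
```

The caller never invokes the two anonymous subs directly; run decides when to call them and what arguments they receive, which is the same division of labour as between XML::Twig and its handlers.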

        The subs topic and extpage are two such handlers. They get invoked by the parser whenever it finishes parsing a Topic or ExternalPage section. They each receive two arguments from the parser: the XML::Twig object and the XML element that the parser just finished parsing. (This answers your first question.)

        These two subroutines run separately from each other; in other words, neither of them calls the other one. This rules out direct communication between the two subs. One way around this is for them to communicate through shared variables (i.e. %links). In this case indirect communication is necessary since extpage cannot backtrack over the XML to see what links, if any, were found by topic. In the code I wrote only the keys of %links are used; saving the actual link objects as the values corresponding to these keys is just there for some potential future use. The code would work just as well if those values were all 1, say.

        Note that these two subroutines run multiple times during the parsing operation. This is a key point. It is not the case that all the calls to topic happen first, and then all the calls to extpage. The calls to the two subroutines are interleaved, following the order in which the Topic and ExternalPage elements appear in the file.
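One way to see the call order is to have each handler log its invocation. This is a small sketch under assumptions: the XML is a made-up miniature of the sample (the example.com URLs and the ids A, B, C are hypothetical), and @order is just a log array.

```perl
use strict;
use warnings;
use XML::Twig;

my @order;    # log of handler invocations, in the order they happen

my $twig = XML::Twig->new(
    twig_handlers => {
        # Each handler receives the twig and the element just parsed.
        Topic => sub {
            my ( $t, $elt ) = @_;
            push @order, 'Topic ' . $elt->att('r:id');
            $t->purge;
        },
        ExternalPage => sub {
            my ( $t, $elt ) = @_;
            push @order, 'ExternalPage ' . $elt->att('about');
            $t->purge;
        },
    },
);

$twig->parse(<<'XML');
<RDF>
  <Topic r:id="A"><link r:resource="http://example.com/a"/></Topic>
  <ExternalPage about="http://example.com/a"/>
  <Topic r:id="B"/>
  <Topic r:id="C"><link r:resource="http://example.com/c"/></Topic>
  <ExternalPage about="http://example.com/c"/>
</RDF>
XML

print "$_\n" for @order;
```

The log comes out in document order: Topic A, its ExternalPage, Topic B, Topic C, then the ExternalPage for C.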

        ...because I really don't know how these two subroutines connect with each other, or in what order they run.

        The parser takes care of invoking the subroutines at the right time during the parsing; in this case, they get invoked once the parser finishes parsing a Topic or ExternalPage section, respectively. This all happens as the result of the call to $twig->parsefile( './sample.xml'); it is this call that sets off the whole sequence of events that ultimately cause the handlers to be invoked by the parser.

        [1] Sometimes they are also called "hooks", although I have also seen the term "hook" used to refer to the places in the source code for the parser (or whatever) where the callbacks are invoked. You can think of these "hooks" as places provided by the author of the parser where the programmer using the parser can "hang" custom code.

        Update: The first chapter of Higher-Order Perl (HOP) has a nice discussion of callbacks.

        the lowliest monk