PerlMonks  


It looks like you are using XML::Twig in "tree mode" as opposed to "stream mode", which is what I suspected. It means that your code tries to read the entire tree into memory, instead of processing it one chunk at a time, which is what stream mode is good for.

As I understand it, there are two basic approaches to parsing a document like this. In the first, the parser builds a tree object that your program can then traverse up and down as it pleases, and manipulate like any other data structure. In the second, you define handlers (aka callbacks) that the parser invokes whenever it encounters a particular condition (e.g. when it finds a particular tag) as it parses. This "event-driven" approach has the advantage that the parser does not need to read the whole tree into memory; the parsing and whatever you want to do with the parsed text go hand in hand. The downside is that your program cannot backtrack to examine parts of the tree that have already been parsed.

I'm not very familiar with XML::Twig, but it appears to be a hybrid of these two approaches: it lets you install handlers that are triggered by parsing events, and it also lets your program access subtrees of the partially parsed tree. These subtrees can be manipulated as a whole and then purged from memory. This makes it possible to keep only a small part of the tree in memory, as with a purely event-driven parser such as XML::Parser, while still manipulating entire chunks of the tree (possibly the whole tree) as you could with a tree-building parser.
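To make the hybrid approach concrete, here is a minimal sketch of handler-driven processing with purging. The catalog/item document and the @names array are invented for illustration; only the XML::Twig calls (new with twig_handlers, parse, att, first_child_text, purge) come from the module itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical miniature document, made up for this sketch.
my $xml = <<'XML';
<catalog>
  <item id="a"><name>First</name></item>
  <item id="b"><name>Second</name></item>
  <item id="c"><name>Third</name></item>
</catalog>
XML

my @names;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Invoked each time an <item> element has been parsed completely.
        item => sub {
            my ( $t, $item ) = @_;

            # The finished subtree can be queried like a small tree.
            push @names,
                $item->att('id') . '=' . $item->first_child_text('name');

            # Discard what has been parsed so far to keep memory use flat.
            $t->purge;
        },
    },
);

$twig->parse($xml);

print "$_\n" for @names;
```

Each item is available as a complete subtree inside its handler and is thrown away afterwards, so memory use should depend on the size of one item rather than on the size of the whole file.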

Below is my attempt to recast your code in terms of XML::Twig handlers. See the docs for XML::Twig for more details. I could not understand the point of changing directories, so that part of the code may be messed up; I commented it out for the purpose of running the code. The program installs two handlers when it invokes the constructor for the parser, one for Topic elements and one for ExternalPage elements. The handlers communicate via a shared variable, the %links hash.

Let me know how it goes.

#!/usr/bin/perl
use warnings;
use strict;

use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        'RDF/Topic'        => \&topic,
        'RDF/ExternalPage' => \&extpage,
    }
);

$twig->parsefile('./sample.xml');

# my $base_dir = 'F:/httpserv';
# chdir $base_dir or die "Failed to chdir to $base_dir: $!\n";

{
    # Shared by both handlers: resources linked from the current Topic.
    my %links;

    sub topic {
        my ( $twig, $topic ) = @_;
        if ( $topic->children('link') ) {
            # my $dir = $topic->att('r:id');
            # chdir $dir or die "Failed to chdir to $dir: $!\n";
            $links{ $_->att('r:resource') } = $_ for $topic->children('link');
        }
        else {
            %links = ();
        }
        $twig->purge;
    }

    sub extpage {
        my ( $twig, $extpage ) = @_;
        # Print only the pages that a Topic has linked to.
        if ( exists $links{ $extpage->att('about') } ) {
            print $extpage->first_child_text('d:Title'), "\n";
            print $extpage->first_child_text('d:Description'), "\n";
        }
        $twig->purge;
        # chdir $base_dir or die "Failed to chdir to $base_dir: $!";
    }
}

__END__
British Horror Films: 10 Rillington Place
Review which looks at plot especially the shocking features of it.
MMI Movie Review: 10 Rillington Place
Review includes plot, real life story behind the film and realism in the film.
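If you want to try the two-handler idea without the big file, here is a small self-contained variant. It is only a sketch: the inline sample document, its example.org URLs and namespace URIs, and the @out array are all invented, and it uses anonymous subs closing over %links instead of named subs in a bare block. It also stores a plain flag in the hash rather than the link element, since the element itself is purged anyway.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

# Invented sample data, loosely shaped like the Topic/ExternalPage layout.
my $xml = <<'XML';
<RDF xmlns:r="http://example.org/r" xmlns:d="http://example.org/d">
  <Topic r:id="Top/Example">
    <link r:resource="http://example.org/a"/>
  </Topic>
  <ExternalPage about="http://example.org/a">
    <d:Title>Page A</d:Title>
    <d:Description>Linked from the topic.</d:Description>
  </ExternalPage>
  <ExternalPage about="http://example.org/b">
    <d:Title>Page B</d:Title>
    <d:Description>Not linked.</d:Description>
  </ExternalPage>
</RDF>
XML

my %links;    # shared state: resources linked from the latest Topic
my @out;      # collected output lines (hypothetical, for easy checking)

my $twig = XML::Twig->new(
    twig_handlers => {
        'RDF/Topic' => sub {
            my ( $t, $topic ) = @_;
            my @link_elts = $topic->children('link');

            # Start afresh for every Topic, then record its links.
            %links = ();
            $links{ $_->att('r:resource') } = 1 for @link_elts;
            $t->purge;
        },
        'RDF/ExternalPage' => sub {
            my ( $t, $page ) = @_;
            if ( exists $links{ $page->att('about') } ) {
                push @out, $page->first_child_text('d:Title'),
                           $page->first_child_text('d:Description');
            }
            $t->purge;
        },
    },
);

$twig->parse($xml);

print "$_\n" for @out;
```

Declaring %links before the constructor call makes the order of events easier to follow, at the cost of leaving the hash visible to the rest of the file.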

the lowliest monk


In reply to Re^3: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000 by tlm
in thread Memory errors while processing 2GB XML file with XML:Twig on Windows 2000 by nan
