PerlMonks  

XML-Twig: more efficient tree processing

by zuma53 (Beadle)
on Aug 07, 2012 at 16:29 UTC ( #986026=perlquestion )
zuma53 has asked for the wisdom of the Perl Monks concerning the following question:

Hi--

I've written a script using XML-Twig and it works as planned. But as I am running the script I see a huge amount of memory being used and the processing time taking longer and longer. This is particularly true when the XML tree gets larger and larger.

Sample XML

<root>
  <typeA>...</typeA>
  <typeB>...</typeB>
  <typeC>target</typeC>
  <typeD>...</typeD>
</root>

I'm processing the tree via TwigRoots (no manipulations) by performing a top-down scan using find_nodes, then foreach'ing over each of the returned array items. Once I find <typeC> in the tree, everything above the node is irrelevant. Within <typeC>, as I loop through and process its children, I can purge each child when I proceed to the next one.

Is there a way to do this during the foreach loop?

I've tried purging/disposing of the tree when I am done using it, but none of the memory gets freed.

I guess what I'm looking for is a way to treat the XML as a file, where I can read a line at a time and process it without gulping the whole thing into memory first (which is what I think Twig is doing).

Thanks.

Re: XML-Twig: more efficient tree processing
by mirod (Canon) on Aug 07, 2012 at 16:58 UTC
    Is there a way to do this during the foreach loop?

    most likely

    That's about all I can say based on the information you gave. To really answer the question, I would need to see a little bit of code. Do you set handlers on typeC? From your explanations I half-suspect you load the entire tree in memory, but really I have no way to know.

      Here's the nutshell version of what the code does:

      my $twig = XML::Twig->new(
          TwigRoots => {
              'tables/table' => \&processColumns,
              'views/view'   => \&processColumns,
          },
      );
      $twig->parsefile($SCHEMAS);
      $twig->dispose;

      sub processColumns {
          my ( $twig, $tableTwig ) = @_;
          my @colList = $tableTwig->find_nodes("columns/column");
          foreach my $x (@colList) {
              # blah (per-column processing, elided by the poster)
          }
      }
        Do you have a $twig->purge at the end of processColumns?
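
        For readers following along, here is a minimal sketch of what a purge at the end of the handler could look like. The sample XML, the `name` attributes, the `@seen` array and the `process_columns` name are assumptions made up for illustration; the thread does not show the real schema file. It uses `get_xpath`, for which `find_nodes` is documented as an alias.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample input; the real schema file is not shown in the thread.
my $xml = <<'XML';
<schema>
  <tables>
    <table name="t1">
      <columns><column name="id"/><column name="label"/></columns>
    </table>
    <table name="t2">
      <columns><column name="id"/></columns>
    </table>
  </tables>
</schema>
XML

my @seen;    # "table.column" names, collected only to show the handler ran

my $twig = XML::Twig->new(
    # Only build subtrees for <table> elements found under <tables>.
    twig_roots => { 'tables/table' => \&process_columns },
);
$twig->parse($xml);

sub process_columns {
    my ( $twig, $table ) = @_;
    for my $col ( $table->get_xpath('columns/column') ) {
        push @seen, $table->att('name') . '.' . $col->att('name');
    }
    # Delete everything completely parsed so far (this table included)
    # without printing it, so memory does not grow with each table.
    $twig->purge;
}

print "$_\n" for @seen;
```

        With a purge in place, each table subtree should be released once its handler returns, instead of accumulating until the end of the parse.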
Re: XML-Twig: more efficient tree processing
by thonar (Monk) on Aug 07, 2012 at 20:08 UTC

    Maybe using flush. From the XML::Twig documentation on CPAN:

    ... Once the element is completely processed you can then flush it, which will output it and free the memory ...

    so something like this:

    my $twig = XML::Twig->new(
        pretty_print => 'indented',
        TwigRoots    => {
            'tables/table' => \&processColumns,
            'views/view'   => \&processColumns,
        },
    );
    $twig->parsefile($SCHEMAS);
    $twig->dispose;

    sub processColumns {
        my ( $twig, $tableTwig ) = @_;
        my @colList = $tableTwig->find_nodes("columns/column");
        foreach my $x (@colList) {
            # blah (per-column processing, elided by the poster)
        }
        $twig->flush;
    }

      I tried that. All it seems to do is spit the XML back out on the terminal (unless it's supposed to do that). I am reading the XML to extract info and then I am done, so printing the tree serves no purpose. This is like closing a file after reading it for input: there is no need to print the file when it is closed.

      Is there a way to flush silently?

        It is supposed to do that, just change:

        $twig->flush;

        to

        $twig->purge;
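
        Going one step further toward the "one record at a time" processing asked about in the question, the handler can be set on the child element itself, with a purge after each one. This is only a sketch under assumptions: the sample XML, the `name` attributes and the `@seen` array are invented for illustration. It uses `twig_handlers` rather than `TwigRoots` so that the enclosing table element is still available to the handler.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample input; the real schema file is not shown in the thread.
my $xml = <<'XML';
<schema>
  <tables>
    <table name="t1">
      <columns><column name="id"/><column name="label"/></columns>
    </table>
    <table name="t2">
      <columns><column name="id"/></columns>
    </table>
  </tables>
</schema>
XML

my @seen;    # "table.column" names, collected only to show the handler ran

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a <column> inside <columns> has been fully parsed.
        'columns/column' => sub {
            my ( $twig, $col ) = @_;
            # The enclosing <table> is still open, so it is still in memory.
            my $table = $col->parent('table');
            push @seen, $table->att('name') . '.' . $col->att('name');
            # Silently free this column and anything else already complete.
            $twig->purge;
        },
    },
);
$twig->parse($xml);

print "$_\n" for @seen;
```

        The trade-off is that the handler sees one column at a time, so anything that needs all of a table's columns together has to be accumulated by hand.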
Re: XML-Twig: more efficient tree processing
by sundialsvc4 (Monsignor) on Aug 08, 2012 at 03:03 UTC

    If the XML tree is large, and you find yourself writing a lot of logic to search through it ... I ... very(!) cautiously ... wonder if you should in fact be using XML::LibXML and using XPath expressions rather than Perl logic to navigate through the tree?

      So instead of using minimal memory with the already written XML::Twig, he should load the giant tree into memory and rewrite the program to use xpath1?

      That don't make sense

        An algorithm that is already started with XML::Twig probably cannot be changed now.   There are two purposes for using other strategies ... to keep large structures out of memory, and to avoid the necessity of procedural logic within the Perl program to traverse them.

      IMO, if you want incremental parsing with callbacks because your tree does not fit in the memory, XML::LibXML is much harder to use than XML::Twig.

        Yup, it's quite laborious, pretty much like using raw XML::Parser (or HTML::Parser).

        Whereas in Twig you'd say "I only want trees out of '/foo/bar/bar'", with LibXML (like XML::Parser) you have to build those trees yourself with XML::LibXML::SAX::Parser. There is no reason an XML::Twig-type API couldn't be built on top of XML::LibXML::SAX::Parser, though; it's just extra work, given that Twig already exists :)
