http://www.perlmonks.org?node_id=1005636


in reply to XML::Twig and threads [solved]

Yes, it doesn't make much sense to have the whole handler in a separate thread. What could be done is for the handler in the main thread to extract the data it needs, and then to hand the processing off to a separate thread. That is assuming the data can be extracted from just the current element, and that processing it doesn't change the original XML.
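
Something like this, as a rough and untested sketch (Thread::Queue does the hand-off; the element name, file name and processing step are just placeholders):

  use strict;
  use warnings;
  use threads;
  use Thread::Queue;
  use XML::Twig;

  my $queue = Thread::Queue->new;

  # the worker thread does the heavy processing on plain data,
  # never touching the twig itself
  my $worker = threads->create( sub {
      while ( defined( my $data = $queue->dequeue ) ) {
          # ... process $data here ...
      }
  });

  my $twig = XML::Twig->new(
      twig_roots => {
          managedObject => sub {
              my( $t, $elt ) = @_;
              $queue->enqueue( $elt->text );   # hand over just the extracted data
              $t->purge;                       # and free the element
          },
      },
  );
  $twig->parsefile( 'input.xml' );

  $queue->enqueue( undef );   # tell the worker there is nothing more to come
  $worker->join;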

Another option might be to split the initial XML into chunks and then to process those in parallel. xml_split, a tool that comes with XML::Twig, could do this.
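
For example, something along these lines (untested; it assumes managedObject is the repeated element you want each chunk built around, and the xml_split documentation describes how to group several of them per file):

  xml_split -c managedObject big_file.xml

The resulting parts can later be put back together with xml_merge, which also ships with XML::Twig.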

That said, it is indeed strange that it takes so long to process the data. I somehow doubt that the XML parsing is responsible for this.

Re^2: XML::Twig and threads
by grizzley (Chaplain) on Nov 26, 2012 at 15:59 UTC
    I did a simple test:
    use XML::Twig;
    use threads;

    $start = time;
    $t = XML::Twig->new( twig_roots => { managedObject => \&handle_fasade } );
    $t->parsefile( 'inputFiles/input100MB.xml' );
    print "Time: ", time - $start;

    sub handle_fasade { }
    and the output was:
    # Time: 149s, 3.5GB RAM
    # Script quits after 71s
    So you are right - 2 minutes is not much time. What worries me is 3.5GB RAM, because of further clarification in Re^2: XML::Twig and threads.
      So you are right - 2 minutes is not much time. What worries me is 3.5GB RAM

      I'll bet £1 to 1p that if you comment out the use threads;, the memory consumption will barely change.

      It is not at all uncommon for a 100MB XML file to translate into a 3.5GB RAM requirement once it has been parsed and the equivalent data structure constructed.

      The memory requirement has nothing to do with threading; it is just Perl's well-known tendency to trade memory for CPU.
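
      If the memory itself is the worry, purging each element once the handler has finished with it usually keeps XML::Twig's footprint down. A minimal, untested sketch of the same test with a purge added:

        use XML::Twig;

        my $t = XML::Twig->new(
            twig_roots => {
                managedObject => sub {
                    my( $t, $elt ) = @_;
                    # ... extract whatever is needed from $elt here ...
                    $t->purge;    # then release everything parsed so far
                },
            },
        );
        $t->parsefile( 'inputFiles/input100MB.xml' );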




        I won't take that bet - it's too obvious :)

        I've updated the question node with an example XML and the script doing the job. The input file is read again in each of a few hundred iterations. I was thinking about reading it once and then making an in-memory copy in each iteration, but that would mean swapping to HDD and definitely wouldn't speed the code up. Another way would be to store all the changes, write the altered version to a file, undo the stored changes, and so on. That would save the copying, but on the other hand, as there are many changes, the structure storing them might end up the same size as the original. The best would be to somehow write the original structure to a file line after line and replace fragments on the fly.
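
        That last idea is more or less what XML::Twig's twig_roots / twig_print_outside_roots combination is for: everything outside the handled elements is copied to the output unchanged, and each handled element can be modified, printed and then freed, so the whole document never sits in memory at once. A rough, untested sketch (the element name, output file and the edit itself are placeholders):

          use XML::Twig;

          open my $out, '>', 'output.xml' or die "Cannot write output.xml: $!";

          my $twig = XML::Twig->new(
              twig_roots               => { managedObject => \&rewrite },
              twig_print_outside_roots => $out,    # copy everything else as-is
          );
          $twig->parsefile( 'inputFiles/input100MB.xml' );

          sub rewrite {
              my( $t, $elt ) = @_;
              # ... change whatever needs changing in $elt ...
              $elt->print( $out );    # write out the (possibly modified) element
              $t->purge;              # and free the memory it used
          }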