Re: XML::Twig and threads

by mirod (Canon)
on Nov 26, 2012 at 12:52 UTC (#1005636)

in reply to XML::Twig and threads [solved]

Yes, it doesn't make much sense to have the whole handler in a separate thread. What could be done is for the handler in the main thread to extract the data it needs and then hand the processing off to a separate thread. That assumes the data can be extracted from the current element alone and that processing it doesn't change the original XML.
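
A minimal sketch of that split, assuming a thread-enabled perl and the core Thread::Queue module. The sample XML, the `p` child elements, the `id` attribute and `process_record` are all made up for illustration; only the `managedObject` element name comes from this thread. The handler stays in the main thread and passes plain data to the workers, so no twig objects cross thread boundaries.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use XML::Twig;

# Hypothetical input: only the element name 'managedObject' is from the thread.
my $xml = <<'XML';
<root>
  <managedObject id="a"><p>1</p><p>2</p></managedObject>
  <managedObject id="b"><p>3</p><p>4</p></managedObject>
  <managedObject id="c"><p>5</p></managedObject>
</root>
XML

my $work    = Thread::Queue->new;   # plain data going to the workers
my $results = Thread::Queue->new;   # plain strings coming back

# Placeholder for the expensive processing: sums the values of one record.
sub process_record {
    my ($id, @values) = @_;
    my $sum = 0;
    $sum += $_ for @values;
    return "$id=$sum";
}

# Start the workers before any twig object exists, so no parser state
# gets cloned into the threads.
my @workers = map {
    threads->create(sub {
        while (defined(my $job = $work->dequeue)) {
            $results->enqueue(process_record(@$job));
        }
    });
} 1 .. 2;

# The handler runs in the main thread and only extracts plain data.
my $twig = XML::Twig->new(
    twig_handlers => {
        managedObject => sub {
            my ($t, $elt) = @_;
            $work->enqueue([ $elt->att('id'), map { $_->text } $elt->children('p') ]);
            $t->purge;   # free the elements parsed so far
        },
    },
);
$twig->parse($xml);

$work->enqueue(undef) for @workers;   # one stop marker per worker
$_->join for @workers;

my @out;
while (defined(my $line = $results->dequeue_nb)) {
    push @out, $line;
}
@out = sort @out;
print "$_\n" for @out;
```

This only pays off if the per-record processing is much more expensive than the extraction itself; otherwise the queue traffic is likely to dominate.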

Another option might be to split the initial XML and then process the pieces in parallel. xml_split, a tool that comes with XML::Twig, could do this.
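
A rough sketch of the "process the pieces in parallel" half, using one child process per piece instead of threads. It does not call xml_split: the two chunk files, their names and their content are stand-ins written by the script itself, and `count_records` is a hypothetical placeholder for the real work.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use POSIX ();
use XML::Twig;

# Stand-ins for the pieces a splitter would produce (names are made up).
my $dir    = tempdir(CLEANUP => 1);
my %chunks = (
    'part-01.xml' => '<root><managedObject/><managedObject/></root>',
    'part-02.xml' => '<root><managedObject/></root>',
);
for my $name (keys %chunks) {
    open my $fh, '>', "$dir/$name" or die "$name: $!";
    print {$fh} $chunks{$name};
    close $fh;
}

# Placeholder work: count the records in one chunk, purging as we go.
sub count_records {
    my ($file) = @_;
    my $count = 0;
    XML::Twig->new(
        twig_roots => { managedObject => sub { $count++; $_[0]->purge } },
    )->parsefile($file);
    return $count;
}

# One child process per chunk; each reports its count through a pipe.
my %reader;
for my $name (sort keys %chunks) {
    pipe(my $r, my $w) or die "pipe: $!";
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {                      # child
        close $r;
        print {$w} count_records("$dir/$name"), "\n";
        close $w;
        POSIX::_exit(0);                  # skip the parent's cleanup handlers
    }
    close $w;                             # parent keeps only the read end
    $reader{$name} = [ $pid, $r ];
}

# Collect one line of output from each child, then reap it.
my %count;
for my $name (sort keys %reader) {
    my ($pid, $r) = @{ $reader{$name} };
    chomp(my $n = <$r>);
    close $r;
    waitpid($pid, 0);
    $count{$name} = $n;
}
print "$_: $count{$_}\n" for sort keys %count;
```

Separate processes sidestep any question of sharing parser state between threads, at the cost of passing results back as text.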

That said, it is indeed strange that it takes so long to process the data. I somehow doubt that the XML parsing is responsible for this.

Replies are listed 'Best First'.
Re^2: XML::Twig and threads
by grizzley (Chaplain) on Nov 26, 2012 at 15:59 UTC
    I did a simple test:
    use XML::Twig;
    use threads;

    $start = time;
    $t = XML::Twig->new(twig_roots => {managedObject => \&handle_fasade});
    $t->parsefile('inputFiles/input100MB.xml');
    print "Time: ", time-$start;

    sub handle_fasade{ }
    and the output was:
    # Time: 149s, 3.5GB RAM
    # Script quits after 71s
    So you are right - 2 minutes is not much time. What worries me is 3.5GB RAM, because of further clarification in Re^2: XML::Twig and threads.
      So you are right - 2 minutes is not much time. What worries me is 3.5GB RAM

      I'll bet 1 to 1p that if you comment out the use threads;, the memory consumption will barely change.

      It is not at all uncommon for a 100MB XML file to translate into a 3.6GB RAM requirement once it has been parsed and the equivalent data structure constructed.

      The memory requirement has nothing to do with threading. Just Perl's well-known tendency to trade memory for cpu.

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      RIP Neil Armstrong


        I won't bet against the obvious :)

        I've updated the question node with example XML and a script doing the job. The input file is read again in each of a few hundred iterations. I was thinking about reading it once and then making an in-memory copy in each iteration, but that would involve swapping to HDD and definitely won't speed up the code. Another way would be to store all changes, write the altered version to a file, undo the stored changes, and so on. That would save copying, but since there are many changes, the structure storing them might be the same size as the original. The best would be to somehow write the original structure to the file line by line and replace fragments on the fly.
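
One way to get close to "write the original and replace fragments on the fly" is XML::Twig's `flush`, which prints everything parsed so far and frees it. The sketch below is an assumption about how that could look for one iteration, not the script from the question node: the sample XML, the `p` elements and the `%new_value` change list are hypothetical, and the output goes to an in-memory filehandle only to keep the example self-contained.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input: only the element name 'managedObject' is from the thread.
my $xml = '<root><managedObject id="a"><p name="x">1</p></managedObject>'
        . '<managedObject id="b"><p name="x">2</p></managedObject></root>';

# Hypothetical change list for one iteration: record id => new <p> text.
my %new_value = ( b => '20' );

# In-memory output; a real run would open a file for writing instead.
open my $out_fh, '>', \my $output or die "Cannot open in-memory file: $!";

my $twig = XML::Twig->new(
    twig_handlers => {
        managedObject => sub {
            my ($t, $elt) = @_;
            my $id = $elt->att('id');
            if (exists $new_value{$id}) {
                $_->set_text($new_value{$id}) for $elt->children('p');
            }
            $t->flush($out_fh);   # write everything parsed so far, then free it
        },
    },
);
$twig->parse($xml);
$twig->flush($out_fh);            # write whatever follows the last record
close $out_fh;

print $output, "\n";
```

Only one record is held in memory at a time, and unchanged records pass straight through, so neither a full in-memory copy nor an undo log is needed. The file is still parsed once per iteration.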
