PerlMonks  

Re: XML::Twig and threads

by BrowserUk (Patriarch)
on Nov 26, 2012 at 12:58 UTC ( [id://1005639] )


in reply to XML::Twig and threads [solved]

he wants anyway to add threads to this script and speed it up.

Tell him it simply will not work. XML::Twig uses an OO interface, and sharing objects between threads, whilst possible, will never speed things up. This is because the time cost of accessing shared memory (in Perl) is far higher than that of accessing private memory.

If you (he) would care to share a (small) sample of the XML in question that shows the repetitive structure, then there is probably an effective way to process it in parallel, by breaking up the top level using non-XML parser techniques and then using several non-shared XML parser instances within threads. But it will be necessary to see a realistic sample to advise further.

The real problem here is that the design of XML requires that an XML document be treated as an indivisible entity, which -- if you stick to the XML parsing rules -- makes parallel processing of XML not just difficult, but impossible. By design. It is lamentable that XML has become so ingrained in people's psyches.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong

Replies are listed 'Best First'.
Re^2: XML::Twig and threads
by grizzley (Chaplain) on Nov 26, 2012 at 16:15 UTC
    XML is very simple. I cannot share it as it is company confidential, but it is just like:
    <object some_param="abc" other_param="def">
      <attrib1>val1</attrib1>
      <attrib5>val3</attrib5>
    </object>
    <object some_param="xxx">
      <attrib3>valx</attrib3>
      <attrib7>valy</attrib7>
    </object>
    <object some_param="xyz">
      <attrib1>valx</attrib1>
      <attrib2>valy</attrib2>
      <attrib3>valx</attrib3>
      <attrib4>valy</attrib4>
      <attrib5>valx</attrib5>
      <attrib6>valy</attrib6>
      <attrib7>valx</attrib7>
      <attrib8>valy</attrib8>
    </object>
    There are many objects (1752 in a 93MB file) and each object has a list of attributes (up to 700 in the 93MB file).

    He further clarified that his real concern is something else: namely he reads the file into memory, does alterations to some params and writes back to another file. The altered data is used to test the system, e.g. 150 different versions of a 10MB file written to one file, which is then 1.5GB. So if we can manage to insert threads into the managedObject => \&handle_fasade function, it may really help while producing output.

    A simple program reading a 100MB XML file took 2 minutes and 3.5GB of RAM, so I think his 30 hours may be an out-of-physical-memory problem. I'll add more details tomorrow.

      The first thing to say is that that is not well-formed XML. (A well-formed XML document must have exactly one top-level element.)

      That said, for the purposes of processing, that (arbitrary) XML rule works in our favour and makes writing a program that processes the large file in smallish chunks very simple:

      #! perl -slw
      use strict;
      use XML::Simple;
      use Data::Dump qw[ pp ];

      ## Read one <object>...</object> record per iteration instead of one line
      $/ = '</object>';
      while( <DATA> ) {
          last if /^\n+$/;            ## only trailing newlines remain: done
          my $xml = XMLin( $_ );      ## parse this one small chunk
          pp $xml;
      }
      __DATA__
      <object some_param="abc" other_param="def">
        <attrib1>val1</attrib1>
        <attrib5>val3</attrib5>
      </object>
      <object some_param="xxx">
        <attrib3>valx</attrib3>
        <attrib7>valy</attrib7>
      </object>
      <object some_param="xyz">
        <attrib1>valx</attrib1>
        <attrib2>valy</attrib2>
        <attrib3>valx</attrib3>
        <attrib4>valy</attrib4>
        <attrib5>valx</attrib5>
        <attrib6>valy</attrib6>
        <attrib7>valx</attrib7>
        <attrib8>valy</attrib8>
      </object>

      That produces:

      C:\test>t-XML.pl
      {
        attrib1 => "val1",
        attrib5 => "val3",
        other_param => "def",
        some_param => "abc",
      }
      { attrib3 => "valx", attrib7 => "valy", some_param => "xxx" }
      {
        attrib1 => "valx",
        attrib2 => "valy",
        attrib3 => "valx",
        attrib4 => "valy",
        attrib5 => "valx",
        attrib6 => "valy",
        attrib7 => "valx",
        attrib8 => "valy",
        some_param => "xyz",
      }

      In addition to that allowing the huge file to be processed very quickly in minimal memory, it would -- were the processing requirements of the individual chunks sufficiently taxing to warrant it -- enable multiple individual chunks to be processed in parallel with threading very easily.
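As a rough illustration of that last point, here is a minimal sketch of farming chunks out to worker threads. It assumes a perl built with threads support and uses only the core threads and Thread::Queue modules. The names process_in_parallel and process_chunk are made up for this sketch, and process_chunk is a trivial stand-in (it just counts child elements with a regex) for whatever real per-chunk work is needed. Only plain strings cross between threads, so no objects are shared.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical per-chunk work: report how many <attribN> children one
# <object> has. A real worker would create its own private parser here.
sub process_chunk {
    my ($chunk) = @_;
    my ($name)  = $chunk =~ /<object\s+some_param="([^"]*)"/;
    my $count   = () = $chunk =~ /<attrib\d+>/g;
    return ( $name // '?' ) . "=$count";
}

# Read <object> records from $fh and process them on $workers threads.
sub process_in_parallel {
    my ( $fh, $workers ) = @_;
    my $work    = Thread::Queue->new;    # chunks in, as plain strings
    my $results = Thread::Queue->new;    # results out, as plain strings

    my @pool = map {
        threads->create( sub {
            # dequeue returns undef once the queue is ended and empty
            while ( defined( my $chunk = $work->dequeue ) ) {
                $results->enqueue( process_chunk($chunk) );
            }
        } );
    } 1 .. $workers;

    {
        local $/ = '</object>';          # one record per read
        while ( my $chunk = <$fh> ) {
            next unless $chunk =~ /<object\b/;    # skip trailing whitespace
            $work->enqueue($chunk);
        }
    }
    $work->end;                          # tell the workers there is no more input
    $_->join for @pool;
    $results->end;

    my @out;
    while ( defined( my $r = $results->dequeue ) ) { push @out, $r }
    return sort @out;                    # completion order is not deterministic
}

# Demo on an in-memory sample; real use would pass a handle on the big file.
my $sample = join "\n",
    '<object some_param="abc"> <attrib1>val1</attrib1> <attrib5>val3</attrib5> </object>',
    '<object some_param="xxx"> <attrib3>valx</attrib3> </object>', '';
open my $demo_fh, '<', \$sample or die "open: $!";
print "$_\n" for process_in_parallel( $demo_fh, 2 );
```

Whether this beats the single-threaded loop depends entirely on how expensive the per-chunk work is; for cheap work the queue traffic will likely cost more than it saves.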

      But, if the example is anything like representative of the actual data, the above code will probably allow the entire file to be processed sufficiently quickly -- in a very casual test; less than 2 minutes -- that the need for considering threading disappears completely. The saving comes simply from processing the file in small chunks rather than en masse.
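For the read, alter, write-back job described earlier in the thread, the same chunking idea might look like the sketch below. It uses core Perl only. The names rewrite_objects and set_some_param are invented for illustration, and the regex edit of some_param is an assumption that only holds while the data stays as regular as the posted sample; anything less regular should go through a parser, one chunk at a time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read <object> records one at a time, let a callback alter each record,
# and write it straight out, so only one record is held in memory.
sub rewrite_objects {
    my ( $in, $out, $alter ) = @_;
    local $/ = '</object>';              # one record per read
    my $n = 0;
    while ( my $chunk = <$in> ) {
        if ( $chunk =~ /<object\b/ ) {   # leave trailing whitespace untouched
            $chunk = $alter->($chunk);
            $n++;
        }
        print {$out} $chunk;
    }
    return $n;                           # number of records altered
}

# Hypothetical alteration: replace the value of the some_param attribute.
sub set_some_param {
    my ( $chunk, $value ) = @_;
    $chunk =~ s/(<object\b[^>]*?\bsome_param=")[^"]*/$1$value/;
    return $chunk;
}

# Demo on an in-memory sample; real use would open the input and output files.
my $sample = join "\n",
    '<object some_param="abc" other_param="def"> <attrib1>val1</attrib1> </object>',
    '<object some_param="xxx"> <attrib3>valx</attrib3> </object>', '';
open my $demo_in, '<', \$sample or die "open: $!";
rewrite_objects( $demo_in, \*STDOUT, sub { set_some_param( $_[0], 'altered' ) } );
```

Producing many altered versions of one input is then a matter of calling rewrite_objects once per version with a different callback, appending to the same output handle.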



      Hello grizzley.

      namely he reads the file into memory, does alterations to some params and writes back to another file.

      I guess he is using twig_roots and "twig_print_outside_roots=>1" for that. I was also thinking of Template Toolkit when I read this post.

      regards.
