Re^2: XML::Twig and threads

by grizzley (Chaplain)
on Nov 26, 2012 at 16:15 UTC

in reply to Re: XML::Twig and threads
in thread XML::Twig and threads [solved]

The XML is very simple. I cannot share it because it is company confidential, but it looks like this:
<object some_param="abc" other_param="def">
  <attrib1>val1</attrib1>
  <attrib5>val3</attrib5>
</object>
<object some_param="xxx">
  <attrib3>valx</attrib3>
  <attrib7>valy</attrib7>
</object>
<object some_param="xyz">
  <attrib1>valx</attrib1>
  <attrib2>valy</attrib2>
  <attrib3>valx</attrib3>
  <attrib4>valy</attrib4>
  <attrib5>valx</attrib5>
  <attrib6>valy</attrib6>
  <attrib7>valx</attrib7>
  <attrib8>valy</attrib8>
</object>
There are many objects (1752 in a 93MB file), and each object has a list of attributes (up to 700 in the 93MB file).

He further clarified that his concern is something else again: he reads the file into memory, alters some params, and writes the result to another file. The altered data is used to test the system, e.g. 150 different versions of a 10MB file written to one file, which is then 1.5GB. So if we can manage to put threads into the managedObject => \&handle_fasade function, it may really help when producing the output.

A simple program reading a 100MB XML file took 2 minutes and 3.5GB of RAM, so I think his 30 hours may be an out-of-physical-memory problem. I'll add more details tomorrow.

Re^3: XML::Twig and threads
by BrowserUk (Pope) on Nov 26, 2012 at 16:36 UTC

    The first thing to say is that that is not well-formed XML. (A well-formed XML document must have a single top-level element.)

    That said, for the purposes of processing, that (arbitrary) XML rule works in our favour and makes writing a program that processes the large file in smallish chunks very simple:

    #! perl -slw
    use strict;
    use XML::Simple;
    use Data::Dump qw[ pp ];

    $/ = '</object>';
    while( <DATA> ) {
        last if /^\n+$/;
        my $xml = XMLin( $_ );
        pp $xml;
    }
    __DATA__
    <object some_param="abc" other_param="def">
      <attrib1>val1</attrib1>
      <attrib5>val3</attrib5>
    </object>
    <object some_param="xxx">
      <attrib3>valx</attrib3>
      <attrib7>valy</attrib7>
    </object>
    <object some_param="xyz">
      <attrib1>valx</attrib1>
      <attrib2>valy</attrib2>
      <attrib3>valx</attrib3>
      <attrib4>valy</attrib4>
      <attrib5>valx</attrib5>
      <attrib6>valy</attrib6>
      <attrib7>valx</attrib7>
      <attrib8>valy</attrib8>
    </object>

    That produces:

    C:\test>
    {
      attrib1 => "val1",
      attrib5 => "val3",
      other_param => "def",
      some_param => "abc",
    }
    { attrib3 => "valx", attrib7 => "valy", some_param => "xxx" }
    {
      attrib1 => "valx",
      attrib2 => "valy",
      attrib3 => "valx",
      attrib4 => "valy",
      attrib5 => "valx",
      attrib6 => "valy",
      attrib7 => "valx",
      attrib8 => "valy",
      some_param => "xyz",
    }

    As well as allowing the huge file to be processed very quickly in minimal memory, this approach would -- were the processing requirements of the individual chunks sufficiently taxing to warrant it -- make it very easy to process multiple chunks in parallel with threads.
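
    A minimal sketch of what chunk-level threading could look like, assuming a perl built with threads and the core Thread::Queue module. The names process_chunk and process_in_parallel are hypothetical, and the regex is only a stand-in for the real per-chunk work (for example a call to XMLin):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical per-chunk work. A real version might call XMLin($chunk);
# this placeholder just extracts the some_param attribute.
sub process_chunk {
    my ($chunk) = @_;
    my ($param) = $chunk =~ /<object\b[^>]*\bsome_param="([^"]*)"/;
    return defined $param ? $param : '';
}

# Read '</object>'-terminated chunks from $fh and hand them to worker threads.
sub process_in_parallel {
    my ($fh, $n_workers) = @_;
    my $work    = Thread::Queue->new;
    my $results = Thread::Queue->new;

    my @workers = map {
        threads->create(sub {
            # An undef item is the signal for this worker to stop.
            while (defined(my $chunk = $work->dequeue)) {
                $results->enqueue(process_chunk($chunk));
            }
        });
    } 1 .. $n_workers;

    {
        local $/ = '</object>';    # one record per object, as above
        while (my $chunk = <$fh>) {
            next unless $chunk =~ /<object\b/;    # skip trailing whitespace
            $work->enqueue($chunk);
        }
    }
    $work->enqueue(undef) for @workers;    # one stop signal per worker
    $_->join for @workers;

    my @out;
    push @out, $results->dequeue while $results->pending;
    # Workers finish in no fixed order, so sort for a stable result.
    my @sorted = sort @out;
    return @sorted;
}

unless (caller) {
    my $xml = qq{<object some_param="abc"> <attrib1>val1</attrib1> </object>\n}
            . qq{<object some_param="xxx"> <attrib3>valx</attrib3> </object>\n};
    open my $fh, '<', \$xml or die "cannot open in-memory file: $!";
    print "$_\n" for process_in_parallel($fh, 2);
}
```

    The reader here can outrun the workers and fill the queue, so for a really large file some limit on the number of queued chunks would probably be needed.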

    But, if the example is anything like representative of the actual data, the above code will probably allow the entire file to be processed sufficiently quickly -- in a very casual test, less than 2 minutes -- that the need for considering threading disappears completely. The saving comes simply from processing the file in small chunks rather than en masse.

Re^3: XML::Twig and threads
by remiah (Hermit) on Nov 27, 2012 at 09:01 UTC

    Hello grizzley.

    namely he reads the file into memory, does alterations to some params and writes back to another file.

    I guess he is using twig_roots and "twig_print_outside_roots=>1" for that. And I was thinking of Template Toolkit when I read this post.
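
    A minimal sketch of that guess, assuming the elements are called object as in the sample and that the real file has a single root element (a made-up <root> here). The name rewrite_objects, the attribute that gets changed and the new value are all hypothetical:

```perl
use strict;
use warnings;
use XML::Twig;

# Copy the document to $out_fh, changing one attribute on every <object>.
# Only the <object> elements are built as twigs; everything else is
# passed straight through to $out_fh.
sub rewrite_objects {
    my ($xml, $out_fh, $new_value) = @_;

    my $twig = XML::Twig->new(
        twig_roots => {
            object => sub {
                my ($t, $elt) = @_;
                $elt->set_att(some_param => $new_value);    # illustrative edit
                $elt->print($out_fh);    # write the altered element
                $t->purge;               # free what has been handled
            },
        },
        # A filehandle here sends the text outside the roots to that handle.
        twig_print_outside_roots => $out_fh,
    );
    $twig->parse($xml);
    return;
}

unless (caller) {
    my $xml = '<root><object some_param="abc"><attrib1>val1</attrib1></object>'
            . '<object some_param="xxx"><attrib3>valx</attrib3></object></root>';
    rewrite_objects($xml, \*STDOUT, 'changed');
    print "\n";
}
```

    For a file on disk, parsefile would take the place of parse, and the handler is where the per-object alterations would go.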


