http://www.perlmonks.org?node_id=1005623

grizzley has asked for the wisdom of the Perl Monks concerning the following question:

A friend has a script which parses an XML file using XML::Twig. The 10MB input file contains 2000 structures that could be parsed in parallel. He has even bigger files, up to 100MB, and the script takes as long as 30 hours to parse them. I already advised him to check why it is taking so long, but he wants anyway to add threads to this script and speed it up. He ended up with a script producing the error: Free to wrong pool 3080610 not 589260 at C:/Perl64/lib/XML/Parser/Expat.pm line 432.

We have minimized the script to the following:

    #!perl -l
    use XML::Twig;
    use threads;
    use Thread;

    $t= XML::Twig->new(twig_roots => {managedObject => \&handle_fasade});
    $t->parsefile('inputFiles/wcel3g.xml');

    sub handle_fasade{
        my $currentTh = Thread->new( \&thrsub );
        $currentTh->join;
    }

    sub thrsub{ }

If I comment out the join or the parsefile call, or even replace my $currentTh = Thread->new( \&thrsub ); with my $currentTh = Thread->new( { return 0 } ); the error does not occur. What is wrong with this code?

Actually, the longer I spend preparing this node, the clearer it is to me that this approach is senseless. Is it even possible to parse XML in parallel? I would rather say that XML parsing must be done in one thread, and that processing the data afterwards can be done in parallel. Am I right?

update: fake input xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <blah version="2.1" xmlns="blah.xsd">
    <someData type="actual" name="ActualConfiguration" id="1">
    <header>
      <log dateTime="2012-05-08T10:10:10" action="export"/>
    </header>
    <managedObject class="NOKFLF:FLF" distName="MNE-PET/FLF-1000" id="6666666000000093362" timeStamp="2012-04-16T18:17:50" vendor="XXX" version="S14">
      <extension name="system_parameters">
        <p name="$modifier">UNAUTHENTICATED</p>
        <p name="$state">operational</p>
      </extension>
      <list name="FLFOptions">
        <p>0</p> <p>1</p> <p>2</p> <p>3</p> <p>4</p> <p>5</p> <p>7</p> <p>8</p> <p>10</p> <p>12</p>
        <p>13</p> <p>16</p> <p>17</p> <p>20</p> <p>24</p> <p>25</p> <p>29</p> <p>31</p> <p>32</p> <p>34</p>
        <p>35</p> <p>36</p> <p>37</p> <p>41</p> <p>42</p> <p>45</p> <p>46</p> <p>47</p> <p>48</p> <p>50</p>
        <p>51</p> <p>54</p> <p>56</p> <p>61</p> <p>62</p> <p>68</p> <p>69</p> <p>72</p> <p>73</p> <p>74</p>
        <p>88</p> <p>96</p> <p>107</p> <p>108</p> <p>109</p> <p>117</p> <p>118</p> <p>120</p> <p>123</p>
      </list>
      <p name="name1">31</p> <p name="name2">31</p> <p name="name">BRLE8</p> <p name="name4">25</p>
      <p name="name5">50</p> <p name="name6">10</p> <p name="name7">80</p> <p name="name8">20</p>
      <p name="name9">100</p> <p name="nameA">20</p> <p name="nameB">2</p> <p name="xyz">1</p>
      <p name="dbf">0</p> <p name="battery1">30</p> <p name="cpu2">150</p> <p name="FLFType">10</p>
      <p name="lower">40</p> <p name="upper">60</p> <p name="releaseLimit">4</p> <p name="delay">5</p>
      <p name="connection1">14</p> <p name="connection2">7</p> <p name="connection3">12</p> <p name="connection4">12</p>
      <p name="connection5">14</p> <p name="disableExt">0</p> <p name="disableInt">0</p> <p name="frPenalty">3</p>
      <p name="emerC">1</p> <p name="extraXLSNumber">6</p> <p name="extraBSW">64</p> <p name="RelPri">1</p>
      <p name="epHoUse">0</p> <p name="frTchim">30</p> <p name="freeDowngrade">95</p> <p name="freeUpgrade">4</p>
      <p name="freqMeas">30</p> <p name="xCalc">0</p> <p name="param1">4</p> <p name="param2">5</p>
      <p name="param3">0</p> <p name="param4">0</p> <p name="param5">30</p> <p name="param6">0</p>
      <p name="param7">10</p> <p name="param8">127</p> <p name="param9">1</p> <p name="param10">0</p>
      <p name="param20">255</p> <p name="param30">0</p> <p name="dparam1">150</p> <p name="dparam4">100</p>
      <p name="dparam6">186</p> <p name="dparam8">512</p> <p name="dparam10">30</p> <p name="cparam3">120</p>
      <p name="cparam5">50</p> <p name="cparam7">50</p> <p name="cparam9">384</p> <p name="cparam11">384</p>
      <p name="sparam1">21</p> <p name="sparam2">26</p> <p name="sparam3">30</p> <p name="sparam4">20</p>
      <p name="sparam5">25</p> <p name="sparam6">30</p> <p name="sparam7">24</p> <p name="sparam8">29</p>
      <p name="sparam9">120</p> <p name="sparam0">60</p> <p name="sparama">60</p> <p name="sparams">240</p>
      <p name="sparamd">4</p> <p name="sparamgf">1</p> <p name="sparamh">255</p> <p name="sparamh">10</p>
      <p name="sparamj">30</p> <p name="sparamk">3</p> <p name="sparami">18</p> <p name="sparamu">0</p>
      <p name="sparamy">8</p> <p name="sparamt">0</p> <p name="sparamr">1</p> <p name="sparamer">1</p>
      <p name="sparame">9</p> <p name="sparamw">7</p> <p name="somanyparams1">10</p> <p name="somanyparams2">90</p>
      <p name="somanyparams3">10</p> <p name="somanyparams4">70</p> <p name="somanyparams5">90</p> <p name="somanyparams5">20</p>
      <p name="somanyparams6">20</p> <p name="somanyparams7">1</p> <p name="somanyparams0">1</p> <p name="somanyparams8">1</p>
      <p name="somanyparams9">20</p> <p name="somanyparamsa">120</p> <p name="somanyparamss">120</p> <p name="somanyparamsd">1</p>
      <p name="somanyparamsf">400</p> <p name="somanyparamsg">100</p> <p name="somanyparamsh">200</p> <p name="somanyparamsj">25</p>
      <p name="somanyparamsk">1</p> <p name="somanyparamsl">66947</p> <p name="somanyparamso">66947</p> <p name="somanyparamsi">66947</p>
      <p name="somanyparamsu">8</p> <p name="somanyparamsy">0</p> <p name="somanyparamst">65535</p> <p name="somanyparamsr">5</p>
      <p name="anotherparam1">0</p> <p name="anotherparam2">5</p> <p name="anotherparam3">3</p> <p name="anotherparam4">1</p>
      <p name="anotherparam5">5</p> <p name="anotherparam6">3</p> <p name="anotherparam7">1</p> <p name="anotherparam8">3</p>
      <p name="anotherparam9">2</p> <p name="anotherparam0">4</p> <p name="anotherparamq">3</p> <p name="anotherparamw">12</p>
      <p name="anotherparame">6</p> <p name="anotherparamr">3</p> <p name="anotherparamt">6</p> <p name="anotherparamy">9</p>
      <p name="anotherparamu">12</p> <p name="anotherparami">20</p> <p name="anotherparamo">10</p> <p name="anotherparamp">5</p>
      <p name="anotherparama">20</p> <p name="anotherparams">10</p> <p name="anotherparamd">5</p> <p name="anotherparamf">20</p>
      <p name="anotherparamg">10</p> <p name="anotherparamh">5</p> <p name="anotherparamj">20</p> <p name="anotherparamk">10</p>
      <p name="anotherparaml">5</p> <p name="anotherparamz">20</p> <p name="anotherparamx">10</p> <p name="anotherparamc">5</p>
      <p name="anotherparqamb">30</p> <p name="anotherparqamn">0</p> <p name="anotherparqamm">0</p> <p name="anotherparqamas">4152</p>
      <p name="anotherparqams">0</p> <p name="anotherparqamd">15</p> <p name="anotherparqamf">1</p> <p name="anotherparqamg">10</p>
      <p name="anotherparqamh">5</p> <p name="anotherparqamj">128</p> <p name="anotherparqamk">0</p> <p name="anotherparqaml">127</p>
    </managedObject>
    </someData>
    </blah>

A simple one-liner to prepare a huge fake input XML file (replace the number 100 with an appropriate value):

    perl -p0e "s/<managedObject.*?>.*<\/managedObject>/$& x 100/se" testinsmall.xml > testin.xml

Script doing the job (to be optimized):
    use XML::Twig;

    $inputFile = 'testin.xml';
    $outputFile = 'testout.xml';
    $loop = 563;
    $netType = "MNE-1v1";
    $mx2G = "002";
    $my2G = "02";
    $objID = 1;
    $bID = $firstElementID = 1;
    $managedObjectsAmount = 0;
    $someID = 0;
    $segmentID = 0;

    $header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE raml SYSTEM 'blah.dtd'>\n<blah version=\"2.1\" xmlns=\"blah.xsd\">\n<someData type=\"actual\" name=\"ActualConfiguration\" id=\"1\">\n<header>\n<log dateTime=\"2012-05-08T10:10:10\" action=\"export\"/>\n<log dateTime=\"2012-05-08T10:10:10\" action=\"ConfigurationHeaderBackup.id\">1</log>\n<log dateTime=\"2012-05-08T10:10:10\" action=\"ConfigurationHeaderBackup.name\">ActualConfiguration</log>\n</header>\n";
    $root = "<managedObject class=\"CommonStuff:ABCD\" version=\"1.0\" distName=\"$netType\" id=\"12341234\" vendor=\"XXX\" timeStamp=\"2012-04-26T15:18:07\">\n<defaults name=\"System\" id=\"2\"/>\n<extension name=\"system_parameters\">\n<p name=\"\$state\">operational</p>\n</extension>\n</managedObject>";
    $ending = "\n</someData>\n</blah>";

    my ($sec,$min,$hour,$day,$month,$yr19,@rest) = localtime(time);

    open(OUT, ">", $outputFile) or die "cannot open dataOut.txt: $!";
    print OUT $header;
    print OUT $root;

    for $i(1 .. $loop) {
        $t= XML::Twig->new( twig_roots => { managedObject => \&handle_managedObject});
        $t->parsefile($inputFile);
        print "\nIteracja: $i / $loop \t-> OK\n";
        $bID++;
        $someID = 0;
    }

    print OUT $ending;
    close (OUT);

    print "\n----------------\nObjects managed: $managedObjectsAmount \n\n";
    my ($sec2,$min2,$hour2,$day2,$month2,$yr192,@rest2) = localtime(time);
    printStartTime();
    printEndTime();

    sub handle_managedObject {
        my ($t, $element) = @_;
        @fields = split(/\//, $element->{'att'}->{'distName'});

        # distName="MNE-PET/*" - OK
        if ($fields[0] ne $netType) {
            $fields[0] = $netType;
        }

        # distName="MNE-PET/FLF-1000..1064" - OK
        if ($fields[1] =~ /^FLF/) {
            $fields[1] = "FLF-".$bID;
            if (!$fields[2]) {
                $element->first_child('p[@name="name"]')->set_text($fields[1]);
            }
        }

        # distName="MNE-PET/FLF-*/WTF-1..65" -> / FLF
        if ($fields[2] =~ /^WTF-\w+/) {
            $fields[2] = "WTF-".$someID;
            if (!$fields[3]) {
                $fields[2] = "WTF-".++$someID;
                $element->first_child('p[@name="name"]')->set_text($fields[2]);
            }
        }

        # distName="MNE-PET/FLF-*/WTF-*/XLS-1..6" -> /WTF
        if (($fields[3] =~ /^XLS-\w+/) && (!$fields[4])) {
            @fieldsFLF = split(/-/, $fields[1]);
            @fieldsWTF = split(/-/, $fields[2]);
            @fieldsXLS = split(/-/, $fields[3]);
            $cId = $fieldsWTF[1].$fieldsXLS[1];
            $element->first_child('p[@name="name"]')->set_text($fields[3]);
            $element->first_child('p[@name="cId"]')->set_text($cId);
            $element->first_child('p[@name="locAreaId1"]')->set_text($fieldsFLF[1]);
            $element->first_child('p[@name="locAreaId2"]')->set_text($mx2G);
            $element->first_child('p[@name="locAreaId3"]')->set_text($my2G);
            if ($fieldsXLS[1] == 1) {
                $element->first_child('p[@name="masterWTF"]')->set_text(1);
                $element->first_child('p[@name="segmentId"]')->set_text(++$segmentID);
            } else {
                $element->first_child('p[@name="masterWTF"]')->set_text(0);
                $element->first_child('p[@name="segmentId"]')->set_text($segmentID);
            }
        }

        $element->{'att'}->{'distName'} = join ('/',@fields);
        $element->{'att'}->{'id'} = $objID++;
        $element->set_pretty_print( 'indented');
        $element->print(\*OUT) or die "Failed to write managedObject to output XML file:$!\n";
        $managedObjectsAmount++;
    }

    sub printToFile {
        $element->set_pretty_print( 'indented');
        $element->flush(\*OUT) or die "Failed to write element output XML file:$!\n";
    }

    sub printStartTime {
        print "START Time:\t".sprintf("%02d",$hour).":".sprintf("%02d",$min).":".sprintf("%02d",$sec); ###To print the current time
        print "\t$day-".++$month. "-".($yr19+1900)."\n"; ####To print date format as expected
    }

    sub printEndTime {
        print "END Time:\t".sprintf("%02d",$hour2).":".sprintf("%02d",$min2).":".sprintf("%02d",$sec2); ###To print the current time
        print "\t$day2-".++$month2. "-".($yr192+1900)."\n"; ####To print date format as expected
    }

Replies are listed 'Best First'.
Re: XML::Twig and threads
by BrowserUk (Patriarch) on Nov 26, 2012 at 12:58 UTC
    he wants anyway to add threads to this script and speed it up.

    Tell him it simply will not work. XML::Twig uses an OO interface, and sharing objects between threads, whilst possible, will never speed things up. This is because the time cost of accessing shared memory (in Perl) is far higher than that of accessing private memory.

    If you (he) would care to share a (small) sample of the XML in question that shows the repetitive structure, then there is probably an effective way to process it in parallel, by breaking up the top level using non-XML parser techniques and then using several non-shared XML parser instances within threads. But it will be necessary to see a realistic sample to advise further.

    The real problem here is that the design of XML requires that an XML document be treated as an indivisible entity, which -- if you stick to the XML parsing rules -- makes parallel processing of XML not just difficult, but impossible. By design. It is lamentable that XML has become so ingrained in people's psyches.
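
The idea of breaking up the top level with non-XML techniques and handing the pieces to independent workers could be sketched roughly as follows. This is only a sketch under assumptions: it assumes a Perl built with thread support and a Thread::Queue recent enough to provide end(), it assumes managedObject elements are never nested, and the names process_chunk and process_in_parallel are hypothetical. The regex in process_chunk is merely a stand-in for the real work; in a real script each worker would build its own private parser at that point.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical stand-in for the real per-object work. A real worker would
# create its own private parser here, so no parser object is ever shared.
sub process_chunk {
    my ($chunk) = @_;
    my ($dist_name) = $chunk =~ /<managedObject\b[^>]*\bdistName="([^"]*)"/;
    return defined $dist_name ? $dist_name : '';
}

# Split the file on the closing tag (assumes managedObject is never nested)
# and farm the chunks out to a fixed pool of worker threads.
sub process_in_parallel {
    my ($path, $n_workers) = @_;
    my $work    = Thread::Queue->new;
    my $results = Thread::Queue->new;

    # Start the workers first, before any file is opened or parsed.
    my @workers = map {
        threads->create(sub {
            while (defined(my $chunk = $work->dequeue)) {
                $results->enqueue(process_chunk($chunk));
            }
        });
    } 1 .. $n_workers;

    open my $fh, '<', $path or die "cannot open $path: $!";
    {
        local $/ = '</managedObject>';      # read one object per record
        while (defined(my $chunk = <$fh>)) {
            next unless $chunk =~ /<managedObject\b/;   # skip trailing text
            $work->enqueue($chunk);
        }
    }
    close $fh;

    $work->end;                 # dequeue returns undef once the queue drains
    $_->join for @workers;
    $results->end;

    my @out;
    while (defined(my $r = $results->dequeue)) { push @out, $r }
    my @sorted = sort @out;     # completion order is not deterministic
    return @sorted;
}
```

Calling process_in_parallel('testin.xml', 4) would return the sorted distName values found in that file. Whether this is faster than a single thread depends entirely on how expensive the per-chunk work is compared with the cost of copying chunks through the queue.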


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

      The XML is very simple. I cannot share it as it is company confidential, but it looks just like this:

          <object some_param="abc" other_param="def">
            <attrib1>val1</attrib1>
            <attrib5>val3</attrib5>
          </object>
          <object some_param="xxx">
            <attrib3>valx</attrib3>
            <attrib7>valy</attrib7>
          </object>
          <object some_param="xyz">
            <attrib1>valx</attrib1>
            <attrib2>valy</attrib2>
            <attrib3>valx</attrib3>
            <attrib4>valy</attrib4>
            <attrib5>valx</attrib5>
            <attrib6>valy</attrib6>
            <attrib7>valx</attrib7>
            <attrib8>valy</attrib8>
          </object>

      There are many objects (1752 in the 93MB file), and each object has a list of attributes (up to 700 in the 93MB file).

      He further clarified that his concern is something else again, namely he reads the file into memory, does alterations to some params and writes back to another file. This altered data is used to test the system, e.g. 150 different versions of a 10MB file are written to one file, which is then 1.5GB. So if we can manage to use threads inside the managedObject => \&handle_fasade handler, it may really be of some help while producing the output.

      A simple program reading a 100MB XML file took 2 minutes and 3.5GB of RAM, so I think his 30 hours may be an out-of-physical-memory problem. I'll add more details tomorrow.

        The first thing to say is that that is not well-formed XML. (A well-formed XML document must contain exactly one top-level element.)

        That said, for the purposes of processing, that (arbitrary) XML rule works in our favour and makes writing a program that processes the large file in smallish chunks very simple:

            #! perl -slw
            use strict;
            use XML::Simple;
            use Data::Dump qw[ pp ];

            $/ = '</object>';

            while( <DATA> ) {
                last if /^\n+$/;
                my $xml = XMLin( $_ );
                pp $xml;
            }

            __DATA__
            <object some_param="abc" other_param="def">
              <attrib1>val1</attrib1>
              <attrib5>val3</attrib5>
            </object>
            <object some_param="xxx">
              <attrib3>valx</attrib3>
              <attrib7>valy</attrib7>
            </object>
            <object some_param="xyz">
              <attrib1>valx</attrib1>
              <attrib2>valy</attrib2>
              <attrib3>valx</attrib3>
              <attrib4>valy</attrib4>
              <attrib5>valx</attrib5>
              <attrib6>valy</attrib6>
              <attrib7>valx</attrib7>
              <attrib8>valy</attrib8>
            </object>

        That produces:

            C:\test>t-XML.pl
            {
              attrib1 => "val1",
              attrib5 => "val3",
              other_param => "def",
              some_param => "abc",
            }
            { attrib3 => "valx", attrib7 => "valy", some_param => "xxx" }
            {
              attrib1 => "valx",
              attrib2 => "valy",
              attrib3 => "valx",
              attrib4 => "valy",
              attrib5 => "valx",
              attrib6 => "valy",
              attrib7 => "valx",
              attrib8 => "valy",
              some_param => "xyz",
            }

        In addition to allowing the huge file to be processed very quickly in minimal memory, it would -- were the processing requirements of the individual chunks sufficiently taxing to warrant it -- enable multiple individual chunks to be processed in parallel with threading very easily.

        But, if the example is anything like representative of the actual data, the above code will probably allow the entire file to be processed sufficiently quickly -- in a very casual test, less than 2 minutes -- that the need to consider threading disappears completely. The saving comes simply from processing the file in small chunks rather than en masse.



        Hello grizzley.

        namely he reads the file into memory, does alterations to some params and writes back to another file.

        I guess he is using twig_roots and "twig_print_outside_roots=>1" for that. I was also thinking of Template Toolkit when I read this post.
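
For readers unfamiliar with that combination, here is a minimal sketch of how twig_roots and twig_print_outside_roots can work together to alter some elements and copy the rest of the document through. It is not the friend's actual script: the function name rewrite_dist_names and its parameters are hypothetical, and the only alteration shown is replacing the first component of distName.

```perl
use strict;
use warnings;
use XML::Twig;

# Rewrite the first distName component of every managedObject and stream
# the whole document to $out_fh. Function and parameter names are
# hypothetical, chosen for this sketch only.
sub rewrite_dist_names {
    my ($xml, $out_fh, $net_type) = @_;

    my $twig = XML::Twig->new(
        twig_roots => {
            managedObject => sub {
                my ($t, $elt) = @_;
                my @fields = split m{/}, $elt->att('distName');
                $fields[0] = $net_type;
                $elt->set_att(distName => join('/', @fields));
                $elt->print($out_fh);   # emit the altered element
                $t->purge;              # then release its memory
            },
        },
        # Everything outside the twig roots is copied to the same handle.
        twig_print_outside_roots => $out_fh,
    );

    $twig->parse($xml);
    return;
}
```

With a file, one would open an output handle and call parsefile instead of parse. Because each element is printed and purged as soon as it has been handled, memory use should stay roughly proportional to one managedObject rather than to the whole document.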

        regards.

Re: XML::Twig and threads
by mirod (Canon) on Nov 26, 2012 at 12:52 UTC

    Yes, it doesn't make much sense to have the whole handler in a separate thread. What could be done is for the handler in the main thread to extract the data it needs, and then to do the processing in a separate thread. That is assuming that the data can be extracted just from the current element and that processing it doesn't change the original XML.

    Another option might be to split the initial XML and then to process the pieces in parallel. xml_split, a tool that comes with XML::Twig, could do this.

    That said, it is indeed strange that it takes so long to process the data. I somehow doubt that the XML parsing is responsible for this.
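
The first suggestion (parse in the main thread, hand plain data to worker threads) might look something like the sketch below. It assumes a thread-enabled Perl and a Thread::Queue recent enough to provide end(); process_record and parse_then_process are hypothetical names, and process_record is only a placeholder for whatever expensive processing is really needed. Note that the workers are created up front rather than from inside the handler, and that only plain strings and hashes are queued, never Twig objects.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use XML::Twig;

# Hypothetical placeholder for the expensive per-object processing.
sub process_record {
    my ($rec) = @_;
    return "$rec->{distName}:" . scalar(keys %{ $rec->{params} });
}

sub parse_then_process {
    my ($xml, $n_workers) = @_;
    my $work    = Thread::Queue->new;
    my $results = Thread::Queue->new;

    # Workers are started before any parser object exists, and never
    # from inside an XML::Twig handler.
    my @workers = map {
        threads->create(sub {
            while (defined(my $rec = $work->dequeue)) {
                $results->enqueue(process_record($rec));
            }
        });
    } 1 .. $n_workers;

    # All XML parsing stays in the main thread.
    XML::Twig->new(
        twig_roots => {
            managedObject => sub {
                my ($t, $elt) = @_;
                # Copy out plain strings only; no Twig objects cross threads.
                my %params = map { ( $_->att('name'), $_->text ) }
                             $elt->children('p[@name]');
                $work->enqueue({ distName => $elt->att('distName'),
                                 params   => \%params });
                $t->purge;      # free the element once its data is extracted
            },
        },
    )->parse($xml);

    $work->end;                 # dequeue returns undef once the queue drains
    $_->join for @workers;
    $results->end;

    my @out;
    while (defined(my $r = $results->dequeue)) { push @out, $r }
    my @sorted = sort @out;     # completion order is not deterministic
    return @sorted;
}
```

As noted above, this only works if each element can be processed from its own data and the processing does not need to change the original XML. If the results must be written back in document order, they would also need a sequence number.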

      I did a simple test:
          use XML::Twig;
          use threads;

          $start = time;
          $t= XML::Twig->new(twig_roots => {managedObject => \&handle_fasade});
          $t->parsefile('inputFiles/input100MB.xml');
          print "Time: ", time-$start;

          sub handle_fasade{ }
      and the output was:
          # Time: 149s, 3.5GB RAM
          # Script quits after 71s
      So you are right - 2 minutes is not much time. What worries me is 3.5GB RAM, because of further clarification in Re^2: XML::Twig and threads.
        So you are right - 2 minutes is not much time. What worries me is 3.5GB RAM

        I'll bet £1 to 1p that if you comment out the use threads;, the memory consumption will barely change.

        It is not at all uncommon for a 100MB XML file to translate into 3.6GB of ram requirement once it has been parsed and the equivalent data structure constructed.

        The memory requirement has nothing to do with threading. Just Perl's well-known tendency to trade memory for cpu.


Re: XML::Twig and threads
by roboticus (Chancellor) on Nov 26, 2012 at 13:09 UTC

    grizzley:

    It sounds like your friend is confusing "parsing" with "processing". There's no way a 10MB file should take a minute to parse, much less 30 hours. Mirod's suggestion to parse the XML file and pass the parsed data to a processing subroutine in threads should be better. (Assuming, of course, that it's not a CPU or memory bound task.) That way you don't have to figure out how to parse XML in parallel, which is a can of worms.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: XML::Twig and threads
by zentara (Archbishop) on Nov 26, 2012 at 12:46 UTC

      Thanks, sometimes one needs just two proper words to feed Google.

      For example, I found this one: Script crashes when parsing XML. But the hints found there (to replace Thread with threads and to use safe_parsefile) didn't help; I still get the "Free to wrong pool..." error. On other forums this error message is unanswered as well. I'll keep digging, though.
Re: XML::Twig and threads
by remiah (Hermit) on Nov 26, 2012 at 16:05 UTC

    Hello, grizzley.

    I have little experience with huge XML files, so I took a ready-made 100MB XML sample file as an example.

    Does your colleague have free memory while his process is running? XML::Twig will eat up memory on large XML files without "purge" or "flush".

    Below is my test script, which counts text tags in two ways.

        use strict;
        use warnings;
        use XML::Twig;
        use Time::HiRes;

        my $cnt1=0;
        my $b1=Time::HiRes::time();
        XML::Twig->new(
            twig_roots => {
                'text' => sub{ $cnt1++; $_[0]->purge;},
            },
        )->parsefile("standard");
        my $e1=Time::HiRes::time();

        my $cnt2=0;
        my $b2=Time::HiRes::time();
        XML::Twig->new(
            twig_roots =>{
                '/site/regions/africa//text' => sub{$cnt2++;},
            },
        )->parsefile("standard");
        my $e2=Time::HiRes::time();

        print "1. text count=$cnt1, time=".($e1-$b1)."\n";
        print "2. text count=$cnt2, time=".($e2-$b2)."\n";
        __DATA__
        1. text count=105114, time=111.188741922379
        2. text count=1657, time=60.9104990959167

    When I forgot to purge(), the first example ate up my memory and dumped core. Sometimes purge() needs some care, because it purges the innermost element (the Twig example in XML Newbie is related to this).

    And if you can narrow the range with an XPath-like expression, it could become faster.

    I agree with the other monks' opinions ...
    regards.

      He does use 'flush', and there are some XPath expressions. I've attached the script and a fake example XML to the original node. It doesn't look like a bad design, and yet I hope something can be improved there.

        I understand your situation at last.
        So, copying the original and reusing it would look like this.

            my $t= XML::Twig->new();
            $t->parsefile($inputFile);
            my $someData =$t->root->first_child; #someData

            for my $i(1 .. $loop) {
                for ( $someData->children_copy( 'managedObject') ){
                    handle_managedObject($t, $_);
                }
                print "Iteracja: $i / $loop \t-> OK\n";
                $bID++;
                $someID = 0;
            }

        It becomes slower than the original on my machine, because deep-copying the Elt objects takes too much time for large XML. I wonder whether it is the same in your environment? Or maybe you already know this ...

        As BrowserUk says, please use strict and warnings.

Re: XML::Twig and threads
by Anonymous Monk on Nov 26, 2012 at 14:08 UTC
    Your friend's code has massive bugs in it. 10MB is not large; nor for that matter is 100MB. Not anymore. Also, threads don't make things faster ... they subdivide the computer's time; they don't multiply it. Profile the code to find the bug.
      Also, threads don't make things faster ... they subdivide the computer's time; they don't multiply it.

      Speaking anonymously does not make your "wisdom" any less wrong.

      Threads allow you to use multiple processors concurrently. Thus, under the right circumstances, they multiply the amount of processing that can be done by a single process within a given unit of time.

      I.e., for the right type of processing, the use of threads can "make things faster".

      Bottom line: you're talking bollocks.

