grizzley has asked for the wisdom of the Perl Monks concerning the following question:
A friend has a script that parses an XML file using XML::Twig. The input is a 10MB file with 2000 structures that could be parsed in parallel. He has even bigger files, up to 100MB, and the script takes as long as 30 hours to parse them. I already advised him to find out why it takes so long, but he wants to add threads to the script anyway to speed it up. He ended up with a script that produces this error: Free to wrong pool 3080610 not 589260 at C:/Perl64/lib/XML/Parser/Expat.pm line 432.
We have minimized the script to the following one:

    #!perl -l
    use XML::Twig;
    use threads;
    use Thread;

    $t= XML::Twig->new(twig_roots => {managedObject => \&handle_fasade});
    $t->parsefile('inputFiles/wcel3g.xml');

    sub handle_fasade{
        my $currentTh = Thread->new( \&thrsub );
        $currentTh->join;
    }

    sub thrsub{
    }

If I comment out join or parsefile or even replace my $currentTh = Thread->new( \&thrsub ); with my $currentTh = Thread->new( { return 0 } ); error does not occur. What is wrong in this code?
Actually the longer I prepare this node the clearer it is to me that this approach is senseless. Is it even possible to parse XML in parallel? I would rather say XML parsing must be done in one thread and afterwards processing data can be done in parallel. Am I right?
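To make the "parse in one thread, process in parallel" idea concrete, here is a minimal sketch. It is an illustration only and not code from my friend's script: process_record and parse_in_one_thread_process_in_many are hypothetical names, and the per-record work is a stand-in that just pulls out the distName attribute. It assumes a threaded perl with XML::Twig installed. The workers are started before the twig object exists, so no thread ever gets a clone of the expat parser, and only plain strings travel through the queues.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use XML::Twig;

# Hypothetical per-record work: a stand-in that extracts the distName
# attribute from one serialized managedObject element.
sub process_record {
    my ($xml) = @_;
    my ($dist) = $xml =~ /distName="([^"]*)"/;
    return defined $dist ? $dist : '';
}

# Hypothetical driver: the main thread parses, the workers process.
sub parse_in_one_thread_process_in_many {
    my ($xml_text, $n_workers) = @_;
    my $work    = Thread::Queue->new;
    my $results = Thread::Queue->new;

    # Start the workers first, before any parser object is created.
    my @workers;
    for (1 .. $n_workers) {
        push @workers, threads->create(sub {
            # An undef item is the "no more work" signal.
            while (defined(my $xml = $work->dequeue)) {
                $results->enqueue(process_record($xml));
            }
        });
    }

    # Parsing happens only here, in the main thread.
    my $twig = XML::Twig->new(
        twig_handlers => {
            managedObject => sub {
                my ($t, $elt) = @_;
                $work->enqueue($elt->sprint);   # hand over a plain string
                $t->purge;                      # release what was parsed so far
            },
        },
    );
    $twig->parse($xml_text);

    # One undef per worker, then wait for all of them.
    $work->enqueue(undef) for @workers;
    $_->join for @workers;

    my @out;
    while (defined(my $r = $results->dequeue_nb)) {
        push @out, $r;
    }
    @out = sort @out;   # workers finish in no particular order
    return @out;
}
```

Results come back in whatever order the workers finish, which is why the sketch sorts them. A real script that must keep document order would need to tag each record with a sequence number before queuing it.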
update: fake input xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <blah version="2.1" xmlns="blah.xsd">
    <someData type="actual" name="ActualConfiguration" id="1">
    <header>
    <log dateTime="2012-05-08T10:10:10" action="export"/>
    </header>
    <managedObject class="NOKFLF:FLF" distName="MNE-PET/FLF-1000" id="6666666000000093362" timeStamp="2012-04-16T18:17:50" vendor="XXX" version="S14">
    <extension name="system_parameters">
    <p name="$modifier">UNAUTHENTICATED</p>
    <p name="$state">operational</p>
    </extension>
    <list name="FLFOptions">
    <p>0</p> <p>1</p> <p>2</p> <p>3</p> <p>4</p> <p>5</p> <p>7</p> <p>8</p> <p>10</p> <p>12</p>
    <p>13</p> <p>16</p> <p>17</p> <p>20</p> <p>24</p> <p>25</p> <p>29</p> <p>31</p> <p>32</p> <p>34</p>
    <p>35</p> <p>36</p> <p>37</p> <p>41</p> <p>42</p> <p>45</p> <p>46</p> <p>47</p> <p>48</p> <p>50</p>
    <p>51</p> <p>54</p> <p>56</p> <p>61</p> <p>62</p> <p>68</p> <p>69</p> <p>72</p> <p>73</p> <p>74</p>
    <p>88</p> <p>96</p> <p>107</p> <p>108</p> <p>109</p> <p>117</p> <p>118</p> <p>120</p> <p>123</p>
    </list>
    <p name="name1">31</p> <p name="name2">31</p> <p name="name">BRLE8</p> <p name="name4">25</p>
    <p name="name5">50</p> <p name="name6">10</p> <p name="name7">80</p> <p name="name8">20</p>
    <p name="name9">100</p> <p name="nameA">20</p> <p name="nameB">2</p> <p name="xyz">1</p>
    <p name="dbf">0</p> <p name="battery1">30</p> <p name="cpu2">150</p> <p name="FLFType">10</p>
    <p name="lower">40</p> <p name="upper">60</p> <p name="releaseLimit">4</p> <p name="delay">5</p>
    <p name="connection1">14</p> <p name="connection2">7</p> <p name="connection3">12</p> <p name="connection4">12</p>
    <p name="connection5">14</p> <p name="disableExt">0</p> <p name="disableInt">0</p> <p name="frPenalty">3</p>
    <p name="emerC">1</p> <p name="extraXLSNumber">6</p> <p name="extraBSW">64</p> <p name="RelPri">1</p>
    <p name="epHoUse">0</p> <p name="frTchim">30</p> <p name="freeDowngrade">95</p> <p name="freeUpgrade">4</p>
    <p name="freqMeas">30</p> <p name="xCalc">0</p> <p name="param1">4</p> <p name="param2">5</p>
    <p name="param3">0</p> <p name="param4">0</p> <p name="param5">30</p> <p name="param6">0</p>
    <p name="param7">10</p> <p name="param8">127</p> <p name="param9">1</p> <p name="param10">0</p>
    <p name="param20">255</p> <p name="param30">0</p> <p name="dparam1">150</p> <p name="dparam4">100</p>
    <p name="dparam6">186</p> <p name="dparam8">512</p> <p name="dparam10">30</p> <p name="cparam3">120</p>
    <p name="cparam5">50</p> <p name="cparam7">50</p> <p name="cparam9">384</p> <p name="cparam11">384</p>
    <p name="sparam1">21</p> <p name="sparam2">26</p> <p name="sparam3">30</p> <p name="sparam4">20</p>
    <p name="sparam5">25</p> <p name="sparam6">30</p> <p name="sparam7">24</p> <p name="sparam8">29</p>
    <p name="sparam9">120</p> <p name="sparam0">60</p> <p name="sparama">60</p> <p name="sparams">240</p>
    <p name="sparamd">4</p> <p name="sparamgf">1</p> <p name="sparamh">255</p> <p name="sparamh">10</p>
    <p name="sparamj">30</p> <p name="sparamk">3</p> <p name="sparami">18</p> <p name="sparamu">0</p>
    <p name="sparamy">8</p> <p name="sparamt">0</p> <p name="sparamr">1</p> <p name="sparamer">1</p>
    <p name="sparame">9</p> <p name="sparamw">7</p> <p name="somanyparams1">10</p> <p name="somanyparams2">90</p>
    <p name="somanyparams3">10</p> <p name="somanyparams4">70</p> <p name="somanyparams5">90</p> <p name="somanyparams5">20</p>
    <p name="somanyparams6">20</p> <p name="somanyparams7">1</p> <p name="somanyparams0">1</p> <p name="somanyparams8">1</p>
    <p name="somanyparams9">20</p> <p name="somanyparamsa">120</p> <p name="somanyparamss">120</p> <p name="somanyparamsd">1</p>
    <p name="somanyparamsf">400</p> <p name="somanyparamsg">100</p> <p name="somanyparamsh">200</p> <p name="somanyparamsj">25</p>
    <p name="somanyparamsk">1</p> <p name="somanyparamsl">66947</p> <p name="somanyparamso">66947</p> <p name="somanyparamsi">66947</p>
    <p name="somanyparamsu">8</p> <p name="somanyparamsy">0</p> <p name="somanyparamst">65535</p> <p name="somanyparamsr">5</p>
    <p name="anotherparam1">0</p> <p name="anotherparam2">5</p> <p name="anotherparam3">3</p> <p name="anotherparam4">1</p>
    <p name="anotherparam5">5</p> <p name="anotherparam6">3</p> <p name="anotherparam7">1</p> <p name="anotherparam8">3</p>
    <p name="anotherparam9">2</p> <p name="anotherparam0">4</p> <p name="anotherparamq">3</p> <p name="anotherparamw">12</p>
    <p name="anotherparame">6</p> <p name="anotherparamr">3</p> <p name="anotherparamt">6</p> <p name="anotherparamy">9</p>
    <p name="anotherparamu">12</p> <p name="anotherparami">20</p> <p name="anotherparamo">10</p> <p name="anotherparamp">5</p>
    <p name="anotherparama">20</p> <p name="anotherparams">10</p> <p name="anotherparamd">5</p> <p name="anotherparamf">20</p>
    <p name="anotherparamg">10</p> <p name="anotherparamh">5</p> <p name="anotherparamj">20</p> <p name="anotherparamk">10</p>
    <p name="anotherparaml">5</p> <p name="anotherparamz">20</p> <p name="anotherparamx">10</p> <p name="anotherparamc">5</p>
    <p name="anotherparqamb">30</p> <p name="anotherparqamn">0</p> <p name="anotherparqamm">0</p> <p name="anotherparqamas">4152</p>
    <p name="anotherparqams">0</p> <p name="anotherparqamd">15</p> <p name="anotherparqamf">1</p> <p name="anotherparqamg">10</p>
    <p name="anotherparqamh">5</p> <p name="anotherparqamj">128</p> <p name="anotherparqamk">0</p> <p name="anotherparqaml">127</p>
    </managedObject>
    </someData>
    </blah>

Simple line preparing huge fake input xml (replace number 100 with appropriate value):

    perl -p0e "s/<managedObject.*?>.*<\/managedObject>/$& x 100/se" testinsmall.xml > testin.xml

Script doing the job (to be optimized):
    use XML::Twig;

    $inputFile = 'testin.xml';
    $outputFile = 'testout.xml';
    $loop = 563;
    $netType = "MNE-1v1";
    $mx2G = "002";
    $my2G = "02";
    $objID = 1;
    $bID = $firstElementID = 1;
    $managedObjectsAmount = 0;
    $someID = 0;
    $segmentID = 0;

    $header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE raml SYSTEM 'blah.dtd'>\n<blah version=\"2.1\" xmlns=\"blah.xsd\">\n<someData type=\"actual\" name=\"ActualConfiguration\" id=\"1\">\n<header>\n<log dateTime=\"2012-05-08T10:10:10\" action=\"export\"/>\n<log dateTime=\"2012-05-08T10:10:10\" action=\"ConfigurationHeaderBackup.id\">1</log>\n<log dateTime=\"2012-05-08T10:10:10\" action=\"ConfigurationHeaderBackup.name\">ActualConfiguration</log>\n</header>\n";
    $root = "<managedObject class=\"CommonStuff:ABCD\" version=\"1.0\" distName=\"$netType\" id=\"12341234\" vendor=\"XXX\" timeStamp=\"2012-04-26T15:18:07\">\n<defaults name=\"System\" id=\"2\"/>\n<extension name=\"system_parameters\">\n<p name=\"\$state\">operational</p>\n</extension>\n</managedObject>";
    $ending = "\n</someData>\n</blah>";

    my ($sec,$min,$hour,$day,$month,$yr19,@rest) = localtime(time);

    open(OUT, ">", $outputFile) or die "cannot open $outputFile: $!";
    print OUT $header;
    print OUT $root;

    # The whole input file is parsed again on every iteration.
    for $i(1 .. $loop) {
        $t= XML::Twig->new( twig_roots => { managedObject => \&handle_managedObject});
        $t->parsefile($inputFile);
        print "\nIteration: $i / $loop \t-> OK\n";
        $bID++;
        $someID = 0;
    }

    print OUT $ending;
    close (OUT);

    print "\n----------------\nObjects managed: $managedObjectsAmount \n\n";
    my ($sec2,$min2,$hour2,$day2,$month2,$yr192,@rest2) = localtime(time);
    printStartTime();
    printEndTime();

    sub handle_managedObject {
        my ($t, $element) = @_;
        @fields = split(/\//, $element->{'att'}->{'distName'});

        # distName="MNE-PET/*" - OK
        if ($fields[0] ne $netType) {
            $fields[0] = $netType;
        }

        # distName="MNE-PET/FLF-1000..1064" - OK
        if ($fields[1] =~ /^FLF/) {
            $fields[1] = "FLF-".$bID;
            if (!$fields[2]) {
                $element->first_child('p[@name="name"]')->set_text($fields[1]);
            }
        }

        # distName="MNE-PET/FLF-*/WTF-1..65" -> / FLF
        if ($fields[2] =~ /^WTF-\w+/) {
            $fields[2] = "WTF-".$someID;
            if (!$fields[3]) {
                $fields[2] = "WTF-".++$someID;
                $element->first_child('p[@name="name"]')->set_text($fields[2]);
            }
        }

        # distName="MNE-PET/FLF-*/WTF-*/XLS-1..6" -> /WTF
        if (($fields[3] =~ /^XLS-\w+/) && (!$fields[4])) {
            @fieldsFLF = split(/-/, $fields[1]);
            @fieldsWTF = split(/-/, $fields[2]);
            @fieldsXLS = split(/-/, $fields[3]);
            $cId = $fieldsWTF[1].$fieldsXLS[1];
            $element->first_child('p[@name="name"]')->set_text($fields[3]);
            $element->first_child('p[@name="cId"]')->set_text($cId);
            $element->first_child('p[@name="locAreaId1"]')->set_text($fieldsFLF[1]);
            $element->first_child('p[@name="locAreaId2"]')->set_text($mx2G);
            $element->first_child('p[@name="locAreaId3"]')->set_text($my2G);
            if ($fieldsXLS[1] == 1) {
                $element->first_child('p[@name="masterWTF"]')->set_text(1);
                $element->first_child('p[@name="segmentId"]')->set_text(++$segmentID);
            } else {
                $element->first_child('p[@name="masterWTF"]')->set_text(0);
                $element->first_child('p[@name="segmentId"]')->set_text($segmentID);
            }
        }

        $element->{'att'}->{'distName'} = join ('/',@fields);
        $element->{'att'}->{'id'} = $objID++;
        $element->set_pretty_print( 'indented');
        $element->print(\*OUT) or die "Failed to write managedObject to output XML file:$!\n";
        $managedObjectsAmount++;
    }

    sub printToFile {
        $element->set_pretty_print( 'indented');
        $element->flush(\*OUT) or die "Failed to write element output XML file:$!\n";
    }

    sub printStartTime {
        print "START Time:\t".sprintf("%02d",$hour).":".sprintf("%02d",$min).":".sprintf("%02d",$sec);###To print the current time
        print "\t$day-".++$month. "-".($yr19+1900)."\n"; ####To print date format as expected
    }

    sub printEndTime {
        print "END Time:\t".sprintf("%02d",$hour2).":".sprintf("%02d",$min2).":".sprintf("%02d",$sec2);###To print the current time
        print "\t$day2-".++$month2. "-".($yr192+1900)."\n"; ####To print date format as expected
    }
Replies are listed 'Best First'.
Re: XML::Twig and threads
by BrowserUk (Patriarch) on Nov 26, 2012 at 12:58 UTC
by grizzley (Chaplain) on Nov 26, 2012 at 16:15 UTC
by BrowserUk (Patriarch) on Nov 26, 2012 at 16:36 UTC
by remiah (Hermit) on Nov 27, 2012 at 09:01 UTC
Re: XML::Twig and threads
by mirod (Canon) on Nov 26, 2012 at 12:52 UTC
by grizzley (Chaplain) on Nov 26, 2012 at 15:59 UTC
by BrowserUk (Patriarch) on Nov 26, 2012 at 16:11 UTC
by grizzley (Chaplain) on Nov 28, 2012 at 09:57 UTC
by BrowserUk (Patriarch) on Nov 28, 2012 at 13:44 UTC
Re: XML::Twig and threads
by roboticus (Chancellor) on Nov 26, 2012 at 13:09 UTC
Re: XML::Twig and threads
by zentara (Archbishop) on Nov 26, 2012 at 12:46 UTC
by grizzley (Chaplain) on Nov 26, 2012 at 15:52 UTC
Re: XML::Twig and threads
by remiah (Hermit) on Nov 26, 2012 at 16:05 UTC
by grizzley (Chaplain) on Nov 28, 2012 at 10:04 UTC
by remiah (Hermit) on Nov 29, 2012 at 00:31 UTC
by BrowserUk (Patriarch) on Nov 29, 2012 at 00:36 UTC
by remiah (Hermit) on Nov 29, 2012 at 02:18 UTC
Re: XML::Twig and threads
by Anonymous Monk on Nov 26, 2012 at 14:08 UTC
by BrowserUk (Patriarch) on Nov 26, 2012 at 14:19 UTC