<?xml version="1.0" encoding="windows-1252"?>
<node id="1005712" title="Re^3: XML::Twig and threads" created="2012-11-26 11:36:39" updated="2012-11-26 11:36:39">
<type id="11">
note</type>
<author id="171588">
BrowserUk</author>
<data>
<field name="doctext">
&lt;p&gt;The first thing to say is that that is not well-formed XML. (A well-formed XML document must contain a single top-level element.)

&lt;P&gt;That said, for the purposes of processing, the file's disregard for that (arbitrary) XML rule works in our favour and makes it very simple to write a program that processes the large file in smallish chunks:&lt;code&gt;
#! perl -slw
use strict;
use XML::Simple;
use Data::Dump qw[ pp ];

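# Read one complete &lt;object&gt;...&lt;/object&gt; record per iteration.
$/ = '&lt;/object&gt;';
while( &lt;DATA&gt; ) {
    # Whatever follows the final &lt;/object&gt; is only whitespace; stop there.
    last if /^\s*$/;
    my $xml = XMLin( $_ );   # each record parses as a document by itself
    pp $xml;
}
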
__DATA__
&lt;object some_param="abc" other_param="def"&gt;
  &lt;attrib1&gt;val1&lt;/attrib1&gt;
  &lt;attrib5&gt;val3&lt;/attrib5&gt;
&lt;/object&gt;
&lt;object some_param="xxx"&gt;
  &lt;attrib3&gt;valx&lt;/attrib3&gt;
  &lt;attrib7&gt;valy&lt;/attrib7&gt;
&lt;/object&gt;
&lt;object some_param="xyz"&gt;
  &lt;attrib1&gt;valx&lt;/attrib1&gt;
  &lt;attrib2&gt;valy&lt;/attrib2&gt;
  &lt;attrib3&gt;valx&lt;/attrib3&gt;
  &lt;attrib4&gt;valy&lt;/attrib4&gt;
  &lt;attrib5&gt;valx&lt;/attrib5&gt;
  &lt;attrib6&gt;valy&lt;/attrib6&gt;
  &lt;attrib7&gt;valx&lt;/attrib7&gt;
  &lt;attrib8&gt;valy&lt;/attrib8&gt;
&lt;/object&gt;
&lt;/code&gt;

&lt;P&gt;That produces:&lt;code&gt;
C:\test&gt;t-XML.pl
{
  attrib1     =&gt; "val1",
  attrib5     =&gt; "val3",
  other_param =&gt; "def",
  some_param  =&gt; "abc",
}

{ attrib3 =&gt; "valx", attrib7 =&gt; "valy", some_param =&gt; "xxx" }

{
  attrib1    =&gt; "valx",
  attrib2    =&gt; "valy",
  attrib3    =&gt; "valx",
  attrib4    =&gt; "valy",
  attrib5    =&gt; "valx",
  attrib6    =&gt; "valy",
  attrib7    =&gt; "valx",
  attrib8    =&gt; "valy",
  some_param =&gt; "xyz",
}
&lt;/code&gt;

&lt;p&gt;In addition to allowing the huge file to be processed very quickly in minimal memory, that approach would -- were the processing requirements of the individual chunks sufficiently taxing to warrant it -- make it very easy to process multiple chunks in parallel using threads.
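
&lt;P&gt;A minimal sketch of what the threaded variant might look like, assuming a threads-enabled perl. The pool size of 4 and process_chunk() are hypothetical placeholders; the latter just pulls out some_param and counts the child elements, where the real work (XMLin or whatever) would go:&lt;code&gt;
#! perl -w
use strict;
use threads;
use Thread::Queue;

my $WORKERS = 4;                  # assumed pool size; tune to your cores
my $Q = Thread::Queue-&gt;new;

# Hypothetical stand-in for the real per-chunk work (e.g. XMLin( $chunk )).
sub process_chunk {
    my( $chunk ) = @_;
    my( $param ) = $chunk =~ /some_param="([^"]*)"/;
    my $kids = () = $chunk =~ m[&lt;attrib\d+&gt;]g;   # count opening tags only
    return "$param: $kids child elements";
}

# Each worker pulls chunks off the queue until it dequeues an undef.
my @pool = map {
    threads-&gt;create( sub {
        my @results;
        while( defined( my $chunk = $Q-&gt;dequeue ) ) {
            push @results, process_chunk( $chunk );
        }
        return @results;
    } );
} 1 .. $WORKERS;

{
    # The main thread does the chunking, exactly as in the code above.
    local $/ = '&lt;/object&gt;';
    while( my $chunk = &lt;DATA&gt; ) {
        next unless $chunk =~ /\S/;      # skip the trailing whitespace
        $Q-&gt;enqueue( $chunk );
    }
}
$Q-&gt;enqueue( ( undef ) x $WORKERS );    # one terminator per worker

# Joining in list context collects whatever each worker returned.
print "$_\n" for map { $_-&gt;join } @pool;

__DATA__
&lt;object some_param="abc" other_param="def"&gt;
  &lt;attrib1&gt;val1&lt;/attrib1&gt;
  &lt;attrib5&gt;val3&lt;/attrib5&gt;
&lt;/object&gt;
&lt;object some_param="xxx"&gt;
  &lt;attrib3&gt;valx&lt;/attrib3&gt;
  &lt;attrib7&gt;valy&lt;/attrib7&gt;
&lt;/object&gt;
&lt;/code&gt;

&lt;P&gt;Note that the results come back grouped by worker rather than in file order, and that the queue in this sketch is unbounded, so for a really huge file the reader would need throttling to stop it running far ahead of the workers.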

&lt;P&gt;But if the example is anything like representative of the actual data, the code above will probably process the entire file sufficiently quickly -- in a very casual test, less than 2 minutes -- that the need to consider threading disappears completely. The saving comes simply from processing the file in small chunks rather than en masse.

</field>
<field name="root_node">
1005623</field>
<field name="parent_node">
1005707</field>
</data>
</node>
