
Re^3: XML::Twig and threads

by BrowserUk (Pope)
on Nov 26, 2012 at 16:36 UTC (#1005712)

in reply to Re^2: XML::Twig and threads
in thread XML::Twig and threads [solved]

The first thing to say is that that is not well-formed XML. (A well-formed XML document must contain exactly one top-level element.)

That said, for the purposes of processing, the absence of that (arbitrary) single enclosing tag works in our favour. Each object element is a complete fragment in its own right, which makes writing a program that processes the large file in smallish chunks very simple:

#! perl -slw
use strict;
use XML::Simple;
use Data::Dump qw[ pp ];

# Make each "line" read from the file one complete <object>...</object> element
$/ = '</object>';

while( <DATA> ) {
    # Stop at the newline(s) left over after the final </object>
    last if /^\n+$/;
    my $xml = XMLin( $_ );
    pp $xml;
}

__DATA__
<object some_param="abc" other_param="def">
    <attrib1>val1</attrib1>
    <attrib5>val3</attrib5>
</object>
<object some_param="xxx">
    <attrib3>valx</attrib3>
    <attrib7>valy</attrib7>
</object>
<object some_param="xyz">
    <attrib1>valx</attrib1>
    <attrib2>valy</attrib2>
    <attrib3>valx</attrib3>
    <attrib4>valy</attrib4>
    <attrib5>valx</attrib5>
    <attrib6>valy</attrib6>
    <attrib7>valx</attrib7>
    <attrib8>valy</attrib8>
</object>

That produces:

C:\test>
{
  attrib1     => "val1",
  attrib5     => "val3",
  other_param => "def",
  some_param  => "abc",
}
{ attrib3 => "valx", attrib7 => "valy", some_param => "xxx" }
{
  attrib1    => "valx",
  attrib2    => "valy",
  attrib3    => "valx",
  attrib4    => "valy",
  attrib5    => "valx",
  attrib6    => "valy",
  attrib7    => "valx",
  attrib8    => "valy",
  some_param => "xyz",
}
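For a real file rather than the DATA section, the same record-separator trick can be wrapped in a small helper. This is only a sketch: the name for_each_object and the in-memory sample are made up for illustration, and the callback stands in for the XMLin call above so that the example needs nothing beyond core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post above): calls $callback once for
# each <object>...</object> chunk read from $fh and returns the chunk count.
sub for_each_object {
    my( $fh, $callback ) = @_;
    # Records end at each closing tag. Note that this setting is still in
    # effect while the callback runs.
    local $/ = '</object>';
    my $count = 0;
    while( my $chunk = <$fh> ) {
        next unless $chunk =~ /<object\b/;   # skip leftover whitespace
        $callback->( $chunk );
        ++$count;
    }
    return $count;
}

# An in-memory filehandle stands in for the real (large) file here.
my $sample = qq{<object a="1"><x>1</x></object>\n}
           . qq{<object a="2"><y>2</y></object>\n};
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my @chunks;
my $n = for_each_object( $fh, sub { push @chunks, $_[0] } );
close $fh;

print "$n chunks read";
```

In real use the callback would be something like sub { pp XMLin( $_[0] ) }, and $fh would come from opening the large file on disk. Only one chunk is held in memory at a time unless the callback keeps it.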

Besides allowing the huge file to be processed very quickly in minimal memory, this approach would also make it very easy to process multiple chunks in parallel using threads, were the processing requirements of the individual chunks sufficiently taxing to warrant it.
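If it did come to that, one possible shape for the threaded version is a small pool of workers fed through Thread::Queue. Treat this as a rough sketch under some assumptions: a perl built with ithreads, an arbitrary worker count of 4, and a placeholder process_chunk that just pulls out some_param with a regex where the real (expensive) per-chunk work would go.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $WORKERS  = 4;                      # assumed; tune to the machine
my $Qwork    = Thread::Queue->new;     # chunks waiting to be processed
my $Qresults = Thread::Queue->new;     # one result per chunk

# Placeholder for the real per-chunk work (e.g. XMLin plus whatever follows).
sub process_chunk {
    my( $chunk ) = @_;
    my( $param ) = $chunk =~ /some_param="([^"]*)"/;
    return $param // '';
}

# Each worker pulls chunks until it receives an undef terminator.
sub worker {
    while( defined( my $chunk = $Qwork->dequeue ) ) {
        $Qresults->enqueue( process_chunk( $chunk ) );
    }
}

# Start the pool first, so the threads do not clone the input data.
my @pool = map { threads->create( \&worker ) } 1 .. $WORKERS;

# An in-memory filehandle stands in for the real (large) file here.
my $sample = join "\n",
    '<object some_param="abc"><attrib1>val1</attrib1></object>',
    '<object some_param="xxx"><attrib3>valx</attrib3></object>',
    '<object some_param="xyz"><attrib1>valx</attrib1></object>',
    '';
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
{
    local $/ = '</object>';            # one <object> element per read
    while( my $chunk = <$fh> ) {
        next unless $chunk =~ /<object\b/;
        $Qwork->enqueue( $chunk );
    }
}
close $fh;

$Qwork->enqueue( (undef) x $WORKERS ); # one terminator per worker
$_->join for @pool;

# All workers have finished, so the results queue can be drained without blocking.
my @results;
while( defined( my $r = $Qresults->dequeue_nb ) ) {
    push @results, $r;
}
print join( ' ', sort @results ), "\n";
```

Results arrive in whatever order the workers finish, hence the sort. For a file much larger than memory the reader would also need throttling, for example by pausing while $Qwork->pending is high, otherwise the whole file can end up sitting in the queue.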

But if the example is anything like representative of the actual data, the above code will probably process the entire file quickly enough (less than 2 minutes in a very casual test) that the need to consider threading disappears completely. The saving comes simply from processing the file in small chunks rather than en masse.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong
