PerlMonks  

Re^3: XML::Twig and threads

by BrowserUk (Patriarch)
on Nov 26, 2012 at 16:36 UTC ( [id://1005712]=note )


in reply to Re^2: XML::Twig and threads
in thread XML::Twig and threads [solved]

The first thing to say is that that is not well-formed XML. (A well-formed XML document must contain exactly one top-level (root) element.)

That said, for the purposes of processing, breaking that (arbitrary) XML rule works in our favour: the file is just a sequence of independent <object> elements, which makes writing a program that processes the large file in smallish chunks very simple:

#! perl -slw
use strict;
use XML::Simple;
use Data::Dump qw[ pp ];

# Read one complete <object>...</object> element per "line"
$/ = '</object>';

while( <DATA> ) {
    last if /^\n+$/;          # only the trailing newline(s) remain: done
    my $xml = XMLin( $_ );    # parse this chunk on its own
    pp $xml;
}

__DATA__
<object some_param="abc" other_param="def">
<attrib1>val1</attrib1>
<attrib5>val3</attrib5>
</object>
<object some_param="xxx">
<attrib3>valx</attrib3>
<attrib7>valy</attrib7>
</object>
<object some_param="xyz">
<attrib1>valx</attrib1>
<attrib2>valy</attrib2>
<attrib3>valx</attrib3>
<attrib4>valy</attrib4>
<attrib5>valx</attrib5>
<attrib6>valy</attrib6>
<attrib7>valx</attrib7>
<attrib8>valy</attrib8>
</object>

That produces:

C:\test>t-XML.pl
{
  attrib1 => "val1",
  attrib5 => "val3",
  other_param => "def",
  some_param => "abc",
}
{ attrib3 => "valx", attrib7 => "valy", some_param => "xxx" }
{
  attrib1 => "valx",
  attrib2 => "valy",
  attrib3 => "valx",
  attrib4 => "valy",
  attrib5 => "valx",
  attrib6 => "valy",
  attrib7 => "valx",
  attrib8 => "valy",
  some_param => "xyz",
}
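
For a real file rather than the DATA section, the same record-separator trick can be wrapped in a small helper. The following is only a sketch: chunk_reader is a hypothetical name of my own, and an in-memory filehandle stands in for the large file so that the example runs as-is. In practice you would open the real file and pass each chunk to XMLin().

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): returns an iterator that
# hands back one chunk per call, each ending with the given closing tag.
sub chunk_reader {
    my( $fh, $end_tag ) = @_;
    return sub {
        local $/ = $end_tag;              # read up to and including the tag
        while( defined( my $chunk = <$fh> ) ) {
            next unless $chunk =~ /\S/;   # skip whitespace-only leftovers
            return $chunk;
        }
        return;                           # end of file
    };
}

# Stand-in for the large file; replace with open( my $fh, '<', $path ).
my $sample = join "\n",
    '<object some_param="abc"><attrib1>val1</attrib1></object>',
    '<object some_param="xxx"><attrib3>valx</attrib3></object>',
    '';

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $next = chunk_reader( $fh, '</object>' );

my @chunks;
while( defined( my $chunk = $next->() ) ) {
    push @chunks, $chunk;                 # real code: XMLin( $chunk )
}
close $fh;

printf "%d chunks read\n", scalar @chunks;
```

Only one chunk is held in memory at a time (the array here exists purely for the demonstration), which is where the memory saving comes from.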

Besides allowing the huge file to be processed very quickly in minimal memory, this chunked approach would also make it very easy to process multiple chunks in parallel with threads, were the processing requirements of the individual chunks sufficiently taxing to warrant it.
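
Should threading turn out to be warranted, one possible shape is a small pool of workers fed from a queue. This is a rough sketch rather than tested production code: it assumes a perl built with thread support, the worker count of 4 is arbitrary, and process_chunk is a hypothetical placeholder (it merely counts opening tags) standing in for XMLin() plus the real work. An in-memory filehandle again stands in for the large file.

```perl
#! perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $WORKERS  = 4;                    # assumed pool size; tune to the machine
my $Qwork    = Thread::Queue->new;   # chunks waiting to be processed
my $Qresults = Thread::Queue->new;   # one result per chunk

# Hypothetical placeholder for the real per-chunk work:
# here it only counts the opening tags in the chunk.
sub process_chunk {
    my( $chunk ) = @_;
    my $count = () = $chunk =~ m[<\w+]g;
    return $count;
}

# Each worker pulls chunks until it sees an undef terminator.
sub worker {
    while( defined( my $chunk = $Qwork->dequeue ) ) {
        $Qresults->enqueue( process_chunk( $chunk ) );
    }
}

my @pool = map { threads->create( \&worker ) } 1 .. $WORKERS;

my $sample = <<'XML';
<object some_param="abc" other_param="def">
<attrib1>val1</attrib1>
<attrib5>val3</attrib5>
</object>
<object some_param="xxx">
<attrib3>valx</attrib3>
<attrib7>valy</attrib7>
</object>
XML

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
{
    local $/ = '</object>';              # one <object> element per read
    while( defined( my $chunk = <$fh> ) ) {
        next unless $chunk =~ /\S/;      # skip the trailing newline
        $Qwork->enqueue( $chunk );
    }
}
close $fh;

$Qwork->enqueue( ( undef ) x $WORKERS ); # one terminator per worker
$_->join for @pool;

# All workers have finished, so a non-blocking drain collects everything.
my @results;
while( defined( my $r = $Qresults->dequeue_nb ) ) {
    push @results, $r;
}
print "chunk with $_ elements\n" for sort { $a <=> $b } @results;
```

Results arrive in whatever order the workers finish, so if the original order matters each chunk would need to carry a sequence number.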

But if the example is anything like representative of the actual data, the simple single-threaded code shown earlier will probably process the entire file quickly enough (in a very casual test, less than 2 minutes) that the need to consider threading disappears completely. The saving comes simply from processing the file in small chunks rather than en masse.


