Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

Re^3: XML::Twig and threads

by BrowserUk (Pope)
on Nov 26, 2012 at 16:36 UTC ( #1005712=note: print w/replies, xml ) Need Help??

in reply to Re^2: XML::Twig and threads
in thread XML::Twig and threads [solved]

The first thing to say is that that is not valid XML. (A valid XML document must contain a single top level tag.)

That said, for the purposes of processing, that (arbitrary) XML rule works in our favour and makes writing a program that processes the large file in smallish chunks very simple:

#! perl -slw use strict; use XML::Simple; use Data::Dump qw[ pp ]; $/ = '</object>'; while( <DATA> ) { last if /^\n+$/; my $xml = XMLin( $_ ); pp $xml; } __DATA__ <object some_param="abc" other_param="def"> <attrib1>val1</attrib1> <attrib5>val3</attrib5> </object> <object some_param="xxx"> <attrib3>valx</attrib3> <attrib7>valy</attrib7> </object> <object some_param="xyz"> <attrib1>valx</attrib1> <attrib2>valy</attrib2> <attrib3>valx</attrib3> <attrib4>valy</attrib4> <attrib5>valx</attrib5> <attrib6>valy</attrib6> <attrib7>valx</attrib7> <attrib8>valy</attrib8> </object>

That produces:

C:\test> { attrib1 => "val1", attrib5 => "val3", other_param => "def", some_param => "abc", } { attrib3 => "valx", attrib7 => "valy", some_param => "xxx" } { attrib1 => "valx", attrib2 => "valy", attrib3 => "valx", attrib4 => "valy", attrib5 => "valx", attrib6 => "valy", attrib7 => "valx", attrib8 => "valy", some_param => "xyz", }

In addition to that allowing the huge file to be processed very quickly in minimal memory, it would -- were the processing requirements of the individual chunks sufficiently taxing to warrant it -- enable multiple individual chunks to be processed in parallel with threading very easily.

But, if the example is anything like representative of the actual data, that above code will probably allow the entire file to be processed sufficiently quickly -- in a very casual test; less that 2 minutes -- that the need for considering threading disappears completely. The saving coming simply from processing the file in small chunks rather than en masse.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1005712]
[Corion]: Eily: I've thought of that, but I don't like running long cables through the appartment because sooner or later I'll trip over it, pulling at least one device off its stand :)
[marto]: hmm, may have to patch CPAN::Meta to move from search.cpan to metacpan in the META.json/yml files
[Corion]: marto: Heh - it seems that they plan to keep the links alive for a long time. But still, I plan on moving PM to use/generate the new links
[Corion]: And I think it's better to generate links to the new world instead of keeping the older links alive by generating new versions of them ;)
[marto]: yeah, I guess it's supposed to be a permanent redirect, but better to make the move where possible

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (12)
As of 2018-05-23 09:22 GMT
Find Nodes?
    Voting Booth?