
Filter for XML elements

by uituit (Initiate)
on Dec 13, 2012 at 06:22 UTC (#1008630)
uituit has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I am new to Perl / XML::Simple and my only programming experience was with Pascal about 10 years ago.

However there is something I really want to do when I handle some xml files. The code I am dealing with is something like this:

<entry rootUID="281088" Class="A" Name="abc"><h-g><h>abc</h><runhd>abc</runhd><z>/</z><i-g><i>abc123</i></i-g><z>/</z></h-g><zp_pvg /><pv-g><pv>abc is 123</pv><z> (</z><r r="REG" /><z><it2>Reg</it2>) </z><d tranID="1" status="3">abc is a company.<chn localeUID="202" status="3">(Chinese translation)</chn></d><z>: </z><x tranID="2" status="3">abc launched in 1992.<chn localeUID="202" status="3">Some other Chinese translation</chn></x><obj-g><zp_cl /><z>Symbolmark</z><cl>rules</cl><ar>, </ar><cl>decision</cl><ar>, </ar><cl>the law</cl><z>Another Symbol</z><syn>Something </syn><cf>(positive)</cf></obj-g><zp_gr /><z>some symbol</z><gr-g><gr gr="P" /><z><arit>1+1=3</arit></z></gr-g></pv-g></entry>

What I want to do is to extract only the things I need from the code above, and generate something like below:

<entry rootUID="281088" Class="A" Name="abc">
  <h-g>
    <h>abc</h>
  </h-g>
  <pv-g>
    <pv>abc is 123</pv>
    <r r="REG" />
    <d tranID="1" status="3">abc is a company.
      <chn localeUID="1" status="3">(Chinese translation)</chn>
    </d>
    <x tranID="2" status="3">abc launched in 1992.
      <chn localeUID="202" status="3">Some other Chinese translation</chn>
    </x>
    <obj-g>
      <cl>rules</cl>
      <cl>decision</cl>
      <cl>the law</cl>
      <syn>Something </syn><cf>(positive)</cf>
    </obj-g>
    <gr-g>
      <gr gr="P" />
      <arit>1+1=3</arit>
    </gr-g>
  </pv-g>
</entry>

I successfully converted the xml data into this:

$VAR1 = {
  'entry' => {
    'h-g' => {
      'runhd' => 'abc',
      'h' => 'abc',
      'z' => [ '/', '/' ],
      'i-g' => { 'i' => 'abc123' }
    },
    'rootUID' => '281088',
    'zp_pvg' => {},
    'Class' => 'A',
    'Name' => 'abc',
    'pv-g' => {
      'zp_gr' => {},
      'r' => { 'r' => 'REG' },
      'pv' => 'abc is 123',
      'gr-g' => {
        'gr' => { 'gr' => 'P' },
        'z' => { 'arit' => '1+1=3' }
      },
      'x' => {
        'chn' => {
          'localeUID' => '202',
          'status' => '3',
          'content' => 'Some other Chinese translation'
        },
        'tranID' => '2',
        'status' => '3',
        'content' => 'abc launched in 1992.'
      },
      'd' => {
        'chn' => {
          'localeUID' => '202',
          'status' => '3',
          'content' => '(Chinese translation)'
        },
        'tranID' => '1',
        'status' => '3',
        'content' => 'abc is a company.'
      },
      'obj-g' => {
        'cl' => [ 'rules', 'decision', 'the law' ],
        'cf' => '(positive)',
        'syn' => 'Something ',
        'z' => [ 'Symbolmark', 'Another Symbol' ],
        'zp_cl' => {},
        'ar' => [ ', ', ', ' ]
      },
      'z' => [
        ' (',
        { 'it2' => 'Reg', 'content' => ') ' },
        ': ',
        'some symbol'
      ]
    }
  }
};

but I don't know how to filter the data.

As I understand it, I first need to open the file with XML::Simple's XMLin, which converts the XML into a Perl data structure.

Second, I need to set up some if/then rules to decide which data to keep.

Lastly, I need to wrap the data back into XML using XML::Simple.

But after some reading, I still have no idea what to do for Step 2 and Step 3...

Can anyone help?

Replies are listed 'Best First'.
Re: Filter for XML elements
by tobyink (Abbot) on Dec 13, 2012 at 09:02 UTC

    Step 1 is: drop XML::Simple like a rock. If you continue along that route you're in for a world of pain. XML::Simple does have its uses, but what you're doing is not one of them.
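
To make the warning concrete, here is a small sketch of what happens to one fragment of the posted XML when it goes through XML::Simple. The fragment is cut down from the question's `<obj-g>` element, and `round_trip_fragment` is a hypothetical helper name. The repeated `<cl>` elements are collected into one array and `<ar>` is stored under its own key, so the original interleaving (cl, ar, cl) is no longer recorded anywhere in the data structure.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin XMLout);

# Fragment adapted from the <obj-g> element in the question.
my $fragment = '<obj-g><cl>rules</cl><ar>, </ar><cl>decision</cl>'
             . '<syn>Something </syn></obj-g>';

# Hypothetical helper: parse with XMLin, then serialise again with XMLout.
sub round_trip_fragment {
    my ($xml_string) = @_;
    my $data = XMLin($xml_string);               # root element is discarded by default
    my $xml  = XMLout($data, RootName => 'obj-g');
    return ($data, $xml);
}

my ($data, $xml) = round_trip_fragment($fragment);

# Both <cl> values sit in one array; their position relative to <ar> is gone.
print "cl values: @{ $data->{cl} }\n";
print "round trip differs from input\n" if $xml ne $fragment;
```

Because the element order is not kept, any filter written against this structure cannot reproduce the document order on the way back out, which is the core reason the three-step XMLin/filter/XMLout plan is a poor fit here.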

    XML::Twig (mentioned above) is specifically designed for the sort of task you're asking about - setting up a bunch of rules to handle incoming XML features and then streaming some XML through them.
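
For comparison, a rough sketch of the rule-based style with XML::Twig might look like the following. It is not taken from the thread: the `filter_with_twig` name and the `@drop` list are assumptions for illustration, with the element names read off the question's sample data. Each handler fires when its element has been fully parsed; the `<z>` handler first moves any `<arit>` child out in front of the `<z>` and then deletes the wrapper.

```perl
use strict;
use warnings;
use XML::Twig;

# Elements to discard outright (assumed list, based on the question's sample).
my @drop = qw(runhd i-g zp_pvg zp_cl ar zp_gr);

# Hypothetical helper: returns the filtered document as a string.
sub filter_with_twig {
    my ($xml_string) = @_;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # <z>: rescue any <arit> children, then delete the wrapper.
            z => sub {
                my ($t, $z) = @_;
                $_->move(before => $z) for $z->children('arit');
                $z->delete;
            },
            # Every element in @drop is simply deleted once parsed.
            (map { ($_ => sub { $_[1]->delete }) } @drop),
        },
    );

    $twig->parse($xml_string);
    return $twig->sprint;
}

print filter_with_twig(
    '<entry><h-g><h>abc</h><runhd>abc</runhd><z>/</z></h-g>'
  . '<gr-g><gr gr="P" /><z><arit>1+1=3</arit></z></gr-g></entry>'
), "\n";
```

For very large files, XML::Twig can also flush or purge processed parts of the tree as it goes; this sketch keeps the whole document in memory for simplicity.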

    Personally I'd generally use XML::LibXML - not because it's better than XML::Twig for this particular task but because it's the XML library I'm most familiar with. Here's how you could achieve your desired output using XML::LibXML...

    use XML::LibXML 2;
    use XML::LibXML::PrettyPrint 'print_xml';

    my $xml = XML::LibXML->load_xml(IO => \*DATA);

    # Promote <arit> elements out of their <z> container
    $xml
        -> findnodes('//z/arit')
        -> foreach(sub { $_->parentNode->parentNode->appendChild($_) });

    # Remove certain elements
    $xml
        -> findnodes('//z | //runhd | //i-g | //zp_pvg | //zp_cl | //ar | //zp_gr')
        -> foreach(sub { $_->parentNode->removeChild($_) });

    print_xml $xml;

    __DATA__
    <entry rootUID="281088" Class="A" Name="abc"><h-g><h>abc</h><runhd>abc</runhd>
    <z>/</z><i-g><i>abc123</i></i-g><z>/</z></h-g><zp_pvg /><pv-g>
    <pv>abc is 123</pv><z> (</z><r r="REG" /><z><it2>Reg</it2>) </z>
    <d tranID="1" status="3">abc is a company.<chn localeUID="202" status="3"
    >(Chinese translation)</chn></d><z>: </z><x tranID="2" status="3"
    >abc launched in 1992.<chn localeUID="202" status="3"
    >Some other Chinese translation</chn></x><obj-g><zp_cl />
    <z>Symbolmark</z><cl>rules</cl><ar>, </ar><cl>decision</cl><ar>, </ar>
    <cl>the law</cl><z>Another Symbol</z><syn>Something </syn><cf>(positive)</cf>
    </obj-g><zp_gr /><z>some symbol</z><gr-g><gr gr="P" /><z><arit>1+1=3</arit>
    </z></gr-g></pv-g></entry>
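
The code above names the elements to remove. The question is phrased the other way round (keep only what is needed), so here is a sketch of that variant, still using XML::LibXML. The `%keep` list and the `filter_xml` name are made up for illustration, with the element names read off the desired output in the question. It assumes the root element is on the keep list.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumed whitelist, taken from the desired output shown in the question.
my %keep = map { $_ => 1 }
    qw(entry h-g h pv-g pv r d chn x obj-g cl syn cf gr-g gr arit);

# Hypothetical helper: drop every element not in %keep, but promote any
# kept child elements into the place where the dropped element stood.
sub filter_xml {
    my ($xml_string) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml_string);

    # Reverse document order, so children are handled before their parents.
    for my $el (reverse $doc->findnodes('//*')) {
        next if $keep{ $el->nodeName };
        my $parent = $el->parentNode;
        # Remaining child elements are all kept ones; move them up.
        $parent->insertBefore($_, $el) for $el->findnodes('./*');
        # Removing the element also discards its own text content.
        $parent->removeChild($el);
    }
    return $doc->documentElement->toString;
}

print filter_xml(
    '<entry><h-g><h>abc</h><runhd>abc</runhd><z>/</z></h-g>'
  . '<gr-g><gr gr="P" /><z><arit>1+1=3</arit></z></gr-g></entry>'
), "\n";
```

A whitelist is safer when new, unknown elements may show up in later files, since anything not listed is dropped by default. A removal list is shorter when only a handful of elements are unwanted.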
Re: Filter for XML elements
by Anonymous Monk on Dec 13, 2012 at 07:12 UTC
Re: Filter for XML elements
by mertserger (Curate) on Dec 13, 2012 at 10:02 UTC

    This is an un-Perl answer. I don't know what else you might want to do to this file, but have you considered using XSLT instead of Perl? XSLT is designed for transforming XML.
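
If you want to try the XSLT route without leaving Perl, the XML::LibXSLT module can apply a stylesheet from a script. The following is only a sketch: the stylesheet and the `filter_with_xslt` name are illustrative assumptions, not something posted in this thread. It uses an identity template to copy everything, an empty template to drop the unwanted elements, and a `z` template that lets only `<arit>` children through.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Illustrative stylesheet: copy everything except the listed elements.
my $stylesheet = <<'XSLT';
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: copy nodes and attributes unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- Drop <z> itself but keep any <arit> children -->
  <xsl:template match="z">
    <xsl:apply-templates select="arit"/>
  </xsl:template>
  <!-- Drop these elements together with their content -->
  <xsl:template match="runhd|i-g|zp_pvg|zp_cl|ar|zp_gr"/>
</xsl:stylesheet>
XSLT

# Hypothetical helper: apply the stylesheet to an XML string.
sub filter_with_xslt {
    my ($xml_string) = @_;
    my $style = XML::LibXSLT->new->parse_stylesheet(
        XML::LibXML->load_xml(string => $stylesheet)
    );
    my $result = $style->transform(
        XML::LibXML->load_xml(string => $xml_string)
    );
    return $style->output_as_chars($result);
}

print filter_with_xslt(
    '<entry><h-g><h>abc</h><runhd>abc</runhd><z>/</z></h-g>'
  . '<gr-g><gr gr="P" /><z><arit>1+1=3</arit></z></gr-g></entry>'
);
```

The same stylesheet could be run by any XSLT 1.0 processor outside Perl, which is the main attraction of this approach.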

      Thank you very much for all the replies. I will look into XML::Twig and XML::LibXML! Using XSLT is something I have in mind too, but I would also like to explore a programmatic way to do it.

Node Type: perlquestion [id://1008630]
Approved by mbethke