http://www.perlmonks.org?node_id=935133

vagabonding electron has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
please help me with the following issue.
Given is the following XML file (in fact an excerpt from a huge file, but it shows the problem, IMHO):
<excerpt>
  <unit>
    <unitnumber>1</unitnumber>
    <Name>Entity A</Name>
    <Boss>Name</Boss>
    <contactinfo>
      <Address>
        <Street>SomeStreet</Street>
        <Building>1</Building>
        <zip>00000</zip>
        <Ort>Town</Ort>
      </Address>
      <Telefon>
        <code>0123</code>
        <telnumber>456</telnumber>
        <directcall>78910</directcall>
      </Telefon>
      <Fax>
        <code>0123</code>
        <telnumber>456</telnumber>
        <directcall>10987</directcall>
      </Fax>
      <email></email>
      <URL></URL>
    </contactinfo>
    <products>
      <article> <art_code>A3236</art_code> <quantity>554</quantity> </article>
      <article> <art_code>B9735</art_code> <quantity>386</quantity> </article>
      <article> <art_code>C1299</art_code> <quantity>322</quantity> </article>
      <article> <art_code>D1918</art_code> <quantity_small/> </article>
      <article> <art_code>E0702</art_code> <quantity_small/> </article>
      <article> <art_code>F1290</art_code> <quantity_small/> </article>
    </products>
  </unit>
  <unit>
    <unitnumber>2</unitnumber>
    <Name>Entity B</Name>
    <Boss>Name</Boss>
    <contactinfo>
      <Address>
        <Street>SomeOtherStreet</Street>
        <Building>2</Building>
        <zip>11111</zip>
        <Ort>City</Ort>
      </Address>
      <Telefon>
        <code>0999</code>
        <telnumber>456</telnumber>
        <directcall>78910</directcall>
      </Telefon>
      <Fax>
        <code>0999</code>
        <telnumber>456</telnumber>
        <directcall>10987</directcall>
      </Fax>
      <email></email>
      <URL></URL>
    </contactinfo>
    <products>
      <article> <art_code>A1136</art_code> <quantity>1982</quantity> </article>
      <article> <art_code>B0765</art_code> <quantity>988</quantity> </article>
      <article> <art_code>C8099</art_code> <quantity>522</quantity> </article>
      <article> <art_code>D3938</art_code> <quantity_small/> </article>
      <article> <art_code>E5722</art_code> <quantity_small/> </article>
      <article> <art_code>F3596</art_code> <quantity_small/> </article>
    </products>
  </unit>
</excerpt>
I need (among other things) to get the list of paired values "art_code" and "quantity" per unit - e.g. in the following form:
Entity A;A3236;554
Entity A;B9735;386
Entity A;C1299;322
...
Entity B;A1136;1982
etc.
Since I am a novice in Perl, I have only managed the following crutch so far; its output could at least be post-processed with a regex (blush).
use strict;
use warnings;
use XML::LibXML;

my $filename   = "Test.xml";
my $my_object  = XML::LibXML->new();
my $treeobjekt = $my_object->parse_file($filename);
my $root       = $treeobjekt->getDocumentElement;
my @units      = $treeobjekt->findnodes("//excerpt/unit");

for (my $i = 0; $i < @units; $i++) {
    my $unitname  = $units[$i]->findvalue('./Name/text()');
    my $art       = $units[$i]->findvalue('./products/article');
    my $art_chain = join('---', split(/\n/, $art));
    print "$unitname;$art_chain\n";
}
I have an additional problem here too. As you can see, some items have an exact number in <quantity>, while others have a <quantity_small/> tag instead.
I would like to get only the items that have exact numbers. I tried to modify the above script in the following way:
my $art_chain;
if ($units[$i]->findvalue('./products/article/quantity') > 0) {
    $art_chain = join('---', split(/\n/, $art));
}
but it seems to have no effect on the output.
How could I get the paired values in a better way and filter out the items without exact quantity information?
Thank you in advance for your help!
VE
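A note on why the modified condition probably changes nothing: it is evaluated once per unit, not once per article, and findvalue on a path that matches several nodes returns their text concatenated, so the test is true for every unit that has at least one exact quantity. A minimal sketch of a per-article filter follows, assuming XML::LibXML is installed; the inline XML is a shortened stand-in for the posted file and the @rows array is only there for illustration. The XPath predicate [quantity] selects just the articles that have a <quantity> child.

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened stand-in for the posted file: one exact and one "small" article.
my $xml = <<'XML';
<excerpt>
  <unit>
    <Name>Entity A</Name>
    <products>
      <article><art_code>A3236</art_code><quantity>554</quantity></article>
      <article><art_code>D1918</art_code><quantity_small/></article>
    </products>
  </unit>
</excerpt>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

my @rows;    # illustrative container for the output lines
for my $unit ($doc->findnodes('/excerpt/unit')) {
    my $name = $unit->findvalue('./Name');

    # The predicate [quantity] keeps only articles with a <quantity> child.
    for my $article ($unit->findnodes('./products/article[quantity]')) {
        push @rows, join ';', $name,
            $article->findvalue('./art_code'),
            $article->findvalue('./quantity');
    }
}

print "$_\n" for @rows;
```

For the real file, load_xml(location => $filename) would replace the inline string.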

Re: How to get paired values from the nested XML structure?
by choroba (Cardinal) on Nov 01, 2011 at 15:01 UTC
    I usually use XML::XSH2 for XML processing. In this case, I'd just need the following lines:
    open 935133.xml ;
    for /excerpt/unit/products/article/quantity
        echo (../../../Name) (../art_code) (.) ;
      Unfortunately XML::XSH2 seems not to be available for ActivePerl v.5.12.
      Sorry choroba, I forgot to say thank you; that is not my style :-)
Re: How to get paired values from the nested XML structure?
by Jenda (Abbot) on Nov 01, 2011 at 19:01 UTC

    Would you like a data structure like this?

    {
      'Entity B' => {
        'Boss' => 'Name',
        'unitnumber' => '2',
        'contactinfo' => {
          'URL' => undef,
          'email' => undef,
          'Telefon' => {
            'telnumber' => '456',
            'directcall' => '78910',
            'code' => '0999'
          },
          'Address' => {
            'zip' => '11111',
            'Street' => 'SomeOtherStreet',
            'Ort' => 'City',
            'Building' => '2'
          },
          'Fax' => {
            'telnumber' => '456',
            'directcall' => '10987',
            'code' => '0999'
          }
        },
        'products' => {
          'E5722' => 'few',
          'C8099' => '522',
          'F3596' => 'few',
          'B0765' => '988',
          'A1136' => '1982',
          'D3938' => 'few'
        }
      },
      'Entity A' => {
        'Boss' => 'Name',
        'unitnumber' => '1',
        'contactinfo' => {
          ...
    use strict;
    use XML::Rules;
    use Data::Dumper;

    my $parser = XML::Rules->new(
        rules => {
            'excerpt' => 'pass no content',
            'Address,Fax,Telefon,contactinfo,products' => 'no content',
            'Boss,Building,Name,Ort,Street,URL,art_code,code,directcall,email,quantity,quantity_small,telnumber,unitnumber,zip' => 'content',
            'article' => sub {
                if (exists $_[1]->{quantity_small}) {
                    return #'%article' =>{
                        $_[1]->{art_code} => 'few'
                    # };
                } else {
                    return #'%article' => {
                        $_[1]->{art_code} => $_[1]->{quantity}
                    #};
                }
            },
            'unit' => 'no content by Name',
        }
    );

    my $data = $parser->parse(\*DATA);
    print Dumper($data);

    __DATA__
    <excerpt>
    <unit>
    <unitnumber>1</unitnumber>
    ...

    The base set of rules was generated by:
    perl -MData::Dumper -MXML::Rules -e "print Dumper(XML::Rules::inferRulesFromExample( 'c:\temp\excerpt.xml'))"

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      Thank you Jenda,
      I must read this carefully and try it. I did not know XML::Rules before. The good news: this module exists for ActivePerl.
      The "real" huge xml file consists of many nested structures like in the example. They "dive" from the surface of simple data such as the "address" or the "boss name" (and the "unit_id").
      Hence my idea was to produce several CSV files keyed by the id of the unit (shown in the example as the unit name) and to join them in the database later. This eclectic (promiscuous? :-)) idea comes from my limited knowledge of Perl and from having to get things running at the same time.

        If the file is huge, you can process it in parts. In this case, with XML::Rules, that would mean the rule for <unit> is a subroutine that inserts the unit's data into the database and then returns nothing. That way you do not keep the already processed data in memory.

        Another good module for processing huge XML files is XML::Twig.
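To illustrate the "process it in parts" idea with XML::Twig, here is a minimal sketch, assuming XML::Twig is installed. The handler fires once per completed <unit>, collects its exact-quantity articles, and then purges what has been parsed so far. The @rows array is a hypothetical stand-in for the database insert or CSV writer, and the inline XML is a shortened version of the posted excerpt.

```perl
use strict;
use warnings;
use XML::Twig;

my @rows;    # hypothetical stand-in for a database insert or CSV writer

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <unit> element has been parsed.
        unit => sub {
            my ($t, $unit) = @_;
            my $name = $unit->first_child_text('Name');

            for my $article ($unit->descendants('article')) {
                # first_child_text returns '' when there is no <quantity>.
                my $qty = $article->first_child_text('quantity');
                next if $qty eq '';    # skip <quantity_small/> items
                push @rows, join ';', $name,
                    $article->first_child_text('art_code'), $qty;
            }

            $t->purge;    # release the memory held by the units parsed so far
        },
    },
);

$twig->parse(<<'XML');
<excerpt>
  <unit>
    <Name>Entity A</Name>
    <products>
      <article><art_code>A3236</art_code><quantity>554</quantity></article>
      <article><art_code>D1918</art_code><quantity_small/></article>
    </products>
  </unit>
  <unit>
    <Name>Entity B</Name>
    <products>
      <article><art_code>A1136</art_code><quantity>1982</quantity></article>
    </products>
  </unit>
</excerpt>
XML

print "$_\n" for @rows;
```

For a file on disk, $twig->parsefile($filename) would take the place of the inline string.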

        Jenda

Re: How to get paired values from the nested XML structure?
by vagabonding electron (Curate) on Nov 01, 2011 at 19:20 UTC
    Dear All,

    it seems to run now in the following setting:
    use strict;
    use warnings;
    use XML::LibXML;

    my $filename   = "Test.xml";
    my $my_object  = XML::LibXML->new();
    my $treeobjekt = $my_object->parse_file($filename);
    my $root       = $treeobjekt->getDocumentElement;
    my @articles   = $treeobjekt->findnodes('//article');

    for (my $j = 0; $j < @articles; $j++) {
        my $unitname = $articles[$j]->parentNode->parentNode->findvalue('./Name/text()');
        my $article  = $articles[$j]->findvalue('./art_code/text()');
        my $amount   = $articles[$j]->findvalue('./quantity/text()') // "0";
        print "$unitname;$article;$amount\n";
    }
    prints:
    Entity A;A3236;554
    Entity A;B9735;386
    Entity A;C1299;322
    Entity A;D1918;
    Entity A;E0702;
    Entity A;F1290;
    Entity B;A1136;1982
    Entity B;B0765;988
    Entity B;C8099;522
    Entity B;D3938;
    Entity B;E5722;
    Entity B;F3596;
    which can be easily processed further.
    I still do not understand why the condition
    findvalue('./quantity/text()')//"0";
    does not work.
    I would be very glad if you would improve this code.
    Many thanks!
    VE
      You can use Data::Dumper to see what your variables contain. For articles without quantity, $amount is the empty string q(), which means it is defined. If you replace // with ||, you will get the desired behaviour.
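The difference between the two operators can be seen in isolation with a small sketch that needs only core Perl 5.10 or later. The variable names are illustrative; $empty plays the role of what findvalue returned for articles without <quantity>.

```perl
use strict;
use warnings;

my $empty = '';    # defined, but false: what $amount held for "small" articles
my $undef;         # genuinely undefined

# // (defined-or) falls back only when the left side is undef.
my $defined_or_empty = $empty // '0';    # stays ''
my $defined_or_undef = $undef // '0';    # becomes '0'

# || (logical or) falls back whenever the left side is false,
# which includes the empty string (and also '0' itself).
my $logical_or_empty = $empty || '0';    # becomes '0'

printf "[%s] [%s] [%s]\n",
    $defined_or_empty, $defined_or_undef, $logical_or_empty;
# prints: [] [0] [0]
```

Because || also replaces a literal quantity of 0, it is only a safe choice here since the fallback value is "0" as well.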
        Thank you choroba, it works this way! Once again I have learned something on PM!
        ... now trying this "in real life" with the complete file.
        Thanks again.
        VE