PerlMonks  

Having problems accessing individual attributes in xml

by Gemenon (Initiate)
on Oct 21, 2010 at 00:21 UTC ( [id://866442]=perlquestion )

Gemenon has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm trying to write a script to reconstruct the directory structures and file names described by an XML file, but I'm meeting with mixed success. As far as Perl scripting goes, I'm still using my training wheels.

At work we have an application that archives directories and files by renaming all the files and directories to MD5 hash names, then tossing the lot into a single directory. It writes the description of "which file goes where" into an XML document.

Unfortunately, it also tosses in about 10 attributes for every item, only two of which I really need: the original name and the MD5 equivalent name. I found an example of a script that does something similar and was able to modify it for my needs. The script doesn't seem to like anything as complex as my XML document, though. It spits the output out as one long unbroken string of MD5 names, followed by another unbroken string of file names.

It is getting the directory structures right, but just mashing everything in the directory together. I'm just not understanding how to make the script isolate individual attributes correctly for each XML element. Here is an example of the XML data structure:

    <ncp_directory op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" user_specific="0" ntperm="0" name="$dir1" flags="" lm="129232888600305382" cr="129232888600305382" >
      <ncp_directory op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" user_specific="0" ntperm="0" name="CutePDFWriter" flags="" lm="129232886309260678" cr="129232711066448490" >
        <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="cpwmon2k.dll" length="87552" md5="27A8QATED9I2Ox8F65OGEPPDCIV" flags="a" lm="129018983800000000" cr="129232711245774126" gac_register_op="SAME" register="false" />
        <ncp_directory op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" user_specific="0" ntperm="0" name="converter" flags="" lm="129232881029776793" cr="129232870612045954" >
          <ncp_directory op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" user_specific="0" ntperm="0" name="GPLGS" flags="" lm="129232870625951047" cr="129232870612202191" >
            <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="gsdll32.dll" length="2768896" md5="5F7UGLCH9K3GKxBNML1LM0G3RNL" flags="a" lm="127407045220000000" cr="129232870614545746" gac_register_op="SAME" register="false" />
            <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="a010013l.pfb" length="69958" md5="7EDJ7V7QHMBQ1x6HLC54FG0OP6T" flags="a" lm="126854965940000000" cr="129232870612202191" gac_register_op="SAME" />
            <!--- truncated for brevity sake-->
            <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="z003034l.pfb" length="113405" md5="D6I2GGENUCQLEx6FMO1IPG1E8F7" flags="a" lm="126854124840000000" cr="129232870625951047" gac_register_op="SAME" />
            <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="zeroline.ps" length="2567" md5="FETPJPBOOF039xCCTQFGII9DNN0" flags="a" lm="126588917680000000" cr="129232870625951047" gac_register_op="SAME" />
          </ncp_directory>
          <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="GSSetup.exe" length="122880" md5="E61K8P45E8D81x3T3E47C8QIP0U" flags="a" lm="127751787000000000" cr="129232870612045954" gac_register_op="SAME" />
        </ncp_directory>

As you can see, the pattern is: an opening directory tag with its name, the files it contains (each with its attributes), then the closing directory tag, and so on. Here is my script:

    use XML::XPath;

    my $file = 'ncpobjs.xml';
    my $xp = XML::XPath->new(filename => $file);

    foreach my $ncptype ($xp->find('//ncp_directory')->get_nodelist){
        print $ncptype->find('ncp_file')->string_value;
        print ' (' . $ncptype->find('@name') . ') ';
        print $ncptype->find('ncp_file/@md5'), " ", $ncptype->find('ncp_file/@name'), "\n";
        print "\n";
    }

And here is an example of the quasi-gibberish that I'm getting as output for each directory level:

(x64) ALSOO431VHGO2x825OF80GN8RNM605U9UMHOR3M1xEQIPMMRKFK3F0 PSCRIPT.HLPPSCRIPT.NTF

So it boils down to how do I change this odd output into something like "MD5 Name = File Name" for each file element? I have the feeling I might need another for-loop inside to deal with the files, I just can't figure out where to place it. Any insight would be very much appreciated!

Replies are listed 'Best First'.
Re: Having problems accessing individual attributes in xml
by ikegami (Patriarch) on Oct 21, 2010 at 06:38 UTC
    Using XML::LibXML syntax, but you should get the idea.
        # $root is the parsed XML::LibXML document (or its root element)
        for my $dir_node ($root->findnodes('//ncp_directory')) {
            my $dir_name = $dir_node->getAttribute('name');
            for my $file_node ($dir_node->findnodes('ncp_file')) {
                my $md5       = $file_node->getAttribute('md5');
                my $file_name = $file_node->getAttribute('name');
                ...
            }
        }

      Thank you bigtime ikegami! This works perfectly! I also owe you one!

      Between your script and dasgar's example above I can do what I set out to do, and move ahead into the next stages of my project!

      A big thanks to all who answered my plea for help!
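
For anyone who wants to run ikegami's loop as-is, here is a minimal self-contained sketch. It assumes XML::LibXML is installed from CPAN. The inline sample is a cut-down version of the posted data with most attributes dropped, and the sub name md5_lines is made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down sample in the shape of the posted data (most attributes omitted).
# The heredoc is single-quoted so that "$dir1" is not interpolated.
my $xml = <<'XML';
<ncp_directory name="$dir1">
  <ncp_directory name="CutePDFWriter">
    <ncp_directory name="Driver">
      <ncp_file name="ICONLIB.DLL" md5="D6VRKCQ4IFOSTxCLRJ9GHN6KR6J" />
      <ncp_directory name="x64">
        <ncp_file name="PS5UI.DLL" md5="3REVK8VG65NGUx88P61CUHUN603" />
        <ncp_file name="PSCRIPT5.DLL" md5="BIAA93SK0Q9VRxBT7BCG7U4L2F0" />
      </ncp_directory>
    </ncp_directory>
  </ncp_directory>
</ncp_directory>
XML

# Hypothetical helper: returns one "(dir) MD5 = name" string per file.
sub md5_lines {
    my ($xml_text) = @_;
    my $root = XML::LibXML->load_xml(string => $xml_text)->documentElement;
    my @lines;
    # Outer loop: every directory at any depth.
    for my $dir_node ($root->findnodes('//ncp_directory')) {
        my $dir_name = $dir_node->getAttribute('name');
        # Inner loop: only the files that are direct children of this directory.
        for my $file_node ($dir_node->findnodes('ncp_file')) {
            push @lines, sprintf('(%s) %s = %s',
                $dir_name,
                $file_node->getAttribute('md5'),
                $file_node->getAttribute('name'));
        }
    }
    return @lines;
}

print "$_\n" for md5_lines($xml);
```

The relative path 'ncp_file' in the inner loop is what keeps each directory's files separate: it is evaluated against one directory node at a time rather than against the whole document.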
Re: Having problems accessing individual attributes in xml
by Khen1950fx (Canon) on Oct 21, 2010 at 01:10 UTC
    The XML isn't well-formed: it's missing an element at EOF, which triggers a premature-end-of-data error. Could you post the correct data?

    Update: Here's an example using sample xml from XML::XPath:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use XML::XPath;
        use XML::XPath::XMLParser;

        my $file = 'test.xml';
        my $xp = XML::XPath->new( filename => $file );
        my $nodeset = $xp->find('//employee');

        foreach my $node ($nodeset->get_nodelist) {
            print XML::XPath::XMLParser::as_string($node), "\n\n";
        }

    test.xml
        <?xml version="1.0" encoding="ISO-8859-1"?>
        <timesheet xmlns:a="www" xmlns:b="xxx" xmlns="fred">
          <employee>
            <name>
              <forename>Matt</forename>
              <surname>Sergeant</surname>
            </name>
            <department>Development IT</department>
          </employee>
          <rules>
            <rule>NextRule1</rule>
            <rule>NextRule2</rule>
          </rules>
          <projects>
            <project a:Name="Consultancy &gt; fred" b:Name="Fred">
              <sunday>0.00</sunday>
              <monday>0.00</monday>
              <tuesday>7.75</tuesday>
              <wednesday>8.75</wednesday>
              <thursday>7.75</thursday>
              <friday>6.5</friday>
              <saturday>0.00</saturday>
            </project>
            <project Name="Holiday">
              <sunday>0.00</sunday>
              <monday>7.75</monday>
              <tuesday>0.00</tuesday>
              <wednesday>0.00</wednesday>
              <thursday>0.00</thursday>
              <friday>0.00</friday>
              <saturday>0.00</saturday>
            </project>
          </projects>
        </timesheet>

      Hi Khen. Yes, the XML is a bit non-standard because it uses a proprietary schema. I've already scrubbed it down to this more standard format, but I can scrub it further if the current format is still too alien. I'll post a shortened version of the whole file (I just took out a large number of ncp_file elements for easier reading).

      <?xml version="1.0"?>
      <softpkg NAME="CutePDFWriter" VERSION="0" >
        <implementation>
          <processor VALUE="ALL" />
          <os VALUE="WinXP" />
          <disksize VALUE="0" />
          <ncp_sysdisksize VALUE="0" />
          <ncp_environment source="user" >
          </ncp_environment>
          <ncp_environment source="system" >
          </ncp_environment>
          <ncp_directory op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" user_specific="0" ntperm="0" name="$dir1" flags="" lm="129232888600305382" cr="129232888600305382" >
            <ncp_directory op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" user_specific="0" ntperm="0" name="CutePDFWriter" flags="" lm="129232886309260678" cr="129232711066448490" >
              <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="cpwmon2k.dll" length="87552" md5="27A8QATED9I2Ox8F65OGEPPDCIV" flags="a" lm="129018983800000000" cr="129232711245774126" gac_register_op="SAME" register="false" />
              <ncp_directory op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" user_specific="0" ntperm="0" name="converter" flags="" lm="129232881029776793" cr="129232870612045954" >
                <ncp_directory op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" user_specific="0" ntperm="0" name="GPLGS" flags="" lm="129232870625951047" cr="129232870612202191" >
                  <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="gsdll32.dll" length="2768896" md5="5F7UGLCH9K3GKxBNML1LM0G3RNL" flags="a" lm="127407045220000000" cr="129232870614545746" gac_register_op="SAME" register="false" />
                  <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="a010013l.pfb" length="69958" md5="7EDJ7V7QHMBQ1x6HLC54FG0OP6T" flags="a" lm="126854965940000000" cr="129232870612202191" gac_register_op="SAME" />
                <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="GSSetup.exe" length="122880" md5="E61K8P45E8D81x3T3E47C8QIP0U" flags="a" lm="127751787000000000" cr="129232870612045954" gac_register_op="SAME" />
              </ncp_directory>
              <ncp_directory op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" user_specific="0" ntperm="0" name="Driver" flags="" lm="129232870627357180" cr="129232870626107284" >
                <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="ICONLIB.DLL" length="118144" md5="D6VRKCQ4IFOSTxCLRJ9GHN6KR6J" flags="a" lm="126049716000000000" cr="129232870626263521" v="4.90.0.3000" gac_register_op="SAME" register="false" />
                <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="PS5UI.DLL" length="728576" md5="9EJQCU5IT5H58xBCG8GIT8PQEBS" flags="a" lm="128069307720000000" cr="129232870626419758" v="0.3.6000.16386" gac_register_op="SAME" register="false" />
                <ncp_directory op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" user_specific="0" ntperm="0" name="x64" flags="" lm="129232870627825891" cr="129232870627357180" >
                  <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="PS5UI.DLL" length="850432" md5="3REVK8VG65NGUx88P61CUHUN603" flags="a" lm="128069363200000000" cr="129232870627357180" v="0.3.6000.16386" gac_register_op="SAME" register="false" />
                  <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="PSCRIPT5.DLL" length="628736" md5="BIAA93SK0Q9VRxBT7BCG7U4L2F0" flags="a" lm="128069363220000000" cr="129232870627825891" v="0.3.6000.16386" gac_register_op="SAME" register="false" />
                </ncp_directory>
                <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="CUTEPDFW.PPD" length="31736" md5="1QBHH0SQPIJ2Cx1REJOP1QAUJJK" flags="a" lm="128632075980000000" cr="129232870626107284" gac_register_op="SAME" />
                <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="Cutepdfw.spd" length="16697" md5="9M25IA0L5NFKNx60S9M36K0FA6U" flags="a" lm="127520886320000000" cr="129232870626107284" gac_register_op="SAME" />
              </ncp_directory>
              <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="CPWSave.exe" length="239104" md5="5CD14FVPVPV2TxFDKG6C6OND5U1" flags="a" lm="129018985060000000" cr="129232711229841012" v="2.7.3.1" gac_register_op="SAME" />
              <ncp_file op="ADD" ipol="0" iprs="0" uppol="0" upprs="0" upol="0" uprs="0" vpol="1" vnipol="1" rpol="1" name="install.bat" length="700" md5="DKARE24NS4V4AxCM8QQ89CSDFDI" flags="a" lm="129232882384787064" cr="129232753155202156" gac_register_op="SAME" />
            </ncp_directory>
          </ncp_directory>
        </implementation>
      </softpkg>

      Thanks!

        My first instinct was to suggest an XML parsing module. However, it looks like your data might not be strictly following XML formatting rules (i.e. Khen1950fx's comment). So, I decided to step into the trap of rolling my own code. The trap comes from working off of assumptions made from your "scrubbed" data.

        Below is the code that I came up with and the output that it produced. Although you didn't ask for path information, it seemed like a natural next step, which is why I went ahead and added it into the code.

        Code:

        Output:

Re: Having problems accessing individual attributes in xml
by Anonymous Monk on Oct 21, 2010 at 01:09 UTC
    In pseudocode, what you're probably looking for is
        for $dirresult (result of '//ncp_directory' wrt $document) {
            for $fileresult (result of 'ncp_file' wrt $dirresult) {
                print(
                    (result of '@md5' wrt $fileresult),
                    ' = ',
                    (result of '@name' wrt $fileresult)
                );
            }
        }

      Thanks for the reply, and yes, that pseudocode does look like what I have in mind. Unfortunately I'm so green at this that I don't know which Perl commands to use. That's a start though; at least I can go do some more focused research based on your answer.
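
One possible translation of that pseudocode into XML::XPath, the module the original script already uses. The likely reason the original output runs together is that find() returns a whole node set, and printing a node set concatenates the string value of every node in it. Iterating with findnodes() and reading the attributes of one file node at a time avoids that. This is only a sketch: the sample XML is reduced to the two files from the x64 output shown in the question (pairing each MD5 with a name by position is an assumption), and md5_pairs is a hypothetical helper name.

```perl
use strict;
use warnings;
use XML::XPath;

# Reduced sample: the x64 directory with the two files seen in the posted output.
my $xml = <<'XML';
<ncp_directory name="x64">
  <ncp_file name="PSCRIPT.HLP" md5="ALSOO431VHGO2x825OF80GN8RNM" />
  <ncp_file name="PSCRIPT.NTF" md5="605U9UMHOR3M1xEQIPMMRKFK3F0" />
</ncp_directory>
XML

# Hypothetical helper: returns one "MD5 = name" string per ncp_file element.
sub md5_pairs {
    my ($xml_text) = @_;
    my $xp = XML::XPath->new(xml => $xml_text);
    my @pairs;
    # "result of '//ncp_directory' wrt $document"
    foreach my $dir ($xp->findnodes('//ncp_directory')) {
        # "result of 'ncp_file' wrt $dirresult": second argument is the context node
        foreach my $file ($xp->findnodes('ncp_file', $dir)) {
            # Read each attribute from this one element only.
            push @pairs,
                $file->getAttribute('md5') . ' = ' . $file->getAttribute('name');
        }
    }
    return @pairs;
}

print "$_\n" for md5_pairs($xml);
```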
Re: Having problems accessing individual attributes in xml
by choroba (Cardinal) on Oct 22, 2010 at 10:23 UTC
    You might also be interested in XML::XSH2.
        open 866457.xml ;
        for //ncp_file {
            echo :n @md5 ' ' ;
            for ancestor::ncp_directory/@name echo :s :n (.) '/' ;
            echo @name ;
        }
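
The same idea can be sketched in plain Perl with XML::LibXML: for each ncp_file, walk up through its parent ncp_directory elements to rebuild the original relative path, keyed by MD5 name. Assumptions: XML::LibXML is available from CPAN, the sample is a trimmed version of the posted data, and md5_to_path is an illustrative name, not part of any module.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed sample in the shape of the posted data (most attributes omitted).
my $xml = <<'XML';
<ncp_directory name="$dir1">
  <ncp_directory name="CutePDFWriter">
    <ncp_directory name="Driver">
      <ncp_file name="ICONLIB.DLL" md5="D6VRKCQ4IFOSTxCLRJ9GHN6KR6J" />
      <ncp_file name="PS5UI.DLL" md5="9EJQCU5IT5H58xBCG8GIT8PQEBS" />
      <ncp_directory name="x64">
        <ncp_file name="PS5UI.DLL" md5="3REVK8VG65NGUx88P61CUHUN603" />
        <ncp_file name="PSCRIPT5.DLL" md5="BIAA93SK0Q9VRxBT7BCG7U4L2F0" />
      </ncp_directory>
    </ncp_directory>
  </ncp_directory>
</ncp_directory>
XML

# Hypothetical helper: returns a hash of MD5 name => original relative path.
sub md5_to_path {
    my ($xml_text) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml_text);
    my %path_for;
    for my $file ($doc->findnodes('//ncp_file')) {
        my @parts = ($file->getAttribute('name'));
        # Climb from the file to the document, prepending each directory name,
        # so the outermost directory ends up first.
        for (my $node = $file->parentNode; defined $node; $node = $node->parentNode) {
            next unless $node->nodeName eq 'ncp_directory';
            unshift @parts, $node->getAttribute('name');
        }
        $path_for{ $file->getAttribute('md5') } = join '/', @parts;
    }
    return %path_for;
}

my %path_for = md5_to_path($xml);
print "$_ $path_for{$_}\n" for sort keys %path_for;
```

Keying on the MD5 name rather than the file name matters here because the same file name (PS5UI.DLL) appears in more than one directory. With a map like this, restoring the tree would presumably come down to creating each directory and copying the MD5-named file to its original path, for example with the core modules File::Path and File::Copy.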
