
Some questions from beginning user of XML::LibXML and XPath

by eyepopslikeamosquito (Chancellor)
on Oct 16, 2012 at 09:05 UTC (#999263=perlquestion)
eyepopslikeamosquito has asked for the wisdom of the Perl Monks concerning the following question:

I've been able to mostly avoid XML until today. We need to update hundreds of MS vs2010 project (XML) files automatically. Tedious and error-prone to do by hand, so I'd like to write a script to do it. I've prepared an illustrative cut-down example of such a script, which changes the directory "ReleaseDLL" to "ReleaseDLL32" in various places in the XML.

Since this is my first attempt to parse XML using Perl, I welcome any advice you may have to offer. In particular:

  • After some random googling, I chose to use XML::LibXML. Is that a wise choice?
  • Given that I want to make minor updates to many XML files, is the overall approach below ok? Is there a better approach?
  • I had a hell of a time getting XPath to work (see code below). And I don't really understand what I did with namespaces, though it does appear to work. Suggestions welcome.
  • The XPath query "PropertyGroup[contains(\@Condition,'$proj')]" is inelegant in that it selects only the required PropertyGroup, leaving the code to iterate manually through each element in the group. It seems better to select the required nodes directly with a more specific XPath expression and avoid the iteration, but I have no clue how to write such a query.
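On the namespace point: in XPath 1.0 an unprefixed element name only matches elements in no namespace, so elements that inherit a default xmlns from <Project> have to be addressed through a prefix registered on the XPath context. Below is a minimal sketch of that behaviour. The namespace URI 'urn:example:msbuild' and the prefix 'msb' are placeholders chosen for illustration, not the values from the real project files; substitute whatever URI your file's xmlns attribute carries.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Cut-down document; the xmlns value is a hypothetical placeholder URI.
my $xml = <<'END_XML';
<Project xmlns="urn:example:msbuild">
  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release DLL|Win32'">
    <OutDir>.\ReleaseDLL\</OutDir>
  </PropertyGroup>
</Project>
END_XML

my $doc = XML::LibXML->load_xml( string => $xml );
my $xc  = XML::LibXML::XPathContext->new($doc);

# Bind an arbitrary prefix to the same URI the document declares.
$xc->registerNs( msb => 'urn:example:msbuild' );

# Unprefixed names look for elements in *no* namespace and find nothing here.
my @unprefixed = $xc->findnodes('//PropertyGroup');

# Prefixed names are resolved through the registered namespace.
my @prefixed = $xc->findnodes('//msb:PropertyGroup');

printf "unprefixed: %d, prefixed: %d\n", scalar @unprefixed, scalar @prefixed;
# prints "unprefixed: 0, prefixed: 1"
```

The prefix itself is arbitrary ('billygates' works as well as 'msb'); only the URI has to match the one declared in the document.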

Here is an example (cut-down) project XML file to be updated, fred.vcxproj:

    <?xml version="1.0" encoding="utf-8"?>
    <Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schem">
      <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug Tandem|x64'">
        <OutDir>.\DebugTandem\</OutDir>
        <IntDir>.\DebugTandem\</IntDir>
        <TargetName>fred$(ProjectName)</TargetName>
      </PropertyGroup>
      <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release DLL|Win32'">
        <OutDir>.\../../products/bin/ReleaseDLL\</OutDir>
        <IntDir>.\ReleaseDLL\</IntDir>
        <LinkIncremental>false</LinkIncremental>
        <TargetName>fred$(ProjectName)</TargetName>
      </PropertyGroup>
      <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release DLL|x64'">
        <OutDir>.\../../products/bin/ReleaseDLL\</OutDir>
        <IntDir>.\ReleaseDLL\</IntDir>
        <LinkIncremental>false</LinkIncremental>
        <TargetName>fred$(ProjectName)</TargetName>
      </PropertyGroup>
    </Project>

Here is my cut-down test program:

    use strict;
    use warnings;
    use XML::LibXML;
    use XML::LibXML::XPathContext;

    sub read_file_contents {
        my $fname = shift;
        open( my $fh, '<', $fname ) or die "error: open '$fname': $!\n";
        binmode $fh;
        local $/ = undef;    # slurp mode
        my $s = <$fh>;
        close($fh);
        return $s;
    }

    sub write_file_contents {
        my ( $fname, $data ) = @_;
        my $overw = -e $fname ? " (overwriting)" : "";
        print "creating '$fname'$overw...";
        open( my $fh, '>', $fname ) or die "error: open '$fname': $!";
        binmode($fh);
        print {$fh} $data or die "error: write '$fname': $!";
        close($fh);
        print "done.\n";
    }

    my $fname = shift or die "usage: $0 fname\n";
    print "xml file : '$fname'\n";
    my $xmlstring = read_file_contents($fname);

    # XXX: Hack for utf8 BOM.
    # my $UTF8_BOM = chr(0xef) . chr(0xbb) . chr(0xbf);
    my $UTF8_BOM = "";

    # XXX: Without this damned billygates namespace I could not get XPath to work.
    my $xpath_ns  = 'billygates';
    my $vs2010_ns = '';

    my $outfile  = 'fred.tmp';
    my $proj     = 'Release DLL|Win32';
    my $targ     = 'ReleaseDLL';
    my $repl     = 'ReleaseDLL32';
    my $query    = "PropertyGroup[contains(\@Condition,'$proj')]";
    my $ns_query = "//$xpath_ns:$query";

    my $parser = XML::LibXML->new();
    my $doc    = $parser->parse_string($xmlstring);
    my $xc     = XML::LibXML::XPathContext->new( $doc->documentElement() );
    $xc->registerNs( $xpath_ns => $vs2010_ns );

    print "query : $ns_query:\n";
    for my $q ( $xc->findnodes($ns_query) ) {
        print $q->nodeName(), ":\n";
        for my $c ( $q->childNodes() ) {
            my $name = $c->nodeName();
            my $val  = $c->textContent();
            print " ", ref($c), ":", $name, ":\n";
            if ( defined($val) && $val =~ m{[/\\](?:$targ)[/\\]} ) {
                print " $name: val=$val: matches '$targ'\n";
                for my $t ( $c->childNodes() ) {
                    my $v = $t->data;
                    print " ", ref($t), ":", $t->nodeName(), ":", $v, ":\n";
                    print " old:", $v, ":\n";
                    $v =~ s{([/\\])$targ([/\\])}{$1$repl$2} or die "oops";
                    $t->setData($v);
                    print " new:", $v, ":\n";
                }
            }
        }
    }
    write_file_contents( $outfile, $UTF8_BOM . $doc->toString(0) );

An example run of this program seems to more-or-less work, as shown below:

    $ perl fred.vcxproj
    xml file : 'fred.vcxproj'
    query : //billygates:PropertyGroup[contains(@Condition,'Release DLL|Win32')]:
    PropertyGroup:
     XML::LibXML::Text:#text:
     XML::LibXML::Element:OutDir:
     OutDir: val=.\../../products/bin/ReleaseDLL\: matches 'ReleaseDLL'
     XML::LibXML::Text:#text:.\../../products/bin/ReleaseDLL\:
     old:.\../../products/bin/ReleaseDLL\:
     new:.\../../products/bin/ReleaseDLL32\:
     XML::LibXML::Text:#text:
     XML::LibXML::Element:IntDir:
     IntDir: val=.\ReleaseDLL\: matches 'ReleaseDLL'
     XML::LibXML::Text:#text:.\ReleaseDLL\:
     old:.\ReleaseDLL\:
     new:.\ReleaseDLL32\:
     XML::LibXML::Text:#text:
     XML::LibXML::Element:LinkIncremental:
     XML::LibXML::Text:#text:
     XML::LibXML::Element:TargetName:
     XML::LibXML::Text:#text:
    creating 'fred.tmp' (overwriting)...done.

    $ diff fred.vcxproj fred.tmp
    2c2
    < <Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://sch">
    ---
    > <Project xmlns="" DefaultTargets="Build" ToolsVersion="4.0">
    9,10c9,10
    < <OutDir>.\../../products/bin/ReleaseDLL\</OutDir>
    < <IntDir>.\ReleaseDLL\</IntDir>
    ---
    > <OutDir>.\../../products/bin/ReleaseDLL32\</OutDir>
    > <IntDir>.\ReleaseDLL32\</IntDir>

Replies are listed 'Best First'.
Re: Some questions from beginning user of XML::LibXML and XPath
by Corion (Pope) on Oct 16, 2012 at 09:16 UTC

    If your XML is machine-generated and consistent, a line-by-line regular expression might still get you the results faster. But if your preconditions stretch across multiple lines, I'd stay with XML::LibXML.

    From my cursory reading of your XML and your source code, it seems that you are interested in the IntDir and OutDir nodes, as "only these can contain the ReleaseDLL directory" (famous last words here). I'd then make the XPath expression more explicit:

        # For the IntDir nodes
        //PropertyGroup[contains(\@Condition,'$proj')]/IntDir

        # For the OutDir nodes
        //PropertyGroup[contains(\@Condition,'$proj')]/OutDir

    If you are hell-bent on producing and using one single XPath expression, you can combine the two using the self:: axis. I prefer to avoid such stuff and just keep a list of XPath expressions to run instead:

        //PropertyGroup[contains(\@Condition,'$proj')]/*[self::IntDir or self::OutDir]
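As a rough sketch of how the combined expression might be wired into XML::LibXML for a namespaced project file: the version below adds a registered prefix to every element name and a trailing text() step so the loop receives the text nodes directly. The namespace URI 'urn:example:msbuild', the prefix 'msb', and the inline sample document are assumptions made for illustration; the regex is a slightly tightened variant of the one in the original script, not Corion's own code.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Cut-down sample; the xmlns value is a hypothetical placeholder URI.
my $xml = <<'END_XML';
<Project xmlns="urn:example:msbuild">
  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release DLL|Win32'">
    <OutDir>.\../../products/bin/ReleaseDLL\</OutDir>
    <IntDir>.\ReleaseDLL\</IntDir>
    <TargetName>fred$(ProjectName)</TargetName>
  </PropertyGroup>
  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release DLL|x64'">
    <OutDir>.\../../products/bin/ReleaseDLL\</OutDir>
  </PropertyGroup>
</Project>
END_XML

my ( $proj, $targ, $repl ) = ( 'Release DLL|Win32', 'ReleaseDLL', 'ReleaseDLL32' );

my $doc = XML::LibXML->load_xml( string => $xml );
my $xc  = XML::LibXML::XPathContext->new($doc);
$xc->registerNs( msb => 'urn:example:msbuild' );

# Select the text nodes of IntDir/OutDir inside the matching PropertyGroup only.
my $query = "//msb:PropertyGroup[contains(\@Condition,'$proj')]"
          . "/*[self::msb:IntDir or self::msb:OutDir]/text()";

my $changed = 0;
for my $text ( $xc->findnodes($query) ) {
    my $v = $text->data;
    # Rename the directory only where it is a complete path component.
    $changed++ if $v =~ s{([/\\])\Q$targ\E(?=[/\\])}{$1$repl};
    $text->setData($v);
}

print "changed $changed text node(s)\n";    # prints "changed 2 text node(s)"
print $doc->toString;
```

Only the Win32 group is touched; the x64 group keeps its original OutDir because its Condition attribute does not contain the $proj string.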
Re: Some questions from beginning user of XML::LibXML and XPath
by choroba (Chancellor) on Oct 16, 2012 at 10:10 UTC
    If you find XML::LibXML too verbose, you might like its wrapper XML::XSH2:
        #!/usr/bin/perl
        use warnings;
        use strict;

        use XML::XSH2;

        $XML::XSH2::Map::file    = '1.xml';
        $XML::XSH2::Map::project = 'Release DLL|Win32';
        $XML::XSH2::Map::old     = 'ReleaseDLL';
        $XML::XSH2::Map::new     = 'ReleaseDLL32';

        xsh << 'end ;'
            open $file ;
            register-namespace msb +ild/2003 ;
            for //msb:PropertyGroup[contains(@Condition, $project)] {
                for ( ./msb:IntDir | ./msb:OutDir ) {
                    set . xsh:subst(text(), $old, $new) ;
                }
            }
            save :b ;
        end ;
    One of the features of xsh is you can run it in interactive mode in which it is easy to test your more complicated XPath expressions.
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Some questions from beginning user of XML::LibXML and XPath
by Jim (Curate) on Oct 16, 2012 at 16:27 UTC

    I would have used regular expression pattern matching for this seemingly trivial text substitution (insertion) problem. The formatting of the XML is quite regular and straightforward. Both the string you're matching and the string you're replacing (enhancing) it with are distinct and uncomplicated. You say you "had a hell of a time getting XPath to work." I wouldn't have had the patience to try.

    You're explicitly handling both the input text and the output text as binary data rather than as Unicode text? Why?

    Here's the operation reduced to a Unicode-conformant one-liner:

        C:\Temp>perl -CiO -i.bak -pe "s{(?<=[/\\]ReleaseDLL)(?=[/\\])}{32} if m{^\s*<(?:Out|Int)Dir>}" fred.vcxproj

        C:\Temp>diff fred.vcxproj.bak fred.vcxproj
        9,10c9,10
        < <OutDir>.\../../products/bin/ReleaseDLL\</OutDir>
        < <IntDir>.\ReleaseDLL\</IntDir>
        ---
        > <OutDir>.\../../products/bin/ReleaseDLL32\</OutDir>
        > <IntDir>.\ReleaseDLL32\</IntDir>
        15,16c15,16
        < <OutDir>.\../../products/bin/ReleaseDLL\</OutDir>
        < <IntDir>.\ReleaseDLL\</IntDir>
        ---
        > <OutDir>.\../../products/bin/ReleaseDLL32\</OutDir>
        > <IntDir>.\ReleaseDLL32\</IntDir>

        C:\Temp>od -h -N 3 fred.vcxproj
        0000000000 EF BB BF
        0000000003

        C:\Temp>

    Modify the anchoring regular expression patterns to taste.

    Doing it this way avoids the needless and undesirable reordering of the attributes of the <Project> element—and a lot of other XML folderol besides. It also handles the input and output properly as Unicode text rather than as binary data and leaves the existing UTF-8 byte order mark intact.

    Modifying this one-liner to support file and folder name globs (wildcards) is left as an exercise for the reader.

    UPDATE:  With modern versions of Perl, you can use the special look-behind assertion \K to obviate the separate pattern match used to anchor the substitution (insertion) to just those lines that have <OutDir> and <IntDir> elements on them.

        C:\>perl -CiO -i.bak -pe "INIT { @ARGV = <@ARGV> } s{^\s*<(?:Out|Int)Dir>.+?[/\\]ReleaseDLL\K}{32}" */*.vcxproj

        C:\>diff Temp\fred.vcxproj.bak Temp\fred.vcxproj
        9,10c9,10
        < <OutDir>.\../../products/bin/ReleaseDLL\</OutDir>
        < <IntDir>.\ReleaseDLL\</IntDir>
        ---
        > <OutDir>.\../../products/bin/ReleaseDLL32\</OutDir>
        > <IntDir>.\ReleaseDLL32\</IntDir>
        15,16c15,16
        < <OutDir>.\../../products/bin/ReleaseDLL\</OutDir>
        < <IntDir>.\ReleaseDLL\</IntDir>
        ---
        > <OutDir>.\../../products/bin/ReleaseDLL32\</OutDir>
        > <IntDir>.\ReleaseDLL32\</IntDir>

        C:\>
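For anyone who wants to try the \K substitution outside a one-liner, here is a small self-contained sketch that applies it to in-memory lines. The sample lines are made up for illustration, and the trailing (?=[/\\]) lookahead is borrowed from the first one-liner as an assumption about what is wanted; it is not part of the \K version as posted. \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical sample lines in the style of a .vcxproj file.
my @lines = (
    '    <OutDir>.\../../products/bin/ReleaseDLL\</OutDir>',
    '    <IntDir>.\ReleaseDLL\</IntDir>',
    '    <TargetName>ReleaseDLL$(ProjectName)</TargetName>',
);

my @out;
for my $line (@lines) {
    # \K keeps everything matched so far out of the replacement,
    # so "32" is inserted right after "ReleaseDLL".
    ( my $new = $line ) =~
        s{^\s*<(?:Out|Int)Dir>.+?[/\\]ReleaseDLL\K(?=[/\\])}{32};
    push @out, $new;
}

print "$_\n" for @out;
```

With the lookahead in place the substitution is idempotent: a second pass over already-converted lines finds "ReleaseDLL3" rather than "ReleaseDLL" followed by a slash, so nothing is inserted twice. The TargetName line is left alone because it is neither an OutDir nor an IntDir element.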
Re: Some questions from beginning user of XML::LibXML and XPath
by KevinZwack (Chaplain) on Oct 16, 2012 at 16:50 UTC
Re: Some questions from beginning user of XML::LibXML and XPath
by Jenda (Abbot) on Oct 17, 2012 at 14:10 UTC
        use strict;
        use XML::Rules;

        my $filter = XML::Rules->new(
            style => 'filter',
            rules => {
                'IntDir,OutDir' => sub {
                    my ($tag,$attr,$context,$parents) = @_;
                    $attr->{_content} =~ s/\bReleaseDLL\b/ReleaseDLL32/
                        if $context->[-1] eq 'PropertyGroup'
                        && $parents->[-1]->{Condition} =~ /'Release DLL\|Win32'/;
                    return $tag => $attr;
                },
            }
        );
        $filter->filterfile($inputFilename, $outputFilename);

    Or, if you want to be correct with respect to the namespaces:

        use strict;
        use XML::Rules;

        my $filter = XML::Rules->new(
            style => 'filter',
            namespaces => { '' => 'ms', '*' => 'keep' },
            rules => {
                'ms:IntDir,ms:OutDir' => sub {
                    my ($tag,$attr,$context,$parents) = @_;
                    $attr->{_content} =~ s/\bReleaseDLL\b/ReleaseDLL32/
                        if $context->[-1] eq 'ms:PropertyGroup'
                        && $parents->[-1]->{Condition} =~ /'Release DLL\|Win32'/;
                    return $tag => $attr;
                },
            }
        );
        $filter->filterfile($inputFilename, $outputFilename);

    Not using XML::LibXML though :-)

    Enoch was right!
    Enjoy the last years of Rome.

Node Type: perlquestion [id://999263]
Approved by Ratazong
Front-paged by Corion