http://www.perlmonks.org?node_id=838120

kyle has asked for the wisdom of the Perl Monks concerning the following question:

The question came to me one afternoon, "how long would it take you to write a Perl script that would remove a specific character from within XML tags?"

I replied,

A quick and dirty (i.e., error-prone) version would take minutes.

To use a real XML parser and guarantee that I don’t corrupt the file in the process might take a couple hours because I’m actually not that familiar with XML.

I wouldn’t be surprised if you could ask the question politely at perlmonks.org and get it written for free in about a half hour.

This was a need-it-now situation, so we went with the quick and dirty:

use strict;
use warnings;

while (<>) {
    s{(<[^?<>]*\.[^<>]*>)}{
        (my $tagname = $1 ) =~ tr/.//d;
        $tagname;
    }eg;
    print;
}

The input to deal with looks like this:

<?xml version="1.0"?>
<TOP>
  <SUB>
    <THIS>STUFF</THIS>
    <SOME.TYPE>T</SOME.TYPE>
    <SOME.OTHER.TYPE>BLAH</SOME.OTHER.TYPE>
  </SUB>
</TOP>

The problem is the dots in the tag names. They need to be stripped out. The output should look like this:

<?xml version="1.0"?>
<TOP>
  <SUB>
    <THIS>STUFF</THIS>
    <SOMETYPE>T</SOMETYPE>
    <SOMEOTHERTYPE>BLAH</SOMEOTHERTYPE>
  </SUB>
</TOP>

Note that my first implementation actually took the dot out of "<?xml version="1.0"?>". Luckily I had the good sense to look at a 'diff' before I stopped debugging.
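A check like the 'diff' mentioned above can be scripted. The following is only a sketch: `strip_dots_in_tags` is a hypothetical wrapper around the quick and dirty substitution, and the third sample line (with an attribute) is an assumed input that does not appear in the real data.

```perl
use strict;
use warnings;

# Hypothetical helper: the quick-and-dirty substitution wrapped in a sub.
sub strip_dots_in_tags {
    my ($line) = @_;
    $line =~ s{(<[^?<>]*\.[^<>]*>)}{
        (my $tagname = $1) =~ tr/.//d;   # delete every dot inside the matched tag
        $tagname;
    }eg;
    return $line;
}

my @samples = (
    '<?xml version="1.0"?>',          # must come back unchanged
    '<SOME.TYPE>T</SOME.TYPE>',       # dots in the tag names must go
    '<ITEM price="1.50">x</ITEM>',    # assumed input: a dot inside an attribute value
);

print "$_\n  => ", strip_dots_in_tags($_), "\n" for @samples;
```

The first two samples come out as wanted. The third comes out as `<ITEM price="150">x</ITEM>`, because the pattern treats everything between `<` and `>` as the tag name. That is the kind of silent damage the question is worried about.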

So, monks, I seek your wisdom. What is the right way to do this so that I don't someday accidentally annihilate some important input? Any guidance you can offer would be appreciated.

Replies are listed 'Best First'.
Re: Replace XML tag names.
by ikegami (Patriarch) on May 03, 2010 at 15:24 UTC
    Using a different parser:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use XML::LibXML;

    {
        my $xml = <<'__XML__';
    <?xml version="1.0"?>
    <TOP>
      <SUB>
        <THIS>STUFF</THIS>
        <SOME.TYPE>T</SOME.TYPE>
        <SOME.OTHER.TYPE>BLAH</SOME.OTHER.TYPE>
      </SUB>
    </TOP>
    __XML__

        my $doc = XML::LibXML->new->parse_string($xml);

        # Visit every element and rename it without the dots.
        for ($doc->findnodes('//*')) {
            my $name = $_->nodeName();
            $name =~ tr/.//d;
            $_->setNodeName($name);
        }

        print($doc->toString());
    }
Re: Replace XML tag names.
by Anonymous Monk on May 03, 2010 at 15:11 UTC
    #!/usr/bin/perl --
    use strict;
    use warnings;

    use XML::Twig;

    {
        my $xml = <<'__XML__';
    <?xml version="1.0"?>
    <TOP>
      <SUB>
        <THIS>STUFF</THIS>
        <SOME.TYPE>T</SOME.TYPE>
        <SOME.OTHER.TYPE>BLAH</SOME.OTHER.TYPE>
      </SUB>
    </TOP>
    __XML__

        my $t = XML::Twig->new(
            pretty_print  => 'indented',
            twig_handlers => {
                '_all_' => sub {
                    my ( $t, $e ) = @_;
                    my $tag = $e->tag;
                    if ( $tag =~ s/\.//g ) {
                        $e->set_tag($tag);
                    }
                    $t->flush;    ## does the printing
                    return;
                },
            },
        );
        $t->xparse($xml);
        $t->purge;
        undef $t;
    }
    __END__
    <?xml version="1.0"?>
    <TOP>
      <SUB>
        <THIS>STUFF</THIS>
        <SOMETYPE>T</SOMETYPE>
        <SOMEOTHERTYPE>BLAH</SOMEOTHERTYPE>
      </SUB>
    </TOP>
Re: Replace XML tag names.
by CountZero (Bishop) on May 04, 2010 at 06:30 UTC
    As a tag name is not allowed to contain whitespace (since whitespace is used as a delimiter between the tag name and the attributes) or the > character, the character class of excluded characters should only contain these two characters. For good measure I have added ? and ! to exclude XML headers, doctype declarations, processing instructions and the like.

    s{(<[^\s?!>]+)}{(my $tagname = $1 ) =~ tr/.//d;$tagname;}eg;

    Yes, that will cause the substitution part to trigger even when there are no dots in the tag name, but it makes the whole regex much simpler and easier to understand and debug.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      I think you should also handle the <![CDATA[...]]> sections.

      $xml =~ s{(?:(<!\[CDATA\[.*?\]\]>)|(<[^\s?!>]+))}{
          $1 or do { (my $tagname = $2 ) =~ tr/.//d; $tagname; }
      }seg;

      # or ... slightly more efficient for XMLs with most tags without the dots
      $xml =~ s{(?:(<!\[CDATA\[.*?\]\]>)|(<[^\s?!>\.]+\.[^\s?!>]+))}{
          $1 or do { (my $tagname = $2 ) =~ tr/.//d; $tagname; }
      }seg;

      # and now even more efficient thanks to moving the < outside the or
      $xml =~ s{<(?:(!\[CDATA\[.*?\]\]>)|([^\s?!>\.]+\.[^\s?!>]+))}{
          '<' . ($1 or do { (my $tagname = $2 ) =~ tr/.//d; $tagname; })
      }seg;
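      To see why the CDATA alternative matters, here is a small comparison of the tag-only regex and the first CDATA-aware variant. It is just a sketch: the input string is made up for illustration and the variable names `$naive` and `$aware` are not from the thread.

```perl
use strict;
use warnings;

# Made-up input: a dotted tag wrapping a CDATA section that contains tag-like text.
my $input = '<A.B><![CDATA[<x.y> stays]]></A.B>';

# Tag-only substitution: knows nothing about CDATA.
(my $naive = $input) =~ s{(<[^\s?!>]+)}{
    (my $tagname = $1) =~ tr/.//d;
    $tagname;
}eg;

# CDATA-aware substitution: a CDATA section is matched first and put back as-is.
(my $aware = $input) =~ s{(?:(<!\[CDATA\[.*?\]\]>)|(<[^\s?!>]+))}{
    $1 or do { (my $tagname = $2) =~ tr/.//d; $tagname; }
}seg;

print "naive: $naive\n";   # naive: <AB><![CDATA[<xy> stays]]></AB>
print "aware: $aware\n";   # aware: <AB><![CDATA[<x.y> stays]]></AB>
```

      The tag-only version skips the `<![CDATA[` opener but then rewrites `<x.y` inside the section, which changes character data. The CDATA-aware version consumes the whole section in one match, so its contents are never examined.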
      Benchmark:

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

Re: Replace XML tag names.
by CountZero (Bishop) on May 04, 2010 at 06:31 UTC
    As a tag name is not allowed to contain whitespace (since whitespace is used as a delimiter between the tag name and the attributes) or the > character, the character class of excluded characters should only contain these two characters. For good measure I have added ? and ! to exclude XML headers, doctype declarations, processing instructions and the like. This regex will avoid deleting dots in attributes within tags.

    s{(<[^\s?!>]+)}{(my $tagname = $1 ) =~ tr/.//d;$tagname;}eg;

    Yes, that will cause the substitution part to trigger even when there are no dots in the tag name, but it makes the whole regex much simpler and easier to understand and debug.
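    A quick way to check the claim about attributes is to run the substitution on a tag that carries one. This is a minimal sketch: the `version="1.0"` attribute on `SOME.TYPE` is an assumed input, not something from the original data.

```perl
use strict;
use warnings;

# Assumed sample: the second line carries an attribute whose value contains a dot.
my $xml = <<'__XML__';
<?xml version="1.0"?>
<SOME.TYPE version="1.0">T</SOME.TYPE>
__XML__

# The match stops at the first whitespace or '>', so only the tag name is captured.
(my $fixed = $xml) =~ s{(<[^\s?!>]+)}{
    (my $tagname = $1) =~ tr/.//d;
    $tagname;
}eg;

print $fixed;
# Prints:
# <?xml version="1.0"?>
# <SOMETYPE version="1.0">T</SOMETYPE>
```

    The XML declaration is skipped because `?` is excluded right after `<`, and the attribute value keeps its dot because the capture ends at the space before `version`.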

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James