
Replace XML tag names.

by kyle (Abbot)
on May 03, 2010 at 14:29 UTC
kyle has asked for the wisdom of the Perl Monks concerning the following question:

The question came to me one afternoon, "how long would it take you to write a Perl script that would remove a specific character from within XML tags?"

I replied,

A quick and dirty (i.e., error prone) would take minutes.

To use a real XML parser and guarantee that I don't corrupt the file in the process might take a couple hours because I'm actually not that familiar with XML.

I wouldn't be surprised if you could ask the question politely at perlmonks.org and get it written for free in about a half hour.

This was a need-it-now situation, so we went with the quick and dirty:

use strict;
use warnings;

while (<>) {
    s{(<[^?<>]*\.[^<>]*>)}{
        (my $tagname = $1 ) =~ tr/.//d;
        $tagname;
    }eg;
    print;
}

The input to deal with looks like this:

<?xml version="1.0"?>
<TOP>
  <SUB>
    <THIS>STUFF</THIS>
    <SOME.TYPE>T</SOME.TYPE>
    <SOME.OTHER.TYPE>BLAH</SOME.OTHER.TYPE>
  </SUB>
</TOP>

The problem is the dots in the tag names. They need to be stripped out. The output should look like this:

<?xml version="1.0"?>
<TOP>
  <SUB>
    <THIS>STUFF</THIS>
    <SOMETYPE>T</SOMETYPE>
    <SOMEOTHERTYPE>BLAH</SOMEOTHERTYPE>
  </SUB>
</TOP>

Note that my first implementation actually took the dot out of "<?xml version="1.0"?>". Luckily I had the good sense to look at a 'diff' before I stopped debugging.
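The same kind of collateral damage can hit attribute values, because the pattern captures the whole tag and tr/// then deletes every dot in it. A minimal sketch, assuming a hypothetical input line with a version attribute (the real data shown above has no attributes) and a helper name, strip_dots_quick, made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the quick-and-dirty substitution from above.
sub strip_dots_quick {
    my ($line) = @_;
    $line =~ s{(<[^?<>]*\.[^<>]*>)}{
        (my $tag = $1) =~ tr/.//d;   # deletes ALL dots in the captured tag
        $tag;
    }eg;
    return $line;
}

# Assumed input: an attribute value that contains a dot.
my $in = '<SOME.TYPE version="1.0">T</SOME.TYPE>';
print strip_dots_quick($in), "\n";
# prints: <SOMETYPE version="10">T</SOMETYPE>
```

The tag name is fixed, but the attribute value silently changes from 1.0 to 10, which is exactly the sort of thing a diff would catch only if the test data happened to contain it.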

So, monks, I seek your wisdom. What is the right way to do this so that I don't someday accidentally annihilate some important input? Any guidance you can offer would be appreciated.

Re: Replace XML tag names.
by Anonymous Monk on May 03, 2010 at 15:11 UTC
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use XML::Twig;

    {
        my $xml = <<'__XML__';
    <?xml version="1.0"?>
    <TOP>
      <SUB>
        <THIS>STUFF</THIS>
        <SOME.TYPE>T</SOME.TYPE>
        <SOME.OTHER.TYPE>BLAH</SOME.OTHER.TYPE>
      </SUB>
    </TOP>
    __XML__

        my $t = XML::Twig->new(
            pretty_print  => 'indented',
            twig_handlers => {
                '_all_' => sub {
                    my ( $t, $e ) = @_;
                    my $tag = $e->tag;
                    if ( $tag =~ s/\.//g ) {
                        $e->set_tag($tag);
                    }
                    $t->flush;    ## does the printing
                    return;
                },
            },
        );
        $t->xparse($xml);
        $t->purge;
        undef $t;
    }
    __END__
    <?xml version="1.0"?>
    <TOP>
      <SUB>
        <THIS>STUFF</THIS>
        <SOMETYPE>T</SOMETYPE>
        <SOMEOTHERTYPE>BLAH</SOMEOTHERTYPE>
      </SUB>
    </TOP>
Re: Replace XML tag names.
by ikegami (Pope) on May 03, 2010 at 15:24 UTC
    Using a different parser:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::LibXML;

    {
        my $xml = <<'__XML__';
    <?xml version="1.0"?>
    <TOP>
      <SUB>
        <THIS>STUFF</THIS>
        <SOME.TYPE>T</SOME.TYPE>
        <SOME.OTHER.TYPE>BLAH</SOME.OTHER.TYPE>
      </SUB>
    </TOP>
    __XML__

        my $doc = XML::LibXML->new->parse_string($xml);
        for ($doc->findnodes('//*')) {
            my $name = $_->nodeName();
            $name =~ tr/.//d;
            $_->setNodeName($name);
        }
        print($doc->toString());
    }
Re: Replace XML tag names.
by CountZero (Bishop) on May 04, 2010 at 06:30 UTC
    As a start-tag is not allowed to contain whitespace (since whitespace is used as a delimiter between the tag and the attributes) or the > character, the character class of excluded characters should only contain these two characters. For good measure I have added ? and ! to exclude XML headers, doctype elements, processing instructions and the like.

    s{(<[^\s?!>]+)}{(my $tagname = $1 ) =~ tr/.//d;$tagname;}eg;

    Yes, that will cause the substitution part to trigger even when there are no dots in the tagname, but it makes the whole of the regex much simpler and easier to understand and debug.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      I think you should also handle the <![CDATA[...]]> sections.

      $xml =~ s{(?:(<!\[CDATA\[.*?\]\]>)|(<[^\s?!>]+))}{
          $1 or do {(my $tagname = $2 ) =~ tr/.//d;$tagname;}
      }seg;

      # or ... slightly more efficient for XMLs with most tags without the dots
      $xml =~ s{(?:(<!\[CDATA\[.*?\]\]>)|(<[^\s?!>\.]+\.[^\s?!>]+))}{
          $1 or do {(my $tagname = $2 ) =~ tr/.//d;$tagname;}
      }seg;

      # and now even more efficient thanks to moving the < outside the or
      # (the second alternative must not repeat the <, or it would only match "<<")
      $xml =~ s{<(?:(!\[CDATA\[.*?\]\]>)|([^\s?!>\.]+\.[^\s?!>]+))}{
          '<'. ($1 or do {(my $tagname = $2 ) =~ tr/.//d;$tagname;} )
      }seg;
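To see what the CDATA alternative buys, here is a small sketch built on the first substitution above. The sample string and the sub name strip_dots_keep_cdata are assumptions made up for illustration; it is a quick check, not a replacement for a real parser.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the first CDATA-aware substitution above.
sub strip_dots_keep_cdata {
    my ($xml) = @_;
    $xml =~ s{(?:(<!\[CDATA\[.*?\]\]>)|(<[^\s?!>]+))}{
        # $1 is set only for a CDATA section, which is returned unchanged;
        # otherwise $2 holds "<" plus the tag name, and its dots are removed.
        $1 or do { (my $tagname = $2) =~ tr/.//d; $tagname; }
    }seg;
    return $xml;
}

# Assumed input: something that looks like a dotted tag inside CDATA.
my $in = '<A.B><![CDATA[<X.Y> stays 1.5]]></A.B>';
print strip_dots_keep_cdata($in), "\n";
# prints: <AB><![CDATA[<X.Y> stays 1.5]]></AB>
```

Without the CDATA alternative, the <X.Y inside the section would be rewritten to <XY. Comments containing tag-like text are still not protected by this pattern.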

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

Re: Replace XML tag names.
by CountZero (Bishop) on May 04, 2010 at 06:31 UTC
    As a start-tag is not allowed to contain whitespace (since whitespace is used as a delimiter between the tag and the attributes) or the > character, the character class of excluded characters should only contain these two characters. For good measure I have added ? and ! to exclude XML headers, doctype elements, processing instructions and the like. This regex will avoid deleting dots in attributes within tags.

    s{(<[^\s?!>]+)}{(my $tagname = $1 ) =~ tr/.//d;$tagname;}eg;

    Yes, that will cause the substitution part to trigger even when there are no dots in the tagname, but it makes the whole of the regex much simpler and easier to understand and debug.
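A quick way to check the claim about attributes is to run the substitution over a tag that carries one. This is only a sketch: the input line with a version attribute and the sub name strip_dots_in_names are assumptions, not part of the original data.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution above: the match stops at the
# first whitespace or ">", so only "<" plus the tag name is ever captured.
sub strip_dots_in_names {
    my ($line) = @_;
    $line =~ s{(<[^\s?!>]+)}{
        (my $tagname = $1) =~ tr/.//d;
        $tagname;
    }eg;
    return $line;
}

# Assumed input: a dotted tag name plus an attribute value containing a dot.
my $in = '<SOME.TYPE version="1.0">T</SOME.TYPE>';
print strip_dots_in_names($in), "\n";
# prints: <SOMETYPE version="1.0">T</SOMETYPE>
```

The attribute value keeps its dot because the captured text ends at the space after the tag name. Tag-like text inside comments or CDATA sections would still be rewritten, so this remains a heuristic rather than a parser.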

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
