Re: remove xml tag and their content

by AnomalousMonk (Archbishop)
on Oct 28, 2011 at 00:46 UTC


in reply to remove xml tag and their content

Use of one of the fine CPAN XML parsing modules is almost certainly the best course. Perhaps some monk better versed than I in XML parsing can suggest appropriate choices. Novice monks often protest that these XML modules represent "too much code for my application" and want "just a simple" solution. This desire is usually a snare and a delusion: XML is complicated, and "simple" solutions are fragile and scale poorly.
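For comparison, here is a minimal sketch of what a parser-based version might look like. It assumes XML::LibXML from CPAN (the post does not name a particular module, so that choice is only an illustration) and a hypothetical sample string that is well-formed XML: one root element, consistent tag casing, no whitespace after `</`, and `&` written as `&amp;`. A conforming parser would reject input such as `<B>foo</ b >`.

```perl
use strict;
use warnings;
use XML::LibXML;   # CPAN module; assumed to be installed

# Hypothetical sample: well-formed, consistently cased, single root element.
my $xml = '<root><b>foo</b> <efg>bar</efg> "stuff" <cd>*&amp;!</cd> <x>baz</x></root>';
my $doc = XML::LibXML->load_xml(string => $xml);

# Delete <efg> elements together with their content.
$_->unbindNode for $doc->findnodes('//efg');

# Remove the <b> and <cd> tags but keep their content:
# move each child up in front of the element, then drop the empty element.
for my $node ($doc->findnodes('//b | //cd')) {
    my $parent = $node->parentNode;
    $parent->insertBefore($_, $node) for $node->childNodes;
    $parent->removeChild($node);
}

# Text that survives the two edits.
my $text = $doc->documentElement->textContent;
printf "'%s'\n", $text;
```

The trade-off is the usual one: the parser copes with attributes, nesting and entities, but it insists on well-formed input.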

However, if you are set on a simple solution, here are a couple of regex-based ones. Both can operate on strings containing embedded double-quotes and other stuff. Again, both are inherently fragile. The second approach is both more specific as to the tags to be deleted and more tolerant of tag casing and whitespace.

Update: Changed following code example to be more Windose double-quote-friendly.

>perl -wMstrict -le
"my $s = '<a>foo</a><bc>bar</bc> <def>baz</def> \"x\" <ghij>%&*</ghij>';
 print qq{'$s'};
 ;;
 $s =~ s{ < ([^>]+) > ((?: (?! </ \1) .)*) </ \1 > }{$2}xmsg;
 print qq{'$s'};
 ;;
 ;;
 $s = '<B>foo</ b > <efg>bar</efg> \"stuff\" <cD >*&!</ Cd>';
 print qq{'$s'};
 ;;
 my @tags = qw(b cd);
 my $tag = join '|', @tags;
 $tag = qr{ (?i) $tag }xms;
 use re 'eval';
 $s =~ s{ < \s* ($tag) \s* >
          ((?: (?! </ \s* \1) .)*)
          </ \s* ([^>]*) (?(?{ lc($1) ne lc($^N) }) (*F)) \s* >
        }
        {$2}xmsg;
 print qq{'$s'};
"
'<a>foo</a><bc>bar</bc> <def>baz</def> "x" <ghij>%&*</ghij>'
'foobar baz "x" %&*'
'<B>foo</ b > <efg>bar</efg> "stuff" <cD >*&!</ Cd>'
'foo <efg>bar</efg> "stuff" *&!'

Update: I just noticed the "and their content" requirement in the OP's title and output examples. Here's a two-pass regex solution (since changed to be more modular and self-documenting):

>perl -wMstrict -le
"my $s = '<B>foo</ b > <EfG>bar</eFg> \"stuff\" <cD >*&!</ Cd> <x>baz</x>';
 print qq{'$s'};
 ;;
 my $ar_tag_delete_content = [ 1, tag_group_regex(qw(efg) ) ];
 my $ar_tag_leave_content = [ 0, tag_group_regex(qw(b cd)) ];
 ;;
 for my $pass ($ar_tag_leave_content, $ar_tag_delete_content) {
   my ($delete_content, $tag) = @$pass;
   use re 'eval';
   $s =~ s{ < \s* ($tag) \s* >
            ((?: (?! </ \s* \1) .)*)
            </ \s* ([^>]*) (?(?{ lc($1) ne lc($^N) }) (*F)) \s* >
          }
          { $delete_content ? '' : $2 }xmsge;
   print qq{'$s'};
   }
 ;;
 sub tag_group_regex {
   my $alternation = join '|', @_;
   return qr{ (?i) $alternation }xms;
   }
"
'<B>foo</ b > <EfG>bar</eFg> "stuff" <cD >*&!</ Cd> <x>baz</x>'
'foo <EfG>bar</eFg> "stuff" *&! <x>baz</x>'
'foo "stuff" *&! <x>baz</x>'

Further Update:
Hey, wait a minute...
Does the foregoing even work?
Answer: No. Try it with the string  '<b>foo</B> bar <b>baz</B>' and it falls over.
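To see the failure for yourself, here is a small script form of the substitution above, cut down to the single tag `b` and run against that string. The script layout and the single-tag `$tag` are assumptions made for illustration. As far as I can tell, the likely culprit is that the `\1` in the look-ahead sits outside the `(?i)` scope, so a lowercase `b` captured from the opening tag never stops the scan at an uppercase `</B>`, and the captured body runs on to the last closing tag.

```perl
use strict;
use warnings;
use re 'eval';   # required: $tag is interpolated into a pattern holding a (?{ ... }) block

my $tag = qr{ (?i) b }xms;               # only the 'b' tag, for brevity
my $s   = '<b>foo</B> bar <b>baz</B>';   # the test string from the post

$s =~ s{ < \s* ($tag) \s* >
         ((?: (?! </ \s* \1) .)*)
         </ \s* ([^>]*) (?(?{ lc($1) ne lc($^N) }) (*F)) \s* >
       }
       {$2}xmsg;

# A correct result would be 'foo bar baz'; compare with what is printed.
printf "'%s'\n", $s;
```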

The following works better, is simpler, and also gets rid of the quite unnecessary  (?(?{ lc($1) ne lc($^N) }) (*F)) business. (But this is still quite naive and fragile code for processing XML!)

>perl -wMstrict -le
"my @strings = (
   '<B>foo</ b > <EfG>bar</eFg> \"stuff\" <cD >*&!</ Cd> <x>baz</x>',
   '<b>fee</B> P <b>fie</B> Q <efg>foe</EFG> R <efg>fum</EFG> S',
   '<b>hee</b> W <b>hie</b> X <efg>hoe</efg> Y <efg>hum</efg> Z',
   );
 ;;
 my $ar_keep_tag_content = [ 1, tag_group_regex(qw(b cd)) ];
 my $ar_kill_tag_content = [ 0, tag_group_regex(qw(efg) ) ];
 ;;
 for my $s (@strings) {
   print qq{'$s'};
   for my $pass ($ar_keep_tag_content, $ar_kill_tag_content) {
     my ($keep_content, $tag) = @$pass;
     $s =~ s{ < \s* ($tag) \s* > (.*?) </ \s* (?i) \1 \s* > }
            { $keep_content ? $2 : '' }xmsge;
     print qq{'$s'};
     }
   print '';
   }
 ;;
 sub tag_group_regex {
   my $alternation = join '|', @_;
   return qr{ (?i) $alternation }xms;
   }
"
'<B>foo</ b > <EfG>bar</eFg> "stuff" <cD >*&!</ Cd> <x>baz</x>'
'foo <EfG>bar</eFg> "stuff" *&! <x>baz</x>'
'foo "stuff" *&! <x>baz</x>'

'<b>fee</B> P <b>fie</B> Q <efg>foe</EFG> R <efg>fum</EFG> S'
'fee P fie Q <efg>foe</EFG> R <efg>fum</EFG> S'
'fee P fie Q R S'

'<b>hee</b> W <b>hie</b> X <efg>hoe</efg> Y <efg>hum</efg> Z'
'hee W hie X <efg>hoe</efg> Y <efg>hum</efg> Z'
'hee W hie X Y Z'
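To make the "naive and fragile" caveat concrete, here is a sketch that wraps the same non-greedy substitution in a helper and feeds it two inputs it was not written for: nested tags of the same name, and a tag carrying an attribute. The helper name `strip_tags` and both sample strings are hypothetical. The tag alternation is wrapped in `(?i: ... )` rather than a separate `qr{}`, which should behave the same way for these inputs.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the listed tags from $s, keeping or
# discarding their content depending on $keep_content.
sub strip_tags {
    my ($s, $keep_content, @tags) = @_;
    my $alt = join '|', map { quotemeta $_ } @tags;
    $s =~ s{ < \s* ( (?i: $alt ) ) \s* > (.*?) </ \s* (?i) \1 \s* > }
           { $keep_content ? $2 : '' }xmsge;
    return $s;
}

# Nested same-name tags: the non-greedy body stops at the FIRST closing
# tag, so the inner opening tag and the outer closing tag are left behind.
my $nested = strip_tags('<b>one <b>two</b> three</b>', 1, 'b');

# A tag with an attribute: the pattern wants '>' right after the tag
# name (optional whitespace aside), so nothing matches.
my $attr = strip_tags('<efg id="1">secret</efg>', 0, 'efg');

printf "'%s'\n", $nested;   # 'one <b>two three</b>'
printf "'%s'\n", $attr;     # '<efg id="1">secret</efg>'
```

Neither case is exotic in real XML, which is the argument for a proper parser.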
