remove xml tag and their content

by zac_carl (Acolyte)
on Oct 27, 2011 at 13:27 UTC ( [id://934149]=perlquestion )

zac_carl has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I need your help with this. I have the following data and I want to remove the XML tags from it. I want output like this:

ABCD ABCD DEFG (if the line has two site names)

"<site>ABCD</site> <timestamp>20061201182407.825Z</timestamp>"
"<site>ABCD</site> <timestamp>20061201182407.825Z</timestamp>"
"<site>ABCD</site> <timestamp>20061201182407.825Z</timestamp>"
"<site>ABCD</site> <timestamp>20061201182407.825Z</timestamp>"
"<site>ABCD</site> <timestamp>20090819195525.418Z</timestamp> <site>ABCD</site> <timestamp>20090819195525.419Z</timestamp>"
"<site>ABCD</site> <timestamp>20070925153328.933Z</timestamp>"
<site>ABCD</site><timestamp>20110214175730.608Z</timestamp> <site>ABCD</site><timestamp>20110214175730.608Z</timestamp>

Thanks for the help.

Replies are listed 'Best First'.
Re: remove xml tag and their content
by moritz (Cardinal) on Oct 27, 2011 at 13:34 UTC
Re: remove xml tag and their content
by AnomalousMonk (Archbishop) on Oct 28, 2011 at 00:46 UTC

    Use of one of the fine CPAN XML parsing modules is almost certainly the best course. Perhaps some monk better versed than I in XML parsing can suggest appropriate choices. Novice monks often protest that these XML modules represent "too much code for my application" and want "just a simple" solution. This desire is usually a snare and a delusion: XML is complicated, and "simple" solutions are fragile and scale poorly.

    However, if you are set on a simple solution, here are a couple of regex-based ones. Both can operate on strings containing embedded double-quotes and other stuff. Again, both are inherently fragile. The second approach is both more specific as to the tags to be deleted and more tolerant of tag casing and whitespace.

    Update: Changed following code example to be more Windose double-quote-friendly.

    >perl -wMstrict -le
    "my $s = '<a>foo</a><bc>bar</bc> <def>baz</def> \"x\" <ghij>%&*</ghij>';
     print qq{'$s'};
     ;;
     $s =~ s{ < ([^>]+) > ((?: (?! </ \1) .)*) </ \1 > }{$2}xmsg;
     print qq{'$s'};
     ;;
     ;;
     $s = '<B>foo</ b > <efg>bar</efg> \"stuff\" <cD >*&!</ Cd>';
     print qq{'$s'};
     ;;
     my @tags = qw(b cd);
     my $tag = join '|', @tags;
     $tag = qr{ (?i) $tag }xms;
     use re 'eval';
     $s =~ s{ < \s* ($tag) \s* > ((?: (?! </ \s* \1) .)*) </ \s* ([^>]*) (?(?{ lc($1) ne lc($^N) }) (*F)) \s* > } {$2}xmsg;
     print qq{'$s'};
    "
    '<a>foo</a><bc>bar</bc> <def>baz</def> "x" <ghij>%&*</ghij>'
    'foobar baz "x" %&*'
    '<B>foo</ b > <efg>bar</efg> "stuff" <cD >*&!</ Cd>'
    'foo <efg>bar</efg> "stuff" *&!'

    Update: I just noticed the "and their content" requirement in the OPed title and output examples. Here's a two-pass regex solution (Update: Changed to make more modular, self-documenting):

    >perl -wMstrict -le
    "my $s = '<B>foo</ b > <EfG>bar</eFg> \"stuff\" <cD >*&!</ Cd> <x>baz</x>';
     print qq{'$s'};
     ;;
     my $ar_tag_delete_content = [ 1, tag_group_regex(qw(efg) ) ];
     my $ar_tag_leave_content = [ 0, tag_group_regex(qw(b cd)) ];
     ;;
     for my $pass ($ar_tag_leave_content, $ar_tag_delete_content) {
       my ($delete_content, $tag) = @$pass;
       use re 'eval';
       $s =~ s{ < \s* ($tag) \s* > ((?: (?! </ \s* \1) .)*) </ \s* ([^>]*) (?(?{ lc($1) ne lc($^N) }) (*F)) \s* > } { $delete_content ? '' : $2 }xmsge;
       print qq{'$s'};
       }
     ;;
     sub tag_group_regex {
       my $alternation = join '|', @_;
       return qr{ (?i) $alternation }xms;
       }
    "
    '<B>foo</ b > <EfG>bar</eFg> "stuff" <cD >*&!</ Cd> <x>baz</x>'
    'foo <EfG>bar</eFg> "stuff" *&! <x>baz</x>'
    'foo "stuff" *&! <x>baz</x>'

    Further Update:
    Hey, wait a minute...
    Does the foregoing even work?
    Answer: No. Try it with the string  '<b>foo</B> bar <b>baz</B>' and it falls over.
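
    A minimal sketch of one likely reason for that failure. This is a guess at the failure mode, not a diagnosis stated in the thread. The sub names are hypothetical, and the pattern is a simplified stand-in for the one above (no `use re 'eval'`, only the tag `b`). The assumption is that the backreference `\1` inside the lookahead sits outside the scope of `(?i)`, so it compares case-sensitively and never sees `</B>` as the close of `<b>`:

```perl
use strict;
use warnings;

# Simplified stand-in for the earlier pattern: the lookahead uses a
# case-sensitive \1, so '</B' does not stop the tempered dot after '<b>'.
sub strip_b_case_sensitive {
    my ($s) = @_;
    $s =~ s{ < (b) > ((?: (?! </ \1) .)*) </ [^>]* > }{$2}xmsg;
    return $s;
}

# Same pattern with (?i) placed before \1 inside the lookahead, so the
# backreference is compared case-insensitively.
sub strip_b_case_insensitive {
    my ($s) = @_;
    $s =~ s{ < (b) > ((?: (?! </ (?i) \1) .)*) </ [^>]* > }{$2}xmsg;
    return $s;
}

my $s = '<b>foo</B> bar <b>baz</B>';
print strip_b_case_sensitive($s),   "\n";   # foo</B> bar <b>baz
print strip_b_case_insensitive($s), "\n";   # foo bar baz
```

    In the first sub the match runs from the first `<b>` to the last `</B>`, so the inner tags survive in the output.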

    The following works better, is simpler, and also gets rid of the quite unnecessary  (?(?{ lc($1) ne lc($^N) }) (*F)) business. (But this is still quite naive and fragile code for processing XML!)

    >perl -wMstrict -le
    "my @strings = (
       '<B>foo</ b > <EfG>bar</eFg> \"stuff\" <cD >*&!</ Cd> <x>baz</x>',
       '<b>fee</B> P <b>fie</B> Q <efg>foe</EFG> R <efg>fum</EFG> S',
       '<b>hee</b> W <b>hie</b> X <efg>hoe</efg> Y <efg>hum</efg> Z',
       );
     ;;
     my $ar_keep_tag_content = [ 1, tag_group_regex(qw(b cd)) ];
     my $ar_kill_tag_content = [ 0, tag_group_regex(qw(efg) ) ];
     ;;
     for my $s (@strings) {
       print qq{'$s'};
       for my $pass ($ar_keep_tag_content, $ar_kill_tag_content) {
         my ($keep_content, $tag) = @$pass;
         $s =~ s{ < \s* ($tag) \s* > (.*?) </ \s* (?i) \1 \s* > } { $keep_content ? $2 : '' }xmsge;
         print qq{'$s'};
         }
       print '';
       }
     ;;
     sub tag_group_regex {
       my $alternation = join '|', @_;
       return qr{ (?i) $alternation }xms;
       }
    "
    '<B>foo</ b > <EfG>bar</eFg> "stuff" <cD >*&!</ Cd> <x>baz</x>'
    'foo <EfG>bar</eFg> "stuff" *&! <x>baz</x>'
    'foo "stuff" *&! <x>baz</x>'

    '<b>fee</B> P <b>fie</B> Q <efg>foe</EFG> R <efg>fum</EFG> S'
    'fee P fie Q <efg>foe</EFG> R <efg>fum</EFG> S'
    'fee P fie Q R S'

    '<b>hee</b> W <b>hie</b> X <efg>hoe</efg> Y <efg>hum</efg> Z'
    'hee W hie X <efg>hoe</efg> Y <efg>hum</efg> Z'
    'hee W hie X Y Z'
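
    For the data in the original question, the same keep/kill passes might be packaged as a script rather than a command-line one-liner. This is only a sketch under assumptions: `site_names` is a hypothetical helper, `site` is taken as the tag whose content is kept and `timestamp` as the tag removed with its content, and the surrounding double quotes are simply dropped. It is as naive and fragile as the one-liner it is based on.

```perl
use strict;
use warnings;

# Build a case-insensitive alternation from a list of tag names,
# following tag_group_regex() from the one-liner above.
sub tag_group_regex {
    my $alternation = join '|', map { quotemeta } @_;
    return qr{ (?i) (?: $alternation ) }xms;
}

# Hypothetical helper: keep the content of <site>, delete <timestamp>
# together with its content, then tidy quotes and whitespace.
sub site_names {
    my ($line) = @_;
    my @passes = (
        [ 1, tag_group_regex('site') ],        # keep content
        [ 0, tag_group_regex('timestamp') ],   # kill content
    );
    for my $pass (@passes) {
        my ($keep_content, $tag) = @$pass;
        $line =~ s{ < \s* ($tag) \s* > (.*?) </ \s* (?i) \1 \s* > }
                  { $keep_content ? $2 : '' }xmsge;
    }
    $line =~ tr/"//d;                      # drop the double quotes
    return join ' ', split ' ', $line;     # collapse runs of whitespace
}

my @lines = (
    '"<site>ABCD</site> <timestamp>20061201182407.825Z</timestamp>"',
    '"<site>ABCD</site> <timestamp>20090819195525.418Z</timestamp> '
        . '<site>ABCD</site> <timestamp>20090819195525.419Z</timestamp>"',
);
print site_names($_), "\n" for @lines;
# prints:
# ABCD
# ABCD ABCD
```

    Nested or malformed tags will still defeat this, which is the usual argument for a real XML parser.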
Re: remove xml tag and their content
by AcidHawk (Vicar) on Oct 27, 2011 at 15:14 UTC

    More info is needed

    Is the data you supplied all there is, or is it part of a valid XML document? What I mean is, will something like Internet Explorer read it without errors? The example you supplied will only be seen as a text file.

    Also, not all lines are surrounded with quotes. Is this intentional or a typo?

    Of all the things I've lost in my life, it's my mind I miss the most.
