http://www.perlmonks.org?node_id=743443

kgullekson has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I've got basic Perl skills and am even worse at understanding the XML::Parser so hope someone can help.

I've got an xml file that has lines like:
<customBucket></customBucket>
<customDimensionName>Strategy</customDimensionName>
<customBucketValueString>Test1</customBucketValueString>
<customBucketEnd></customBucketEnd>
<customBucket></customBucket>
<customDimensionName>SubStrategy</customDimensionName>
<customBucketValueString>Test2</customBucketValueString>
<customBucketEnd></customBucketEnd>

My Perl script currently uses Start, End and Char handlers. The Start and End handlers just handle things like indentation, and the Char handler manipulates tag names and values in certain cases. I need to be able to suppress printing the lines from the customBucket tag through the customBucketEnd tag, based on the value of customDimensionName. So, for example, if the value of customDimensionName is "Strategy", I don't want to print lines 1-4 in my example above.

I'm thinking that somehow this has to be done in the Start handler but I can't see a way of knowing what's following the customBucket element without actually getting into the Char handler but by then, I've already printed out my customBucket line.

I was thinking that I could use some sort of stack to push each element on and then only pop off when I knew the 4 lines were something I wanted to print. This just seemed messy.

The script I have currently does a lot of other stuff so I can't really change the functionality too much without risking messing up something else.

Apologies if I've not used the right terminology. Hope the above makes sense.

Thanks for any help!

Replies are listed 'Best First'.
Re: How to exclude certain blocks of an XML file using Perl and XML::Parser
by toolic (Bishop) on Feb 12, 2009 at 23:15 UTC
    If you have no control over your XML format and you are stuck using XML::Parser, then you can ignore the rest of this post. Otherwise, please consider the following.

    While I'm sure what you have is valid XML, it's too bad it isn't structured differently. Had your customDimensionName and customBucketValueString elements been children of customBucket, you could have filtered your information as follows:

    use strict;
    use warnings;
    use XML::Twig;

    my $xmlStr = <<XML;
    <foo>
      <customBucket>
        <customDimensionName>Strategy</customDimensionName>
        <customBucketValueString>Test1</customBucketValueString>
      </customBucket>
      <customBucket>
        <customDimensionName>SubStrategy</customDimensionName>
        <customBucketValueString>Test2</customBucketValueString>
      </customBucket>
    </foo>
    XML

    my $t = XML::Twig->new();
    $t->parse($xmlStr);

    for my $bucket ($t->root()->children('customBucket')) {
        if ($bucket->first_child('customDimensionName')->text() ne 'Strategy') {
            print 'customDimensionName ', $bucket->first_child('customDimensionName')->text(), "\n";
            print 'customBucketValueString ', $bucket->first_child('customBucketValueString')->text(), "\n";
        }
    }

    __END__

    customDimensionName SubStrategy
    customBucketValueString Test2

    I abandoned using XML::Parser once I discovered XML::Twig, which I find to be easier to understand and use.

      Thanks for the info. Yeah, I know the format is a bit bizarre. Plus, I just checked and I don't have XML::Twig available (it looks like it has much more functionality). I appreciate your response.

        XML::Twig is pure Perl, so you can just stick Twig.pm (from the XML-Twig distribution) somewhere, add a use lib 'somewhere'; in your code, and use it.
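
A minimal sketch of that setup, assuming Twig.pm was copied into a directory named lib next to the script (the directory name is hypothetical). Because `use XML::Twig` looks for the file XML/Twig.pm under each library directory, the copy would need to sit at lib/XML/Twig.pm. XML::Twig is itself built on XML::Parser, which the original script already uses.

```perl
use strict;
use warnings;
use FindBin qw($Bin);

# Hypothetical layout: Twig.pm copied to <script dir>/lib/XML/Twig.pm
use lib "$Bin/lib";

# Load the module at run time and report where it was found,
# instead of dying at compile time if the copy is missing.
my $have_twig = eval { require XML::Twig; 1 };

if ($have_twig) {
    print "XML::Twig loaded from $INC{'XML/Twig.pm'}\n";
} else {
    print "XML::Twig not found; expected $Bin/lib/XML/Twig.pm\n";
}
```

Once the check passes, a plain `use XML::Twig;` after the `use lib` line should work the same way.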

Re: How to exclude certain blocks of an XML file using Perl and XML::Parser
by GrandFather (Saint) on Feb 12, 2009 at 23:21 UTC

    Better structured XML would help! However you can do what you want by using what is effectively a state machine and deferring output until you can make the decision. Consider:

    use strict;
    use warnings;
    use XML::Parser;

    my $xmlStr = <<'END_XML';
    <doc>
    <customBucket></customBucket>
    <customDimensionName>Strategy</customDimensionName>
    <customBucketValueString>Test1</customBucketValueString>
    <customBucketEnd></customBucketEnd>
    <customBucket></customBucket>
    <customDimensionName>SubStrategy</customDimensionName>
    <customBucketValueString>Test2</customBucketValueString>
    <customBucketEnd></customBucketEnd>
    </doc>
    END_XML

    my %state = (printing => 1);
    my $parser = XML::Parser->new (Style => 'Stream', Pkg => '::main');

    $parser->parse ($xmlStr);

    sub StartTag {
        my ($p, $elt) = @_;
        my $eltStr = $_;

        if ($elt eq 'customBucket') {
            $state{printing} = undef;
        } elsif ($elt eq 'customDimensionName') {
            $state{capturing} = 1;
            $state{capture} = '';
        }

        if ($state{printing}) {
            print $eltStr;
        } else {
            $state{text} .= $eltStr;
        }
    }

    sub Text {
        my ($p) = @_;
        my $str = $_;

        if ($state{printing}) {
            print $str;
        } else {
            $state{text} .= $str;
            $state{capture} .= $str if $state{capturing};
        }
    }

    sub EndTag {
        my ($p, $elt) = @_;
        my $eltStr = $_;

        if ($state{printing}) {
            print $eltStr;
        } else {
            $state{text} .= $eltStr;
        }

        if ($elt eq 'customBucketEnd') {
            if ($state{capture} !~ '^Strategy$') {
                print "$state{text}";
            }

            $state{text} = '';
            $state{capture} = '';
            $state{printing} = 1;
        } elsif ($elt eq 'customDimensionName') {
            $state{capturing} = undef;
        }
    }

    Prints:

    <doc>
    <customBucket></customBucket>
    <customDimensionName>SubStrategy</customDimensionName>
    <customBucketValueString>Test2</customBucketValueString>
    <customBucketEnd></customBucketEnd>
    </doc>
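
Since the original script uses explicit Start, End and Char handlers rather than the Stream style, the same deferred-output idea might be sketched with the Handlers option as follows. This is only an illustration under assumptions: the names %skip, emit, on_start, on_char, on_end, filtered_output and filter_xml are hypothetical, and the existing handlers' indentation and renaming logic is left out. Only XML::Parser->new, parse and recognized_string are library calls.

```perl
use strict;
use warnings;

my %skip = (Strategy => 1);     # customDimensionName values to suppress
my ($buffer, $name, $in_name);  # $buffer is undef while outside a bucket
my $out = '';                   # everything that would be printed

# Route a piece of markup to the pending buffer or straight to the output.
sub emit {
    my ($raw) = @_;
    if (defined $buffer) { $buffer .= $raw } else { $out .= $raw }
}

sub on_start {
    my ($elt, $raw) = @_;
    if ($elt eq 'customBucket') { $buffer = ''; $name = '' }  # start deferring
    $in_name = 1 if $elt eq 'customDimensionName';
    emit($raw);
}

sub on_char {
    my ($text, $raw) = @_;
    $name .= $text if $in_name && defined $buffer;  # remember the dimension name
    emit($raw);
}

sub on_end {
    my ($elt, $raw) = @_;
    $in_name = 0 if $elt eq 'customDimensionName';
    emit($raw);
    if ($elt eq 'customBucketEnd' && defined $buffer) {
        $out .= $buffer unless $skip{$name};  # decide once the block is complete
        undef $buffer;
    }
}

sub filtered_output { return $out }

# Wire the subs above into XML::Parser. The module is loaded lazily here
# so that the handler subs can also be exercised without a parser.
sub filter_xml {
    my ($xml) = @_;
    require XML::Parser;
    ($out, $buffer, $name, $in_name) = ('', undef, undef, 0);
    my $parser = XML::Parser->new(Handlers => {
        Start => sub { my ($p, $elt)  = @_; on_start($elt, $p->recognized_string) },
        End   => sub { my ($p, $elt)  = @_; on_end($elt, $p->recognized_string) },
        Char  => sub { my ($p, $text) = @_; on_char($text, $p->recognized_string) },
    });
    $parser->parse($xml);
    return filtered_output();
}
```

Calling `print filter_xml($xmlStr);` with the document above would be the way to try it. Whitespace that follows a suppressed block is outside the buffer, so it still reaches the output.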

    True laziness is hard work
Re: How to exclude certain blocks of an XML file using Perl and XML::Parser
by mirod (Canon) on Feb 13, 2009 at 09:04 UTC

    If you can go the XML::Twig way, here is a piece of code that would work:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use XML::Twig;

    XML::Twig->new( twig_handlers => { customBucketEnd => \&end_bucket },
                    pretty_print  => 'indented',
                  )
             ->parse( \*DATA)
             ->flush;

    # assumes the customBucket and customDimensionName elements are ALWAYS present
    sub end_bucket {
        my( $t, $end)= @_;
        my @bucket_content;
        if( $end->prev_sibling( 'customDimensionName')->text eq 'Strategy') {
            while(1) {
                $end->cut;
                if( $end->tag eq 'customBucket') { last; }
                $end= $end->former_prev_sibling; # an obscure method that comes in handy sometimes
            }
        }
        else {
            $t->flush;
        }
    }

    __DATA__
    <doc>
    <customBucket></customBucket>
    <customDimensionName>Strategy</customDimensionName>
    <customBucketValueString>Test1</customBucketValueString>
    <customBucketEnd></customBucketEnd>
    <customBucket></customBucket>
    <customDimensionName>SubStrategy</customDimensionName>
    <customBucketValueString>Test2</customBucketValueString>
    <customBucketEnd></customBucketEnd>
    </doc>