Maximum parsing depth with XML::Parser?

ar0n has asked for the wisdom of the Perl Monks concerning the following question:

I have an XML file which contains different tags, among them a <txt> tag.

<main>
    <txt>
        this is text. don't forget to look at <a href="http://perlmonks.org">perlmonks</a>
        don't you just hate <blink>blinking</blink> text?
    </txt>
</main>

Now what I want is to be able to tell the XML::Parser object to ignore any tags that
appear within the <txt> elements. I want this because HTML elements such as
<img src="img.png"> or <hr> appear within the <txt> tags, and those,
as we all know, are not valid XML.

So, is there a way for XML::Parser to not parse content below a
certain depth, or better yet, ignore content within certain tags?

By ignore, I mean to pass the content along as a CDATA string to the
appropriate handler, without processing it further.

I've read the whole manpage on XML::Parser, and the only thing that even comes close
seems to be the Stream_Delimiter option, which on encountering a certain string stops
parsing totally.

Thanks.

-- ar0n | Just Another Perl Joe

RE: Maximum parsing depth with XML::Parser? by tilly (Archbishop) on Aug 06, 2000 at 05:38 UTC
If you have an XML file with unclosed elements such as <hr>, then it is not well-formed XML and should not be parsable by any conforming XML tool. Therefore XML::Parser both can and will have problems with it. Instead you will need either to roll your own parser or to properly escape the text before you place it within the <txt> tag, so that you have well-formed XML.
Re: Maximum parsing depth with XML::Parser? by reptile (Monk) on Aug 06, 2000 at 07:17 UTC
CDATA! I think XML::Parser recognizes it. Something like this, I believe:

`<txt><![CDATA[ anything in here should be ok like <tags> & entities and will be taken literally ]]></txt>`

Is that the right tag? Anyway, that essentially tells the parser not to parse anything inside that. I'm pretty sure it works with XML::Parser but you'll have to test it to be sure.

`local $_ = "0A72656B636148206C72655020726568746F6E41207473754A"; while(s/..$//) { print chr(hex($&)) }`
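For anyone who wants to test the CDATA suggestion above, here is a minimal sketch. It assumes XML::Parser is installed; the sample document and the variable names `@tags` and `$text` are made up for illustration. It records which element names the parser reports and what text reaches the Char handler.

```perl
use strict;
use warnings;
use XML::Parser;

# Sample input (an assumption, modelled on the question): HTML-ish markup,
# including an unclosed <hr>, sits inside a CDATA section.
my $xml = '<main><txt><![CDATA[see <a href="http://perlmonks.org">perlmonks</a> & <hr>]]></txt></main>';

my @tags;        # element names reported by the Start handler
my $text = '';   # everything passed to the Char handler

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub { my ($expat, $name)  = @_; push @tags, $name },
        # Char may be called several times for one run of text, so append.
        Char  => sub { my ($expat, $chunk) = @_; $text .= $chunk },
    },
);
$parser->parse($xml);

print "elements seen: @tags\n";
print "text: $text\n";
```

If this behaves as expected, only `main` and `txt` show up as elements, and the `<a>` and `<hr>` markup arrives as ordinary character data.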
Re: Maximum parsing depth with XML::Parser? by Speedfreak (Sexton) on Aug 06, 2000 at 17:20 UTC
You are allowed this sort of data in tags as long as it's marked as CDATA (character data, not to be parsed). A CDATA section takes the format <![CDATA[...]]> where the ... is the text data you don't want to be parsed. Therefore, your XML should look like:

<main>
    <txt>
        <![CDATA[
        this is text. don't forget to look at <a href="http://perlmonks.org">perlmonks</a>
        don't you just hate <blink>blinking</blink> text?
        ]]>
    </txt>
</main>

It's not a problem with the parser, more the format of your XML.

There is another workaround: convert all the <'s in the text that is not to be parsed to &lt; which will look like:

<main>
    <txt>
        this is text. don't forget to look at &lt;a href="http://perlmonks.org">perlmonks&lt;/a>
        don't you just hate &lt;blink>blinking&lt;/blink> text?
    </txt>
</main>

This basically stops any tags being seen by the parser. Not great but it works.

Oh, and if you're using CDATA sections, don't forget to deal with anything in your data that looks like the closing ]]>. It can't be escaped inside a CDATA section, so it has to be split across two sections.

- Jed
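Both workarounds can be applied mechanically before the file ever reaches the parser. The following is only a sketch in plain Perl with no modules; the helper names `escape_markup`, `wrap_cdata` and `protect_txt` are hypothetical, and `protect_txt` assumes that `<txt>` carries no attributes, is never nested, and that the literal string `</txt>` never occurs inside the body.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML character data.
sub escape_markup {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # must come first, or we would re-escape our own output
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Wrap text in a CDATA section. A literal "]]>" cannot appear inside one,
# so it is split across two adjacent sections.
sub wrap_cdata {
    my ($text) = @_;
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return "<![CDATA[$text]]>";
}

# Hypothetical pre-processing step: protect the body of every <txt> element.
# See the assumptions listed in the text above.
sub protect_txt {
    my ($xml) = @_;
    $xml =~ s{<txt>(.*?)</txt>}{'<txt>' . wrap_cdata($1) . '</txt>'}gse;
    return $xml;
}

my $raw = '<main><txt>rule: <hr> and a ]]> marker</txt></main>';
print protect_txt($raw), "\n";
print escape_markup('<hr> & more'), "\n";
```

The regex in `protect_txt` is deliberately naive. It is there to show the idea, not to stand in for a real tokenizer, and it will misfire on input that breaks the stated assumptions.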
Re: Maximum parsing depth with XML::Parser? by davorg (Chancellor) on Aug 06, 2000 at 11:41 UTC
I agree with what tilly says about XML::Parser not parsing invalid XML, but you might be able to work around it. It depends which XML::Parser style you're using. If you're using the 'Subs' style or defining your own handlers, then it might be possible to set a global flag within the subroutine that recognises a <txt> element. You could then check for this flag in other subs and do nothing if it's set. You'd reset the flag when you see the end of the <txt> tag.

--
<http://www.dave.org.uk>

European Perl Conference - Sept 22/24 2000, ICA, London
<http://www.yapc.org/Europe/>
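A rough sketch of the flag idea, using hand-written handlers rather than the 'Subs' style. It assumes XML::Parser is installed; `collect_txt` is a made-up name, and a depth counter stands in for the global flag so that nested elements are tracked correctly. Instead of doing nothing while the flag is set, the handlers append the raw markup (via the Expat object's `recognized_string` method) to a buffer, which comes close to what the original question asked for. Note that this only helps when the markup inside `<txt>` is itself well-formed: an unclosed `<hr>` will still make the parser die. Empty-element tags such as `<br/>` are not specially handled here and may need extra care.

```perl
use strict;
use warnings;
use XML::Parser;

# Return the raw content of each <txt> element as a single string.
sub collect_txt {
    my ($xml) = @_;
    my $depth = 0;   # greater than 0 while we are inside a <txt> element
    my @chunks;      # one entry per <txt> element

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $name) = @_;
                if ($depth) {
                    # Inside <txt>: keep the tag as text instead of acting on it.
                    $chunks[-1] .= $expat->recognized_string;
                    $depth++;
                }
                elsif ($name eq 'txt') {
                    push @chunks, '';
                    $depth = 1;
                }
            },
            End => sub {
                my ($expat, $name) = @_;
                return unless $depth;
                $depth--;
                # Still inside <txt> after closing an inner element: keep its end tag.
                $chunks[-1] .= $expat->recognized_string if $depth;
            },
            Char => sub {
                my ($expat, $text) = @_;
                $chunks[-1] .= $text if $depth;
            },
        },
    );
    $parser->parse($xml);
    return @chunks;
}

my @found = collect_txt(
    '<main><txt>look at <a href="http://perlmonks.org">perlmonks</a> now</txt></main>'
);
print "$_\n" for @found;
```

Text passed to the Char handler has already had its entities expanded, so the collected string is not guaranteed to be byte-for-byte identical to the source.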

