PerlMonks  

Maximum parsing depth with XML::Parser?

by ar0n (Priest)
on Aug 06, 2000 at 05:25 UTC

ar0n has asked for the wisdom of the Perl Monks concerning the following question:

I have an XML-file which contains different tags, among them a <txt> tag.
<main> <txt> this is text. don't forget to look at <a href="http://perlmonks.org">perlmonks</a> don't you just hate <blink>blinking</blink> text? </txt> </main>
Now what I want is to be able to tell the XML::Parser object to ignore any tags that
appear within the <txt> elements. I want this because HTML elements appear within
the <txt> tags, like <img src="img.png"> or <hr>, which,
as we all know, are not well-formed XML because they are never closed.

So, is there a way for XML::Parser to not parse content below a
certain depth, or better yet, ignore content within certain tags?

By ignore, I mean pass the content along as a CDATA string to the
appropriate handler, without processing it further.

I've read the whole manpage on XML::Parser, and the only thing that even comes close
seems to be the Stream_Delimiter option, which on encountering a certain string stops
parsing totally.

Thanks.

-- ar0n | Just Another Perl Joe

Replies are listed 'Best First'.
RE: Maximum parsing depth with XML::Parser?
by tilly (Archbishop) on Aug 06, 2000 at 05:38 UTC
    If you have an XML file with unclosed elements, then it is not well-formed XML and should not be parsable by any conforming XML parser.

    Therefore XML::Parser both can and will have problems with it.

    Instead you will either need to roll your own parser or else properly escape the text before you place it within the text tag so you have valid XML.
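    The escaping route can be sketched in a few lines of plain Perl. The helper name escape_xml_text is made up for this illustration and is not part of XML::Parser; it only covers the three characters that matter in element text, and assumes you control the code that writes the file.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): escape the characters that
# are special in XML element text. '&' is handled first so that the
# entities produced by the later substitutions are not escaped again.
sub escape_xml_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Example input containing an unclosed <hr>, as in the question.
my $html = q{look at <a href="http://perlmonks.org">perlmonks</a> <hr>};
my $xml  = '<main><txt>' . escape_xml_text($html) . '</txt></main>';
print "$xml\n";
```

    The parser then hands the original markup back to the Char handler as ordinary text, with the entities already expanded.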

Re: Maximum parsing depth with XML::Parser?
by reptile (Monk) on Aug 06, 2000 at 07:17 UTC

    CDATA! I think XML::Parser recognizes it. Something like this, I believe:

    <txt><![CDATA[ anything in here should be ok like <tags> & entities and will be taken literally ]]></txt>

    Is that the right tag? Anyway, that essentially tells the parser not to parse anything inside that. I'm pretty sure it works with XML::Parser but you'll have to test it to be sure.
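    For anyone who wants to test it, here is a minimal sketch, assuming the file can be changed so that the <txt> content sits in a CDATA section. The sample document and the variable names are invented for the example; the Char handler and in_element are standard XML::Parser calls.

```perl
use strict;
use warnings;
use XML::Parser;

# Sample document (made up for this sketch): the HTML, including an
# unclosed <hr> and a bare '&', is wrapped in a CDATA section.
my $xml = <<'XML';
<main><txt><![CDATA[look at <a href="http://perlmonks.org">perlmonks</a> <hr> & more]]></txt></main>
XML

my $text = '';
my $parser = XML::Parser->new(
    Handlers => {
        # Char can fire more than once for a single run of text,
        # so append rather than assign.
        Char => sub {
            my ($expat, $string) = @_;
            $text .= $string if $expat->in_element('txt');
        },
    },
);

$parser->parse($xml);
print "$text\n";
```

    The tags inside the section never reach the Start or End handlers; they arrive as plain character data.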

    local $_ = "0A72656B636148206C72655020726568746F6E41207473754A"; while(s/..$//) { print chr(hex($&)) }

Re: Maximum parsing depth with XML::Parser?
by Speedfreak (Sexton) on Aug 06, 2000 at 17:20 UTC

    You are allowed this sort of data in tags as long as it's marked as CDATA (character data, not to be parsed).

    A CDATA section takes the format <![CDATA[...]]> where the ... is the text data you don't want to be parsed.

    Therefore, your XML should look like:

    <main> <txt> <![CDATA[ this is text. don't forget to look at <a href="http://perlmonks.org">perlmonks</a> don't you just hate <blink>blinking</blink> text? ]]> </txt> </main>

    There is another workaround: convert all the <'s in the text not to be parsed to &lt; which will look like:

    <main> <txt> this is text. don't forget to look at &lt;a href="http://perlmonks.org">perlmonks&lt;/a> don't you just hate &lt;blink>blinking&lt;/blink> text? </txt> </main>

    This basically stops any tags being seen by the parser. Not great, but it works.

    It's not a problem with the parser, more the format of your XML.

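    Oh, and if you're using CDATA sections, remember that the data itself must not contain the end marker ]]>. Nothing can be escaped inside CDATA, so wherever that sequence occurs you have to end the section and start a new one in the middle of it.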
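    CDATA has no escape mechanism of its own, so one common way to cope with a literal ]]> in the data is to split the section in two at that point. A minimal sketch follows; the name cdata_wrap is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap arbitrary text in CDATA. Any "]]>" in the
# text is split so that "]]" ends one section and ">" starts the next;
# the terminator therefore never appears intact inside a section.
sub cdata_wrap {
    my ($text) = @_;
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $text . ']]>';
}

print cdata_wrap('if (a[b[0]]> 1) { <hr> }'), "\n";
```

    A parser reads the two adjacent sections back as one continuous run of character data, so the original text is recovered unchanged.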

    - Jed

Re: Maximum parsing depth with XML::Parser?
by davorg (Chancellor) on Aug 06, 2000 at 11:41 UTC

    I agree with what tilly says about XML::Parser not parsing invalid XML, but you might be able to work around it.

    It depends which XML::Parser style you're using. If you're using the 'Subs' style or defining your own handlers, then it might be possible to set a global flag within the subroutine that recognises the <txt> element. You could then check for this flag in the other subs and do nothing if it's set. You'd reset the flag when you see the end of the <txt> tag.
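    A rough sketch of the flag idea with hand-written handlers is below. Instead of doing nothing while the flag is set, it copies what the parser recognised back into a string, so the content of <txt> comes out as one piece. The variable names and sample document are invented for the example, and it assumes <txt> is never nested inside itself. It only helps when the markup inside <txt> is well-formed: an unclosed <hr> or <img> will still make the parser die.

```perl
use strict;
use warnings;
use XML::Parser;

# Sample document (made up): the tags inside <txt> are properly closed.
my $xml = <<'XML';
<main><txt>look at <a href="http://perlmonks.org">perlmonks</a>, not <blink>blinking</blink> text</txt></main>
XML

my $in_txt = 0;     # the flag: true while we are inside <txt>
my $raw    = '';    # text and markup collected from inside <txt>

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element) = @_;
            if    ($in_txt)            { $raw .= $expat->recognized_string }
            elsif ($element eq 'txt')  { $in_txt = 1 }
        },
        End => sub {
            my ($expat, $element) = @_;
            return unless $in_txt;
            if ($element eq 'txt') { $in_txt = 0 }
            else                   { $raw .= $expat->recognized_string }
        },
        Char => sub {
            my ($expat) = @_;
            # recognized_string gives the text as written in the document
            $raw .= $expat->recognized_string if $in_txt;
        },
    },
);

$parser->parse($xml);
print "$raw\n";
```

    If you really do want a depth limit rather than a named element, the same handlers could test $expat->depth instead of the flag.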

    --
    <http://www.dave.org.uk>

    European Perl Conference - Sept 22/24 2000, ICA, London
    <http://www.yapc.org/Europe/>
