CDATA-like "literal" tags in XML-like data

by John M. Dlugosz (Monsignor)
on Nov 09, 2001 at 04:03 UTC

John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

How can I process "lazy" XML like our <code> tags? The best solution would work within the Twig framework, but here is a stand-alone preprocessor that does it.

The concept demo below scans the proto-XML and escapes the special characters inside the elements that are supposed to be literal.

I thought about using Parse::RecDescent, or other parsing technology, but it should be a simple problem. I'm wondering if this general idea, of using cascaded RE's with a continuing "pos", can be improved.

use strict;
use warnings;

sub is_literal ($$)
{
    my ($name, $attrs)= @_;
    return ($name eq 'listing') || ($name eq 'signature');  # simple demo.
    # change this to analyse $name and $attrs to decide whether to treat this literally.
}

sub escape_out ($)
{
    my $passage= shift;
    $passage =~ s/&/&amp;/g;
    $passage =~ s/</&lt;/g;
    return "[[[* $passage *]]]";  # [[[]]] to visibly show that the right "bite" was taken.
}

sub scan ($)
{
    my @passages;
    my $line= shift;
    # first pass: note what sections need treatment, without actually modifying the string.
    # modifying the string would mess up the "pos" used by the RE's.
    while ($line =~ m/<\s*(\w+)([^>]*)>/g) {  # for every start tag...
        my $startpos= pos($line);
        my $name= $1;
        if (is_literal ($name, $2)) {
            # if targeted, find the matching end tag using simple pattern (ignoring other stuff).
            # this skips that passage for the continued search of all start tags.
            $line =~ m/<\/$name>/g;
            my $endpos= pos($line);
            unshift @passages, [$startpos, $endpos-(length($name)+3)];
        }
    }
    # second pass: process the sections noted above, from right-to-left so
    # positions don't change.
    foreach my $range (@passages) {
        my ($start, $end)= @$range;
        my $length= $end-$start;
        substr($line, $start, $length)= escape_out (substr($line, $start, $length));
        # is there an easier way to do that without substr'ing twice?
    }
    print $line;
}

my $testdata= <<'EOF';
<method name="mainloop">
  <signature virtual="1">int mainloop (ratwin::message::MSG&)</signature>
  <P>This is the canonical logic of the message pump. It looks approximately like this:</P>
  <listing>
use & and <things> in here.

MSG msg;
while ( GetMessage(msg) ) {
   if (msg.hwnd == 0) thread_message (msg);
   else {
      if (!pre_translate (msg)) {  // check IsDialog, TranslateAccelerator
         if (!translate_key_even(msg))  // Win32 TranslateMessage
            DispatchMessage(msg);
         }
      }
   }
return (msg.wParam);
  </listing>
  <P>Override this if you need to customize this beyond the point provided for by the virtual functions provided for the individual steps.</P>
</method>
EOF

scan ($testdata);

Re: CDATA-like "literal" tags in XML-like data
by danger (Priest) on Nov 09, 2001 at 12:09 UTC

    Well, rather than comment on the potential fragility in such parsing schemes, I'll suggest a simplification to your scan() routine (reducing the loc by half+):

    sub scan ($) {
        my $line = shift;
        while ($line =~ m/<\s*(\w+)([^>]*)>/g) {
            next unless is_literal($1,$2);
            my $start = pos($line);
            my $len = index($line,"</$1>",$start) - $start;
            my $passage = \substr($line,$start,$len);
            $$passage = escape_out($$passage);
            pos($line) = $start;
        }
        print $line;
    }

    This uses assignment to pos() at the end of the loop to reset to where we left off so we may continue our match after modifying the string. Also, this takes a reference to the value returned by substr() ... that value is an Lvalue, so assigning through the reference changes the substring being pointed to in the original string (perhaps a wee bit obfu for production use, but that's your decision :-)
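A minimal sketch of the substr() reference trick on its own, outside the scanning loop. The helper name `escape_slice` and the sample string are made up for illustration; only the `\substr(...)` idiom comes from the reply above.

```perl
use strict;
use warnings;

# Hypothetical helper: escape one slice of a string through an lvalue reference.
sub escape_slice {
    my ($text, $start, $len) = @_;
    my $slice = \substr($text, $start, $len);   # reference to the substr() lvalue
    (my $escaped = $$slice) =~ s/&/&amp;/g;     # read the slice once, escape a copy
    $escaped =~ s/</&lt;/g;
    $$slice = $escaped;                         # writing through the reference edits $text
    return $text;
}

print escape_slice('x = a < b;', 4, 5), "\n";   # prints: x = a &lt; b;
```

The slice is named once and then both read and written through the reference, which is what removes the second substr() call from the original two-pass version.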

    Of course, if the data doesn't follow exactly according to your expectations (a closing </listing > tag for example won't be found because we didn't allow for a trailing space in the closing tag, nor did we check that index() found a closing tag, ...), then all bets are off for your preprocessor (OK, so I did make a fragility comment).
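One way the two gaps mentioned above could be closed, as a sketch rather than a drop-in replacement: the closing tag is matched with a regex that tolerates whitespace, and a missing closing tag raises an error. The name `scan_strict`, the `%literal` lookup table, and the choice to build a new string instead of editing in place are all assumptions made for this example.

```perl
use strict;
use warnings;

# Assumed set of literal elements, taken from the demo's is_literal().
my %literal = map { $_ => 1 } qw(listing signature);

sub escape_out {
    my ($passage) = @_;
    $passage =~ s/&/&amp;/g;
    $passage =~ s/</&lt;/g;
    return $passage;
}

sub scan_strict {
    my ($text) = @_;
    my ($out, $copied) = ('', 0);       # output so far, and how much input it covers
    while ($text =~ m/<\s*(\w+)[^>]*>/g) {
        my ($name, $start) = ($1, pos($text));
        next unless $literal{$name};
        # Continue from the current pos(): lazily grab the body, then a closing
        # tag that may contain whitespace, e.g. "</listing >".
        $text =~ m{\G(.*?)(<\s*/\s*\Q$name\E\s*>)}gcs
            or die "no closing tag for <$name>\n";
        my ($body, $close) = ($1, $2);
        $out .= substr($text, $copied, $start - $copied) . escape_out($body) . $close;
        $copied = pos($text);           # the outer match resumes after the closing tag
    }
    return $out . substr($text, $copied);
}

print scan_strict('<P>intro</P><listing>if (a < b && c) run();</listing >'), "\n";
```

Because the output is assembled separately, the input string is never modified and pos() needs no manual reset.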

      $passage = \substr

      Yes, that's exactly what I was trying for. I seem to recall trying to do this before, but it didn't work. Maybe I never got the syntax right, or it wasn't right on earlier versions.

      As for exactly following expectations, that's the point: it's all literal until a very strict end condition is reached. The <<EOF ... \nEOF construct is "fragile" too, as is forgetting to escape out a slash in a RE. Either follow the rules or get an error when things don't match up.


Re: CDATA-like "literal" tags in XML-like data
by mirod (Canon) on Nov 09, 2001 at 17:46 UTC

    I hate to look like an XML ayatollah but I think you are going down a slippery path. XML is XML, and what you want is not XML. XML gives you native ways to encode your "literal" chunks so the parser is happy with them. You should use them. If you want a different format then you should use a pre-processor, to turn your quasi-XML into real XML. As the XML parser will never see the original file you can just have a special marker for the beginning and end of literal code, you don't need to use attributes on existing tags. You can basically use anything, I would use something illegal in XML and unlikely to happen in your literal text, &&& for example, or a tag if you really want to:

    Your pre-processor would then be as simple as this:

    #!/usr/bin/perl -w
    use strict;

    my $literal_tag= "literal";

    {
        local $/;   # slurp mode: read all of DATA as a single string
        for (<DATA>) {
            # tag version, the &&& version would be even simpler
            # s{&&&(.*?)&&&}{xml_escape($1)}ges;
            s{<\s*$literal_tag\s*>(.*?)<\s*/\s*$literal_tag>}
             {xml_escape($1)}geso;
            print;
        }
    }

    sub xml_escape {
        my $literal= shift;
        $literal=~ s/&/&amp;/g;
        $literal=~ s/</&lt;/g;
        return $literal;
    }

    __DATA__
    <doc>
    <p>A regular para</p>
    <literal>there you put the code you want, including & and <> and all</literal>
    <p>another regular para</p>
    <literal>more <> code & stuff</literal>
    </doc>
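For comparison, the `&&&` marker variant that the code comment mentions could look roughly like this. It is a sketch wrapped in a hypothetical `preprocess` function so it can be run on a string; it assumes, as the reply does, that `&&&` never occurs inside the literal text itself.

```perl
use strict;
use warnings;

sub xml_escape {
    my ($literal) = @_;
    $literal =~ s/&/&amp;/g;
    $literal =~ s/</&lt;/g;
    return $literal;
}

# Hypothetical wrapper: everything between a pair of &&& markers is escaped,
# and the markers themselves are dropped.
sub preprocess {
    my ($doc) = @_;
    $doc =~ s{&&&(.*?)&&&}{xml_escape($1)}ges;
    return $doc;
}

print preprocess('<p>see &&&if (a < b && c) { go(); }&&&</p>'), "\n";
```

The same marker opens and closes a literal run, so the pattern relies on the markers always coming in pairs.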

    Frankly, using CDATA sections is simpler and lets your original documents be well-formed XML, but that's your call.

      Hmm, I really like the idea of not being context-sensitive. Yes, just because PM has magic <code> tags doesn't mean it's the only solution to the ergonomic problem.

      Having a single sequence that was used to delimit literal text, rather than configurable multiple sequences, would mean that a simple s/// operator would be able to find them.

      In fact, if separate begin and end sequences were used, they could be turned into "<![CDATA[" and "]]>" respectively, with no need to grab everything in between. That would be very useful for operating as a simple filter on the input.
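A rough sketch of that filter idea. The delimiters `{{{` and `}}}` and the function name `markers_to_cdata` are hypothetical, picked only for this example; any begin/end pair that cannot occur in the literal text would serve.

```perl
use strict;
use warnings;

# Hypothetical begin and end sequences for literal text.
my ($open, $close) = ('{{{', '}}}');

sub markers_to_cdata {
    my ($text) = @_;
    # Each marker is rewritten independently; the text between them is never captured.
    $text =~ s/\Q$open\E/<![CDATA[/g;
    $text =~ s/\Q$close\E/]]>/g;
    return $text;
}

print markers_to_cdata('<p>{{{if (a < b && c) { go(); }}}}</p>'), "\n";
```

Since neither substitution looks at the other marker, the function can be applied one line at a time as a stream filter. One caveat: literal text that itself contains "]]>" would end the CDATA section early, so that sequence would still need special handling.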