PerlMonks  

Regular expression to replace xml data

by dalegribble (Initiate)
on Oct 06, 2009 at 14:14 UTC ( [id://799488]=perlquestion )

dalegribble has asked for the wisdom of the Perl Monks concerning the following question:

I'm modifying the regular expressions for a function that replaces less-than, greater-than, and ampersand characters in an XML data string with their entity equivalents (&lt;, &gt;, &amp;). The test code that I'm currently using is:

my $str = '<Data1>Data</Data1><Data2></Data2><Data3> < </Data3>';
$str =~ s/>(.*?)<(.*?)<\//>$1&lt;$2<\//g;

My desired output is:

<Data1>Data</Data1><Data2></Data2><Data3> &lt; </Data3>

But instead it displays:

<Data1>Data&lt;/Data1><Data2></Data2>&lt;Data3> < </Data3>

Any suggestions on my current regular expression are greatly appreciated.

Replies are listed 'Best First'.
Re: Regular expression to replace xml data
by Fletch (Bishop) on Oct 06, 2009 at 14:28 UTC

    Stop now. Processing XML with just regular expressions is destined for failure*. Use a proper parser (e.g. XML::Twig) and have it work on just the text contents of nodes rather than trying to gin up something yourself. You'll be glad you did later when you're not chasing down corner cases when your input drifts n months hence.

    (*) Save in very simple cases with very regular input which you can guarantee doesn't vary.

    The cake is a lie.

Re: Regular expression to replace xml data
by marto (Cardinal) on Oct 06, 2009 at 14:32 UTC
      Thanks for both of your replies, our current parser is dying due to the additional less than / greater than character.

        Your current parser is dying because it is trying to sew jeans with a knitting needle.

        Regexen are not the right tool for parsing languages for which the meaning of a token depends heavily on context or is part of a recursively nested pair. XML has both those features. Your regex isn't working anymore because it is having difficulty determining the context of the greater than and less than signs it is trying to replace.

        I will grant you that you can get a regex-based parser to work for a controlled set of input, but no matter how hard you try it will be fragile. And the more you try to make it work, the more difficult it will be to explain and maintain those regexen.

        Rewriting something that you trust is never fun, especially if you are fighting deadlines, but the problems you are having signal that you have outgrown the capacity of your old tools and it is time to move on to better tools.

        Best, beth

Re: Regular expression to replace xml data
by Jenda (Abbot) on Oct 06, 2009 at 15:38 UTC

    Looks like other responders did not understand the query so let me reword it. You've got some invalid XML-like stuff that you want to (attempt to) fix so that you end up with parsable XML. Do I have it right?

    You may want to have a look at the PolishHTML subroutine within HTML::JFilter

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      I did understand that the OP is trying to clean up XML-like stuff: unescaped greater and less than signs in the data portion of arbitrary tags are not valid XML, so by definition, this parsing exercise is a clean up exercise.

      However, it doesn't change my observation that handling this with regexen (a) will apply only to special cases, (b) will require complex regexen, and (c) will require going beyond the regex paradigm. For example, the following code will parse his XML sample correctly, but it only works if we can guarantee that same named tags are never nested.

      use strict;
      use warnings;

      my $str = '<Data1>Data</Data1><Data2></Data2><Data3> < </Data3>';

      # this code only works if same name tags are never nested
      # in your XML-like samples.
      $str =~ s/^\s+//;
      my $sResult = '';
      while ($str =~ m{<(\w+)>((?:[^<]|<(?!/\1>))*)</\1>\s*}g) {
          my $tag     = $1;
          my $innards = $2;
          # /g so that every stray sign in the element is escaped, not just the first
          $innards =~ s/</&lt;/g;
          $innards =~ s/>/&gt;/g;
          $sResult .= "<$tag>$innards</$tag>";
      }
      print STDERR "output: $sResult\n";

      Your own module (HTML::JFilter) handles the nested case with grace and it even uses only regular expressions, but you can hardly claim this is a simple set of regular expressions:

      sub PolishHTML {
          my $str = shift;
          if ($AllowXHTML) {
              $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d:\-]*(?:\s+\w[\w\d:\-]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*/?>|</\w[\w\d:\-]*>|<!--.*?-->|$)}
                       {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
          } else {
              $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d:\-]*(?:\s+\w[\w\d:\-]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*>|</\w[\w\d:\-]*>|<!--.*?-->|$)}
                       {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
          }
          return $str;
      }

      Given the complexities of writing and maintaining this sort of code, relying on pre-built and pre-tested modules (such as you have suggested) is a very good idea. Even so, the modules need to be carefully evaluated to make sure they can handle the particular range of XML-like text one needs to process.

      Best, beth

      Update: fixed typo in my code ((?:[^>]|< was a typo. Should have been (?:[^<]|<

Re: Regular expression to replace xml data
by mirod (Canon) on Oct 06, 2009 at 19:59 UTC

    This is not an easy problem. It's quite easy to get partial solutions, and real hard to get a perfect one. The good news is that as long as you are conservative in what you fix, then the XML parser will tell you about what you missed and no one will be hurt in the process ;--)

    Also I would assume that what you get is not pathological, designed to trip the parser, but more like "XML by dummies", who don't know the spec, or what a parser is. So probably no CDATA section, no comments, no '>' in attribute values.

    My first attempt would look like this:

    If we find 2 successive '>' without a '<' in between, then the second '>' should be turned into an entity (the first one closes a tag, but not the second one). Same with 2 successive '<' without a '>' in between: the first '<' is not part of the markup (the second one opens a tag, but not the first one). For &, if it doesn't look like an entity, &name; or &#..., then turn it into &amp;

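    #!/usr/bin/perl
    use strict;
    use warnings;

    while( <DATA>) {
        # second of two '>' with no '<' in between: not markup
        s{>([^<]*)>}{>$1&gt;}g;
        # a '<' followed by another '<' with no '>' in between: not markup
        # (lookahead, so that several stray '<' in a row are all caught)
        s{<(?=[^>]*<)}{&lt;}g;
        # '&' that does not start something entity-like
        s{&(?!\w+;|#)}{&amp;}g;
        print;
    }

    __DATA__
    <doc><data>></data><data>if( 1 < 2 && 2 < 3)</data></doc>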

    This doesn't catch the case of a < / > pair that's not part of a tag, as in 'if( $a<$b || $a > $c)'. You can improve this by first catching stray <s separately; they're easier than >s, because a < that is not followed by /?\w+ can't be markup (once again a simplification: the first character of a tag name can't be a digit).
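    A minimal sketch of that "<s first" idea, under the same assumptions as above (no comments, CDATA sections or processing instructions, since their '<!' and '<?' would be escaped too). The helper name escape_stray_lt is made up for illustration, and the name-start test is a deliberately rough [A-Za-z_:].

```perl
use strict;
use warnings;

# Hypothetical helper: escape every '<' that cannot be the start of a tag.
sub escape_stray_lt {
    my ($text) = @_;
    # A '<' is treated as markup only when it is followed by an optional '/'
    # and then a plausible first character of a tag name.
    $text =~ s{<(?!/?[A-Za-z_:])}{&lt;}g;
    return $text;
}

print escape_stray_lt('<data>if( 1 < 2 && 2 < 3)</data>'), "\n";
# prints: <data>if( 1 &lt; 2 && 2 &lt; 3)</data>

print escape_stray_lt('<data>if( $a<$b || $a > $c)</data>'), "\n";
# prints: <data>if( $a&lt;$b || $a > $c)</data>
```

    Something like 'a<b' still looks like the start of a tag to this test, so it only reduces the number of cases left for the parser to complain about.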

    Also, some constructs might look like entities but are not, like '&#foo', and you could improve the regexp there too. But we are getting to the limits of what's reasonable here.
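    For what it's worth, a stricter ampersand test could look like the sketch below. It assumes an entity is either an ASCII name, a decimal reference (&#38;) or a hex reference (&#x26;), each terminated by ';'. Real XML names allow more characters than that, and the helper name escape_stray_amp is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every '&' that does not start a complete
# named, decimal or hexadecimal reference.
sub escape_stray_amp {
    my ($text) = @_;
    $text =~ s{&(?!(?:[A-Za-z][A-Za-z0-9]*|\#[0-9]+|\#x[0-9A-Fa-f]+);)}{&amp;}g;
    return $text;
}

print escape_stray_amp('1 && 2 &amp; &#38; &#x26; &#foo; &lt'), "\n";
# prints: 1 &amp;&amp; 2 &amp; &#38; &#x26; &amp;#foo; &amp;lt
```

    It does not check that a named entity is actually declared, so '&nbsp;' passes here and is left for the parser to reject.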

    It all depends on what you want: to limit the number of cases where you have to manually fix the data, or to never encounter any well-formedness error.

    Edited: improved explanations (hopefully!)

Re: Regular expression to replace xml data
by grizzley (Chaplain) on Oct 07, 2009 at 08:19 UTC

    I agree with others' posts, but if you have only this case and want a straight answer to your question, here it is:

    $str =~ s/>([^<>]*?)<([^<>]*?)<\//>$1&lt;$2<\//g;
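    A quick check of that substitution against the test string from the question. The difference from the original attempt is that [^<>]*? cannot run across a tag boundary the way .*? does, so the match is confined to a single text node. The wrapper name escape_single_lt is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution above.
sub escape_single_lt {
    my ($str) = @_;
    $str =~ s/>([^<>]*?)<([^<>]*?)<\//>$1&lt;$2<\//g;
    return $str;
}

print escape_single_lt('<Data1>Data</Data1><Data2></Data2><Data3> < </Data3>'), "\n";
# prints: <Data1>Data</Data1><Data2></Data2><Data3> &lt; </Data3>

print escape_single_lt('<a>1 < 2 < 3</a>'), "\n";
# prints: <a>1 < 2 < 3</a>
```

    As the second call shows, it covers exactly one stray '<' per element and leaves anything else (two stray '<', any '>' or '&') untouched.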
      Thank you SO MUCH. I, too, agree with the previous posts, but in my case I can't do anything about the XML parser already in place.
