Re: Regular expression to replace xml data

by mirod (Canon)
on Oct 06, 2009 at 19:59 UTC (#799575)

in reply to Regular expression to replace xml data

This is not an easy problem. It's quite easy to get partial solutions, and really hard to get a perfect one. The good news is that as long as you are conservative in what you fix, the XML parser will tell you about what you missed and no one will be hurt in the process ;--)

Also, I would assume that what you get is not pathological data designed to trip the parser, but more like "XML by dummies", written by people who don't know the spec or what a parser is. So probably no CDATA sections, no comments, no '>' in attribute values.

My first attempt would look like this:

If we find 2 successive '>' without a '<' in between, then the second '>' should be turned into an entity (the first one closes a tag, but the second one does not). The same goes for 2 successive '<' without a '>' in between: the first '<' is not part of the markup (the second one opens a tag, but the first one does not), so it becomes &lt;. For &, if it doesn't look like the start of an entity, either &name; or &#..., then turn it into &amp;

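#!/usr/bin/perl
use strict;
use warnings;

while( <DATA>) {
    # 2 '>' with no '<' in between: the second one is not markup
    s{>([^<]*)>}{>$1&gt;}g;
    # 2 '<' with no '>' in between: the first one is not markup
    s{<([^>]*)<}{&lt;$1<}g;
    # an '&' that does not start &name; or &#... is a bare ampersand
    s{&(?!\w+;|#)}{&amp;}g;
    print;
}

__DATA__
<doc><data>></data><data>if( 1 < 2 && 2 < 3)</data></doc>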

This doesn't catch the case of a < / > pair that's not part of a tag, as in 'if( $a<$b || $a > $c)'. You can improve this by handling the <s separately first, as they're easier than the >s: if a < is not followed by /?\w+, then it can't be markup (once again a simplification, since the first character of a tag name can't be a digit).
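Here is a rough sketch of that two-step idea. The sub name escape_stray_brackets is made up for illustration, and the "looks like a tag" test is an assumption on my part: a '<' followed by an optional '/' and an ASCII letter or underscore, or by '?' or '!' so that an XML declaration survives. It is still a simplification, not a full check against the spec.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script.
sub escape_stray_brackets {
    my ($text) = @_;

    # Step 1: a '<' that is not followed by '/?' plus a name-start
    # character (or by '?' / '!') cannot open a tag, so escape it.
    $text =~ s{<(?!/?[A-Za-z_]|[?!])}{&lt;}g;

    # Step 2: copy anything shaped like a tag through untouched;
    # any '>' found outside such a tag gets escaped.
    $text =~ s{(<[^<>]*>)|>}{ defined $1 ? $1 : '&gt;' }ge;

    return $text;
}

print escape_stray_brackets('<data>if( $a<$b || $a > $c)</data>'), "\n";
# prints: <data>if( $a&lt;$b || $a &gt; $c)</data>
```

It will still be fooled by something like 'a<b', where the < is followed by a letter, so the parser stays the final judge.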

Some constructs also look like entities but are not, like '&#foo', so you could tighten the regexp there as well. But we are getting to the limits of what's reasonable here.

It all depends on what you want: to limit the number of cases where you have to fix the data manually, or to never encounter any well-formedness error.
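For instance, a stricter ampersand check could look something like the sketch below. The sub name escape_bare_ampersands is hypothetical, and the name pattern is an ASCII-only approximation of what the spec allows. An '&' is left alone only when it starts a complete &name;, &#digits; or &#xhex; reference.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script.
sub escape_bare_ampersands {
    my ($text) = @_;

    # Keep '&' only when it begins a complete reference:
    #   &name;    named entity (ASCII approximation of an XML name)
    #   &#123;    decimal character reference
    #   &#x7B;    hexadecimal character reference
    $text =~ s{&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)}{&amp;}g;

    return $text;
}

print escape_bare_ampersands('a && b &amp; &#foo; &#38; &#x26;'), "\n";
# prints: a &amp;&amp; b &amp; &amp;#foo; &#38; &#x26;
```

Note that this happily keeps any well-shaped name, such as &nbsp;, even if the document never declares it, so the parser may still complain about those.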

Edited: improved explanations (hopefully!)
