Re: Regular expression to replace xml data

by mirod (Canon)
on Oct 06, 2009 at 19:59 UTC (#799575)

in reply to Regular expression to replace xml data

This is not an easy problem. It's quite easy to get partial solutions, and really hard to get a perfect one. The good news is that as long as you are conservative in what you fix, the XML parser will tell you about what you missed and no one will be hurt in the process ;--)

Also I would assume that what you get is not pathological, designed to trip the parser, but more like "XML by dummies": written by people who don't know the spec, or what a parser is. So probably no CDATA sections, no comments, no '>' in attribute values.

My first attempt would look like this:

If we find 2 successive '>' without a '<' in between, then the second '>' should be turned into an entity (the first one closes a tag, but the second one does not). Same with 2 successive '<' without a '>' in between: the first '<' is not part of the markup (the second one opens a tag, but the first one does not). For '&', if it is not followed by something that looks like an entity (&name; or &#...), then turn it into &amp;

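    #!/usr/bin/perl
    use strict;
    use warnings;

    while( <DATA>) {
        # 2 successive '>' with no '<' in between: the second one is not markup
        s{>([^<]*)>}{>$1&gt;}g;
        # 2 successive '<' with no '>' in between: the first one is not markup
        s{<([^>]*)<}{&lt;$1<}g;
        # '&' not followed by something like name; or #
        s{&(?!\w+;|#)}{&amp;}g;
        print;
    }

    __DATA__
    <doc><data>></data><data>if( 1 < 2 && 2 < 3)</data></doc>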
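One thing to watch with the '<' substitution in the loop above: each match consumes the '<' that ends it, so in a run like '1 < 2 && 2 < 3' only one of the stray '<' gets rewritten per pass. A possible variant, as a minimal sketch, uses a zero-width lookahead so that every '<' is tested on its own. The sub name escape_lt_before_lt is made up for this illustration, and the same assumptions apply (no comments, no CDATA, no '>' in attribute values).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original post.
# Escapes every '<' that is followed by another '<' before any '>'.
sub escape_lt_before_lt {
    my ($text) = @_;
    # (?=[^>]*<) only looks ahead, it does not consume the next '<',
    # so that next '<' is still available to be tested by /g.
    $text =~ s{<(?=[^>]*<)}{&lt;}g;
    return $text;
}

print escape_lt_before_lt('<doc><data>if( 1 < 2 && 2 < 3)</data></doc>'), "\n";
# prints: <doc><data>if( 1 &lt; 2 && 2 &lt; 3)</data></doc>
```

The '&&' is left alone here on purpose; it is still the job of the '&' substitution.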

This doesn't catch the case of a '<' / '>' pair that's not part of a tag, as in 'if( $a<$b || $a > $c)'. You can improve this by first catching the '<'s separately, as they're easier than the '>'s: if a '<' is not followed by /?\w+, then it can't be markup (once again a simplification, since the first character of a tag name can't be a digit).

Also, some constructs might look like entities but are not, like '&#foo', and you could improve the regexp there too. But we are getting to the limits of what's reasonable here.

It all depends on what you want: to limit the number of cases where you have to fix the data manually, or to never encounter any well-formedness error.
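The "catch the '<'s first" idea could look something like the sketch below. It is only an approximation: the sub name escape_non_tag_lt is invented for this example, the character class for the start of a tag name is ASCII-only (the spec allows more), and '<?' and '<!' are let through so that an XML declaration or a DOCTYPE is not mangled.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original post.
# Escapes any '<' that cannot start a tag: one that is not followed by
# an optional '/' and a name-start character, nor by '?' or '!'.
sub escape_non_tag_lt {
    my ($text) = @_;
    # ASCII-only approximation of a name start: a digit is excluded,
    # since a tag name cannot begin with one.
    $text =~ s{<(?!/?[A-Za-z_:]|[?!])}{&lt;}g;
    return $text;
}

print escape_non_tag_lt('<data>if( $a<$b || $a > $c)</data>'), "\n";
# prints: <data>if( $a&lt;$b || $a > $c)</data>
```

Once the '<' is gone, the remaining '>' follows the '>' of <data> with no '<' in between, so the successive-'>' rule has a chance to pick it up. A '<' followed by a letter, as in 'a<b', still looks like the start of a tag and is left for the parser to complain about.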
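For the entity side, a stricter lookahead might be sketched as below: it accepts only &name;, &#digits; and &#xhex;, so that '&#foo;' gets escaped. The sub name escape_bare_amp is made up for the example, and the name pattern is a rough approximation of what the spec allows.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original post.
# Escapes every '&' that does not start a syntactically plausible
# entity or character reference.
sub escape_bare_amp {
    my ($text) = @_;
    $text =~ s{&(?!(?:[A-Za-z_:][\w:.-]*|\#[0-9]+|\#x[0-9A-Fa-f]+);)}{&amp;}g;
    return $text;
}

print escape_bare_amp('a && b, &#foo; &amp; &#38; &#x26;'), "\n";
# prints: a &amp;&amp; b, &amp;#foo; &amp; &#38; &#x26;
```

This only checks the syntax of the reference. Something like &nbsp; passes the regexp but is still an error for the parser if the entity is not declared, which fits the conservative approach: the parser reports what the regexp let through.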

Edited: improved explanations (hopefully!)
