OK, here is a sample of how I have tackled problems like
this in the
past. As I said before, I have done this with small parse
engines. I really should learn Parse::RecDescent, but to
give you a flavour of what can be done, here is my solution
to your original problem.
Note the inclusion of closing out tags to create a balanced
structure. That is
impossible to do with a regex, but IMHO is very valuable.
Also note how the configuration information winds up in a
nice data structure. If someone was asked to allow another
tag or new attributes, this would be very easy to modify.
Plus I like making mistakes visible...
The key to all of this? Regular expressions have their own
control of logic flow with backtracking and all. If you
want what it provides, they rock. But they don't scale to
conceptually harder problems...
use strict;
use vars qw($raw @opened %ok_tag %unbal_tag);
use HTML::Entities qw(encode_entities);
%ok_tag = (
p => {},
br => {},
a => {
accesskey => 1,
charset => 1,
coords => 1,
href => 1,
hreflang => 1,
name => 1,
tabindex => 1,
target => 1,
type => 1,
},
font => {
color => 1,
face => 1,
size => 1,
},
h1 => {},
h2 => {},
h3 => {},
h4 => {},
h5 => {},
h6 => {},
);
%unbal_tag = map {($_, 1)} 'br', 'p';
# Takes an input string and returns it. It will leave alone
# the tags in %ok_tags if they have only allowed attributes,
# and will escape everything else. It will also insert
# needed closing tags for tags not in %unbal_tag. I don't
# have to do that, but I felt like it since regular
# expressions cannot ever solve that problem.
#
# Oh, and this probably comes with bugs. That is what
# you get for free though! :-)
{
my $raw;
my @opened;
sub scrub_input {
$raw = shift;
my $scrubbed = '';
@opened = ();
# Grab a chunk of known OK data
while ($raw =~ /([\s\w]*)/g) {
$scrubbed .= $1;
my $pos = pos($raw);
# Search for a tag
if ($raw =~ m=\G<(/?)([\w\d]+)=g) {
my $is_close = $1;
my $tag = lc($2);
if (exists $ok_tag{$tag}) {
if ($is_close) { # closing tag?
$scrubbed .= _close_tag($tag);
}
else {
$scrubbed .= _open_tag($tag);
}
}
else {
# This tag is not allowed
pos($raw) = 0;
}
}
# Escape if last /g match failed
unless (pos($raw)) {
if (length($raw) == $pos) {
# EXIT HERE #
return join '', $scrubbed, map "</$_>", reverse @opened;
}
else {
my $char = substr($raw, $pos, 1);
pos($raw) = $pos + 1;
$scrubbed .= &encode_entities($char);
}
}
}
}
sub _close_tag {
my $tag = shift;
# Check a couple of obvious conditions
unless ($raw =~ /\G>/g) {
# Oops!
return '';
}
if (exists $unbal_tag{$tag}) {
return "</$tag>"; # Not needed but...*shrug*
}
# OK then, time to figure out which need to be closed
my @searched;
while (@opened) {
my $open_tag = pop(@opened);
unshift @searched, $open_tag;
if ($open_tag eq $tag) {
# Close em!
return join '', map "</$_>", reverse @searched;
}
}
# Closing a tag that was not opened? I don't think so!
@opened = @searched;
pos($raw) = 0;
return '';
}
sub _open_tag {
my $tag = shift;
my $allowed = $ok_tag{$tag};
my $text = "<$tag";
while ($raw =~ /\G(?: # Attribute or close
\s+([\w\d]+)=("[^"]*"|[^">\s]+)
|
\s*>
)/gx
) {
if ($1) {
if ($allowed->{lc($1)}) {
$text .= " $1=$2";
}
else {
# Show the bad tag
pos($raw) = 0;
return '';
}
}
else {
push @opened, $tag;
return "$text>";
}
}
# If I get here, was not well-formed
pos($raw) = 0;
return '';
}
}
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
|
For: |
|
Use: |
| & | | & |
| < | | < |
| > | | > |
| [ | | [ |
| ] | | ] |
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.