Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
OK, here is a sample of how I have tackled problems like this in the past. As I said before, I have done this with small parse engines. I really should learn Parse::RecDescent, but to give you a flavour of what can be done, here is my solution to your original problem.

Note the inclusion of closing out tags to create a balanced structure. That is impossible to do with a regex, but IMHO is very valuable.

Also note how the configuration information winds up in a nice data structure. If someone was asked to allow another tag or new attributes, this would be very easy to modify.

Plus I like making mistakes visible...

The key to all of this? Regular expressions have their own control of logic flow with backtracking and all. If you want what it provides, they rock. But they don't scale to conceptually harder problems...

use strict; use vars qw($raw @opened %ok_tag %unbal_tag); use HTML::Entities qw(encode_entities); %ok_tag = ( p => {}, br => {}, a => { accesskey => 1, charset => 1, coords => 1, href => 1, hreflang => 1, name => 1, tabindex => 1, target => 1, type => 1, }, font => { color => 1, face => 1, size => 1, }, h1 => {}, h2 => {}, h3 => {}, h4 => {}, h5 => {}, h6 => {}, ); %unbal_tag = map {($_, 1)} 'br', 'p'; # Takes an input string and returns it. It will leave alone # the tags in %ok_tags if they have only allowed attributes, # and will escape everything else. It will also insert # needed closing tags for tags not in %unbal_tag. I don't # have to do that, but I felt like it since regular # expressions cannot ever solve that problem. # # Oh, and this probably comes with bugs. That is what # you get for free though! :-) { my $raw; my @opened; sub scrub_input { $raw = shift; my $scrubbed = ''; @opened = (); # Grab a chunk of known OK data while ($raw =~ /([\s\w]*)/g) { $scrubbed .= $1; my $pos = pos($raw); # Search for a tag if ($raw =~ m=\G<(/?)([\w\d]+)=g) { my $is_close = $1; my $tag = lc($2); if (exists $ok_tag{$tag}) { if ($is_close) { # closing tag? $scrubbed .= _close_tag($tag); } else { $scrubbed .= _open_tag($tag); } } else { # This tag is not allowed pos($raw) = 0; } } # Escape if last /g match failed unless (pos($raw)) { if (length($raw) == $pos) { # EXIT HERE # return join '', $scrubbed, map "</$_>", reverse @opened; } else { my $char = substr($raw, $pos, 1); pos($raw) = $pos + 1; $scrubbed .= &encode_entities($char); } } } } sub _close_tag { my $tag = shift; # Check a couple of obvious conditions unless ($raw =~ /\G>/g) { # Oops! return ''; } if (exists $unbal_tag{$tag}) { return "</$tag>"; # Not needed but...*shrug* } # OK then, time to figure out which need to be closed my @searched; while (@opened) { my $open_tag = pop(@opened); unshift @searched, $open_tag; if ($open_tag eq $tag) { # Close em! return join '', map "</$_>", reverse @searched; } } # Closing a tag that was not opened? I don't think so! @opened = @searched; pos($raw) = 0; return ''; } sub _open_tag { my $tag = shift; my $allowed = $ok_tag{$tag}; my $text = "<$tag"; while ($raw =~ /\G(?: # Attribute or close \s+([\w\d]+)=("[^"]*"|[^">\s]+) | \s*> )/gx ) { if ($1) { if ($allowed->{lc($1)}) { $text .= " $1=$2"; } else { # Show the bad tag pos($raw) = 0; return ''; } } else { push @opened, $tag; return "$text>"; } } # If I get here, was not well-formed pos($raw) = 0; return ''; } }

In reply to RE (tilly) 1: Regexes vs. Maintainability by tilly
in thread Regexes vs. Maintainability by Ovid

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others making s'mores by the fire in the courtyard of the Monastery: (14)
    As of 2014-07-29 18:16 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      My favorite superfluous repetitious redundant duplicative phrase is:









      Results (226 votes), past polls