Big, bad, ugly regex problem

Ovid has asked for the wisdom of the Perl Monks concerning the following question:

Here's the problem: working on an old site with no taint checking (no, you can't have the URL). I am putting the taint checking in there. Unfortunately, users use the templates on the site to update the text on their own sites. They are only allowed to use 5 HTML tags:

<p>
<a...>
<font...>
<br>
<h1> - <h6>

Other than stripping any HTML tag not listed above, I have been asked to allow them to enter anything else they want. The user's input data is used to create an SQL statement that updates a database. Later, this data is displayed on a Web page.

I need to ensure that their input data does not mess with either SQL statements or cause funky things to happen on their web page. Simply allowing a user to enter something as simple as "< in an input box can screw it up (as the next time the page is created, the input box terminates there and the rest of their data spews onto the page).

My (almost functional) solution is presented below. It's a fully working script that allows you to type in sample input and view the HTML and SQL safe output. Once finished, I will add the routine to the site.

#!/usr/bin/perl -w
use strict;

use HTML::Entities;

while (my $data = <STDIN>) {
    exit if $data =~ /q/i;
    print scrubInput($data) . "\n";
}

sub scrubInput {

    # This sub converts potentially harmful characters to their HTML e
+quivalent.
    # Then, it converts &lt; and &gt; around allowed tags back to < an
+d >
    # Finally, it converts dangerous characters in those tags back to 
+normal.
    # Otherwise, users could have things like <A HREF=&quot;somelink$q
+uot;>
    # show up on a page, which obviously is not a valid anchor.
    
    my $data = shift;
    my $unsafe_chars = '!~%^&*\\|"\'<>-_+=?\/;:\[\]{}()\@\$\.';
    
    # Allowed tags
    #
    # All tags should be a regex without the < or > characters
    # Case is irrelevant
    # Append an underscore to the tag if it can have attributes
    # -- Examples --
    # <font size=1>  would be 'font_'
    # To represent the <h1> through <h6> tags, use 'h[1-6]'
    
    my @tags = ('br',
                'p',
                'font_',
                'h[1-6]',
                'a_'
               );

    $data = encode_entities($data, $unsafe_chars);
    
    # Let's substitute back angle brackets that match our allowed tags
    
    foreach my $tag (@tags) {
        # This substitution is for tags that allow additional attribut
+es.
        # The weird negative lookahead takes into account that the fin
+al
        # > has been replaced by &gt;
        
        $tag =~ s/_$/(?:\\s+(?:[^&]|&(?!gt;))+)?/;
        
        # &#47; is the / found in an end tag.  </a> would be encoded a
+s
        # &lt;&#47;a&gt;
        $data =~ s!&lt;(&#47;)?(/?$tag)&gt;! defined $1 ? "</$2>" : "<
+$2>"!gesi;
    }
    
    # Return those bad characters if they are in <a ... > (allowed tag
+s)
    # Otherwise, things like a <A HREF=&quot;somelink$quot;> somelink 
+</a>
    # will bomb
    
    $data =~ s/
                (                       # Capture to $1
                  <a\s                 #     <a and a space character
                  (?:                  #     Non-capturing parens
                    [^>](?!href)       #         All non > not followe
+d by href
                  )*                   #         zero or more of them
                  href\s*              #     href followed by zero or 
+more space characters
                ) 
                (                      # Capture to $2
                  &#61;\s*             #     = plus zero or more space
+s
                  (                    #     Capture to $3
                    &[^;]+;            #         some HTML character c
+ode (probably " or ')
                  )?                   #         which might not exist
                  (?:                  #     Non-grouping parens
                    .(?!\3)            #         any character not fol
+lowed by $3
                  )+                   #         one or more of them
                  (?:
                    \3                 #     $3 
                  )?                   #     (which may not exist)
                )
                (                      # Capture to $4
                  [^>]+                #     Everything up to final >
                  >                    #     Final >
                )
            /$1 . decode_entities($2) .  $4/gsexi;
    return $data;
}
[download]

It's that final regex that is giving me fits. Typing in

<a href="somesite.html">test</a>
[download]

spits out

<a href="somesite.html";>test</a>
[download]

Worse, typing in

<a & href="somesite.html">test</a>
[download]

results in

<a &amp; href &#61; &quot;somesite&#46;html&quot;>
[download]

I've been pulling my hair out over this for quite some time. I need this script to convert the "dangerous" characters back to normal in an anchor so hyperlinks work correctly, but obviously this is not happening. I need another pair of eyes to take a look at it.

And yes, I know this is a bad security model, but I have to implement it. If anyone can think of another way to tackle this problem, I'm all ears!

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just go the the link and check out our stats.

Back to Seekers of Perl Wisdom