Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"

Big, bad, ugly regex problem

by Ovid (Cardinal)
on Sep 28, 2000 at 02:11 UTC ( #34290=perlquestion: print w/replies, xml ) Need Help??
Ovid has asked for the wisdom of the Perl Monks concerning the following question:

Here's the problem: working on an old site with no taint checking (no, you can't have the URL). I am putting the taint checking in there. Unfortunately, users use the templates on the site to update the text on their own sites. They are only allowed to use 5 HTML tags:
  1. <p>
  2. <a...>
  3. <font...>
  4. <br>
  5. <h1> - <h6>
Other than stripping any HTML tag not listed above, I have been asked to allow them to enter anything else they want. The user's input data is used to create an SQL statement that updates a database. Later, this data is displayed on a Web page.

I need to ensure that their input data does not mess with either SQL statements or cause funky things to happen on their web page. Simply allowing a user to enter something as simple as "< in an input box can screw it up (as the next time the page is created, the input box terminates there and the rest of their data spews onto the page).

My (almost functional) solution is presented below. It's a fully working script that allows you to type in sample input and view the HTML and SQL safe output. Once finished, I will add the routine to the site.

#!/usr/bin/perl -w use strict; use HTML::Entities; while (my $data = <STDIN>) { exit if $data =~ /q/i; print scrubInput($data) . "\n"; } sub scrubInput { # This sub converts potentially harmful characters to their HTML e +quivalent. # Then, it converts &lt; and &gt; around allowed tags back to < an +d > # Finally, it converts dangerous characters in those tags back to +normal. # Otherwise, users could have things like <A HREF=&quot;somelink$q +uot;> # show up on a page, which obviously is not a valid anchor. my $data = shift; my $unsafe_chars = '!~%^&*\\|"\'<>-_+=?\/;:\[\]{}()\@\$\.'; # Allowed tags # # All tags should be a regex without the < or > characters # Case is irrelevant # Append an underscore to the tag if it can have attributes # -- Examples -- # <font size=1> would be 'font_' # To represent the <h1> through <h6> tags, use 'h[1-6]' my @tags = ('br', 'p', 'font_', 'h[1-6]', 'a_' ); $data = encode_entities($data, $unsafe_chars); # Let's substitute back angle brackets that match our allowed tags foreach my $tag (@tags) { # This substitution is for tags that allow additional attribut +es. # The weird negative lookahead takes into account that the fin +al # > has been replaced by &gt; $tag =~ s/_$/(?:\\s+(?:[^&]|&(?!gt;))+)?/; # &#47; is the / found in an end tag. </a> would be encoded a +s # &lt;&#47;a&gt; $data =~ s!&lt;(&#47;)?(/?$tag)&gt;! defined $1 ? "</$2>" : "< +$2>"!gesi; } # Return those bad characters if they are in <a ... > (allowed tag +s) # Otherwise, things like a <A HREF=&quot;somelink$quot;> somelink +</a> # will bomb $data =~ s/ ( # Capture to $1 <a\s # <a and a space character (?: # Non-capturing parens [^>](?!href) # All non > not followe +d by href )* # zero or more of them href\s* # href followed by zero or +more space characters ) ( # Capture to $2 &#61;\s* # = plus zero or more space +s ( # Capture to $3 &[^;]+; # some HTML character c +ode (probably " or ') )? # which might not exist (?: # Non-grouping parens .(?!\3) # any character not fol +lowed by $3 )+ # one or more of them (?: \3 # $3 )? # (which may not exist) ) ( # Capture to $4 [^>]+ # Everything up to final > > # Final > ) /$1 . decode_entities($2) . $4/gsexi; return $data; }
It's that final regex that is giving me fits. Typing in
<a href="somesite.html">test</a>
spits out
<a href="somesite.html";>test</a>
Worse, typing in
<a & href="somesite.html">test</a>
results in
<a &amp; href &#61; &quot;somesite&#46;html&quot;>
I've been pulling my hair out over this for quite some time. I need this script to convert the "dangerous" characters back to normal in an anchor so hyperlinks work correctly, but obviously this is not happening. I need another pair of eyes to take a look at it.

And yes, I know this is a bad security model, but I have to implement it. If anyone can think of another way to tackle this problem, I'm all ears!


Join the Perlmonks Setiathome Group or just go the the link and check out our stats.

Replies are listed 'Best First'.
I wouldn't use a regex here
by tilly (Archbishop) on Sep 28, 2000 at 02:25 UTC
    First I would try Parse::RecDescent and see if you can figure that out.

    If that fails I would use the fact that when you match in scalar context with /g, you can loop over the variable as you parse it. This allows you to create little parse engines. Not as cute for small problems, but it allows one huge regex to turn into a series of small ones and some looping logic, which is much, much better!

    Confession: I don't face this kind of problem often, and I have not faced it since hearing of Parse::RecDescent. So while I think that is a better answer, parse engines are what I have personally done.

      In addition to Parse::RecDescent (which is EXCELLENT --- I am using it 2 hours per day --- it has a ton of awesome features), DCONWAY has also written Text::Balanced which has an HTML/XML tag-parsing function built in.
(dchetlin: HTML::Parser) Re: Big, bad, ugly regex problem
by dchetlin (Friar) on Sep 28, 2000 at 03:00 UTC
    See how this does for you:

    #!/usr/bin/perl -w # vim: filetype=perl use strict; use HTML::Parser; use HTML::Entities; my @tags = map {('(?:\A' . $_ . '\z)')} qw(br p font h[1-6] a); my $tag_RE; { local $" = '|'; $tag_RE = qr/@tags/; } my $unsafe = '^\w\s' my $p = HTML::Parser::->new(api_version => 3); $p->handler(start => \&tag_filter, "tagname, text"); $p->handler(end => \&tag_filter, "tagname, text"); $p->handler(default => sub {print encode_entities(shift,$unsafe)}, "text"); sub tag_filter { print $_[1] if ($_[0] =~ $tag_RE); } local $/; $p->parse(<>);

    Update: I realized that since Ovid seems to want pretty much any special character escaped, it made much more sense to use the negated character class in $unsafe than to have the line noise of all of those special characters and be worried about missing one. It also avoids the typo of having an unescaped `-' in the original that causes all capital letters to be escaped.


(Dermot) Re: Big
by Dermot (Scribe) on Sep 28, 2000 at 02:49 UTC
    A couple of comments. Use a HTML Parser if at all possible. Your brane will thank you in the long run. I think someone has already suggested that. Apart from that though so far I can only see one thing wrong with the regex. You specify an optional open quote with \3 and an optional close quote matching \3 but in between you use \3 with a negative lookahead and for the times when there isn't a \3 bad stuff will happen. I'm not sure really what will happen. IIRC these catch variables are guaranteed to be undefined when you start a new match or substitution.
      Ovid, this is what I came up with after messing with it for a while. It handles the two input strings you were having problems with (quote characters are not optional) but I've no idea if it will work for all possible data. The quote characters optional we can talk about tomorrow.
      #!/usr/bin/perl -w use strict; my ($data, $res); $data = '<a & href="somesite.html">test<\a>'; print "Before substitution: $data\n"; $res = $data =~ s/ ( # Capture to $1 <a and <a\s # a space character ) (?: # Non-capturing parens [^>]* # stuff between a and href ) ( href\s* # href followed by spaces ) ( =\s* # Equals followed by spaces ( ["']+ # Open quote character ) ( [^"']+ # Non open quote character ) (?: \4 # Close quote character ) ) ( > # Not final close angle bracket ) ( [^>]+ # Up to closing angle bracket > # Final close angle bracket ) /$1$2$3$6$7/x; print "no match\n" if ($res eq ""); print "After substitution: $data\n";
      In an /x modified regex the # character is the comment character. The & #61 which represents a space character isn't matching. Instead the & is matching and the #61 to end of line is seen as a comment. Subsequently $3 doesn't match at all due to its optionality and the fact that [^;]+ is greedy.
No q's? (Re: Big, bad, ugly regex problem)
by tye (Sage) on Sep 28, 2000 at 09:08 UTC

    exit if $data =~ /q/i;

    Okay, why is "q" illegal in input? It doesn't seem very dangerous to me! ;-) I think you have a problem with your regex but I think "big, bad, ugly" is overstating it. You probably just wanted something more like:

    exit if $data =~ /^\s*q\s*$/i;

    I couldn't get past this first part so you'll have to rely on the other replies for the rest of it. (:

            - tye (but my friends call me bad names when I act like this)
Re: Big
by runrig (Abbot) on Sep 28, 2000 at 02:50 UTC
    I admit I don't quite get the scope of the whole problem. But as for only allowing certain tags, you could start with:
    #!/usr/local/bin/perl -l -w my $str="<bad tag><a good tag> hello there<br></bad tag></a>"; my @good_tags = qw(p a font br h1 h2 h3 h4 h5 h6); my %good_tags; @good_tags{@good_tags} = (); $str =~ s!(</?(\w*).*?>)!exists $good_tags{lc($2)} ? $1 : ''!eg; print $str;
    You can replace the '$1' by some function to replace characters as you see fit, or capture the '.*?' to $3 and pass it along with $2 to a function to verify whether or not you allow extra attributes with that particular tag. Either way I'd do it in more than one step.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://34290]
Approved by root
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (2)
As of 2018-02-20 02:28 GMT
Find Nodes?
    Voting Booth?
    When it is dark outside I am happiest to see ...

    Results (266 votes). Check out past polls.