Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
Here's the problem: working on an old site with no taint checking (no, you can't have the URL). I am putting the taint checking in there. Unfortunately, users use the templates on the site to update the text on their own sites. They are only allowed to use 5 HTML tags:
  1. <p>
  2. <a...>
  3. <font...>
  4. <br>
  5. <h1> - <h6>
Other than stripping any HTML tag not listed above, I have been asked to allow them to enter anything else they want. The user's input data is used to create an SQL statement that updates a database. Later, this data is displayed on a Web page.

I need to ensure that their input data does not mess with either SQL statements or cause funky things to happen on their web page. Simply allowing a user to enter something as simple as "< in an input box can screw it up (as the next time the page is created, the input box terminates there and the rest of their data spews onto the page).

My (almost functional) solution is presented below. It's a fully working script that allows you to type in sample input and view the HTML and SQL safe output. Once finished, I will add the routine to the site.

#!/usr/bin/perl -w use strict; use HTML::Entities; while (my $data = <STDIN>) { exit if $data =~ /q/i; print scrubInput($data) . "\n"; } sub scrubInput { # This sub converts potentially harmful characters to their HTML e +quivalent. # Then, it converts &lt; and &gt; around allowed tags back to < an +d > # Finally, it converts dangerous characters in those tags back to +normal. # Otherwise, users could have things like <A HREF=&quot;somelink$q +uot;> # show up on a page, which obviously is not a valid anchor. my $data = shift; my $unsafe_chars = '!~%^&*\\|"\'<>-_+=?\/;:\[\]{}()\@\$\.'; # Allowed tags # # All tags should be a regex without the < or > characters # Case is irrelevant # Append an underscore to the tag if it can have attributes # -- Examples -- # <font size=1> would be 'font_' # To represent the <h1> through <h6> tags, use 'h[1-6]' my @tags = ('br', 'p', 'font_', 'h[1-6]', 'a_' ); $data = encode_entities($data, $unsafe_chars); # Let's substitute back angle brackets that match our allowed tags foreach my $tag (@tags) { # This substitution is for tags that allow additional attribut +es. # The weird negative lookahead takes into account that the fin +al # > has been replaced by &gt; $tag =~ s/_$/(?:\\s+(?:[^&]|&(?!gt;))+)?/; # &#47; is the / found in an end tag. </a> would be encoded a +s # &lt;&#47;a&gt; $data =~ s!&lt;(&#47;)?(/?$tag)&gt;! defined $1 ? "</$2>" : "< +$2>"!gesi; } # Return those bad characters if they are in <a ... > (allowed tag +s) # Otherwise, things like a <A HREF=&quot;somelink$quot;> somelink +</a> # will bomb $data =~ s/ ( # Capture to $1 <a\s # <a and a space character (?: # Non-capturing parens [^>](?!href) # All non > not followe +d by href )* # zero or more of them href\s* # href followed by zero or +more space characters ) ( # Capture to $2 &#61;\s* # = plus zero or more space +s ( # Capture to $3 &[^;]+; # some HTML character c +ode (probably " or ') )? # which might not exist (?: # Non-grouping parens .(?!\3) # any character not fol +lowed by $3 )+ # one or more of them (?: \3 # $3 )? # (which may not exist) ) ( # Capture to $4 [^>]+ # Everything up to final > > # Final > ) /$1 . decode_entities($2) . $4/gsexi; return $data; }
It's that final regex that is giving me fits. Typing in
<a href="somesite.html">test</a>
spits out
<a href="somesite.html";>test</a>
Worse, typing in
<a & href="somesite.html">test</a>
results in
<a &amp; href &#61; &quot;somesite&#46;html&quot;>
I've been pulling my hair out over this for quite some time. I need this script to convert the "dangerous" characters back to normal in an anchor so hyperlinks work correctly, but obviously this is not happening. I need another pair of eyes to take a look at it.

And yes, I know this is a bad security model, but I have to implement it. If anyone can think of another way to tackle this problem, I'm all ears!


Join the Perlmonks Setiathome Group or just go the the link and check out our stats.

In reply to Big, bad, ugly regex problem by Ovid

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others lurking in the Monastery: (3)
    As of 2020-10-24 17:40 GMT
    Find Nodes?
      Voting Booth?
      My favourite web site is:

      Results (246 votes). Check out past polls.