PerlMonks  

Extra CGI.pm safety by stripping \x00 bytes?

by rlucas (Scribe)
on May 26, 2005 at 19:38 UTC (#460837=perlquestion)
rlucas has asked for the wisdom of the Perl Monks concerning the following question:

When reading in CGI form fields from a multilingual, utf8 web application, it is not feasible to use the standard idiom for stripping evil characters:
    $string =~ s/[^\w\s\.\,]//g;      # plus any other metachars you want
    # OR
    $rawstring =~ m/([\w\s\.\,]+)/;   # plus any others...
    $string = $1;
Since users will be sending me all kinds of high bytes as part of multi-byte utf8 characters, I need to be more accepting, as I understand it.

However, the issue described in "CGI Security and the null byte problem" persists. Since the null byte can be used to fool various resources, I am tempted to subclass CGI and have the param() method do a s/\x00//g on *everything*. Is this ill-advised -- meaning, might the null byte ever show up in valid utf-8 text?

Remember, CGI uploads of binary files are handled through a different mechanism, so those would not be affected by overriding param(). Does a wise man always strip null bytes from param() returns, and if so, why isn't that the default behavior?

Re: Extra CGI.pm safety by stripping \x00 bytes?
by Zaxo (Archbishop) on May 26, 2005 at 19:47 UTC

    Certainly the null byte can appear in utf-8; code points like \x2400, \x2500, . . . all have them.

    The poison null cracks from perl all occur where C code looks at perl strings and takes them as null-delimited C strings. That typically happens when the string is fed to system and interpreted by the shell. Your caution is justified there, but not as a blanket ban on null bytes.

    Update: Oops! Thanks, guys++, I didn't know that.</blush>

    After Compline,
    Zaxo

      OK - thanks for clarifying that for me. I understood the nature of the crack as described by Ovid in his node (and by others elsewhere on the web). In fact, I'm not anticipating sending anything to system(), and I'm tainting things.

      However, when I send utf8 text to other external C programs (databases, for example, or sendmail), should I take special caution in those cases?

      "Certainly the null byte can appear in utf-8; code points like \x2400, \x2500, . . . all have them"
      No, utf8 is specifically formulated to avoid null bytes in any codepoints apart from zero, eg
      $ perl586 -MDevel::Peek -e'Dump "\x{2400}"'
      SV = PV(0x8181f00) at 0x816e234
        REFCNT = 1
        FLAGS = (POK,READONLY,pPOK,UTF8)
        PV = 0x817c268 "\342\220\200"\0 [UTF8 "\x{2400}"]
        CUR = 3
        LEN = 4
      That codepoint is represented by three bytes, none of which is zero.

      Dave.

      There's no null byte in the UTF-8 encoding of \x2400 (it's E2, 90, 80). Null bytes shouldn't appear in UTF-8 streams unless actually representing a null character: it's part of the design of UTF-8 that any byte <128 represents itself.
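The byte values quoted above can be checked directly. The following is a minimal sketch using the core Encode module; the helper name utf8_hex_bytes is made up for this illustration and is not from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 bytes of one code point
# as a list of two-digit hex strings.
sub utf8_hex_bytes {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return map { sprintf('%02X', ord($_)) } split //, $bytes;
}

print join(' ', utf8_hex_bytes(0x2400)), "\n";   # E2 90 80
print join(' ', utf8_hex_bytes(0x0000)), "\n";   # 00
```

Only code point zero itself yields a zero byte; every byte of a multi-byte sequence has its high bit set.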
Re: Extra CGI.pm safety by stripping \x00 bytes?
by graff (Chancellor) on May 26, 2005 at 23:15 UTC
    As indicated in the corrections to the initial reply, if you are accepting/expecting strings of utf8 bytes in your CGI query/param string -- and this is all supposed to be character data (as opposed to miscellaneous binary or hacker poison) -- then you should not be expecting any nulls, and can safely filter those out before doing anything else, if you like. This will do no damage to utf8 character data.
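A filter of this kind might look roughly like the sketch below. The function name strip_nulls is hypothetical, and the example is kept independent of CGI.pm on purpose; wiring it into a param() override is left as an assumption about how the original poster would use it.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every NUL byte from each value passed in.
# Works on copies, so the caller's data is left untouched.
sub strip_nulls {
    my @values = @_;
    s/\x00//g for @values;
    return wantarray ? @values : $values[0];
}

# Assumed usage with a CGI object $q (not created here):
#   my $name = strip_nulls(scalar $q->param('name'));
print strip_nulls("report.txt\x00.jpg"), "\n";
```

Because no byte of a multi-byte UTF-8 sequence is zero, this leaves valid character data intact.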

    (It's only when you use UTF-16 (BE or LE) that you get null bytes in a unicode stream, and in such cases, the null bytes represent the high bytes of what would otherwise be the plain ASCII+Latin1 set: U0000-U00FF.)
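To see the difference between the two encodings, here is a small sketch with the core Encode module that counts null bytes after encoding plain ASCII text. The helper name null_count is an assumption made for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: count the NUL bytes in a text string
# after encoding it with the named encoding.
sub null_count {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    my $count = ($bytes =~ tr/\x00//);   # tr without replacement just counts
    return $count;
}

printf "UTF-16BE: %d\n", null_count('UTF-16BE', 'Perl');
printf "UTF-8:    %d\n", null_count('UTF-8',    'Perl');
```

Each ASCII character becomes two bytes in UTF-16, one of which is zero, while the UTF-8 form is the same as plain ASCII.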

    If you are using Perl 5.8.x and are converting the parameter string to perl-internal utf8 strings (scalars having their "utf8-flag" set, e.g. by using the Encode::decode() function), then your suggested regex will work fine, because "\w" matches all "letters and numbers" (not just the ASCII letters, digits and underscore, but also the Cyrillic, Greek, Arabic, etc).
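As a rough sketch of that decode-then-filter approach: the function name clean_text is hypothetical, and the input is assumed to arrive as raw UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw UTF-8 bytes into a Perl character
# string, then keep only word characters, whitespace, dots and commas.
sub clean_text {
    my ($raw_bytes) = @_;
    my $text = decode('UTF-8', $raw_bytes);
    $text =~ s/[^\w\s.,]//g;
    return $text;
}

# Greek alpha and beta as raw UTF-8 bytes, followed by ASCII junk.
my $cleaned = clean_text("\xCE\xB1\xCE\xB2;`rm`");
printf "%d characters kept\n", length $cleaned;
```

The decode step is what matters here: on the decoded string the regex sees characters, so the Greek letters count as word characters and survive, while the semicolon and backticks are removed.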

    Actually, contrary to past wisdom, it might be easier to specify the set of characters you want to exclude: particularly, ones in the ASCII range that have magical meanings for things like perl regexes, the shell, SQL, etc. This is actually a small and easily specified set, compared to all the miscellaneous multilanguage punctuation that folks might send you, all of which will consist of multi-byte tuples with high bits set, so they can't trigger anything worse than "invalid input" when misused in vulnerable contexts.
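An exclusion list along those lines could be sketched as follows. The function name and the particular set of characters are assumptions chosen for illustration; the right set depends on where the data ends up.

```perl
use strict;
use warnings;

# Hypothetical helper: delete a deny-list of ASCII metacharacters.
# The set below is illustrative only, not a vetted security list.
sub strip_ascii_metachars {
    my ($text) = @_;
    $text =~ tr/`'"\\;|&<>$(){}[]*?!~//d;
    return $text;
}

print strip_ascii_metachars('name; DROP TABLE users'), "\n";
```

Non-ASCII punctuation, such as the fullwidth parentheses U+FF08 and U+FF09, passes through untouched because only the listed ASCII characters are deleted.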

    (I've been noticing a lot more people using "wide-character" versions of various quotes and brackets -- this tends to have the side effect of avoiding a variety of vulnerabilities involving the use of the ASCII versions of these characters in certain contexts.)

    For that matter, if you're accepting non-ASCII utf8 data via CGI (ASCII is a valid subset of utf8), then you must already be taking care to make sure that this stuff is not misused in your script: if it's going into a database via SQL, then you must be using "?" placeholders in your prepared SQL statements; if it's going into a local file, you must not be using the data to name the file; you surely are not including it in any way in any sort of "system()", backtick or other shell activity, and so on.

    It makes sense to do basic sanity checks on the data (e.g. no null bytes or non-printing ASCII control characters), but beyond that, you shouldn't need to strip it down much to make it "safe", because you shouldn't be doing anything "dangerous" with it in the first place.
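A basic sanity check of that sort might be sketched like this. The function name looks_sane is hypothetical, and allowing tab, line feed and carriage return is an assumption; a stricter application could reject those as well.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text contains no NUL byte and no
# ASCII control characters other than tab, line feed and carriage return.
sub looks_sane {
    my ($text) = @_;
    return $text !~ /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/;
}

for my $input ("hello\tworld\n", "foo\x00bar") {
    print looks_sane($input) ? "accepted\n" : "rejected\n";
}
```

Rejecting the input outright, instead of silently stripping characters, makes it easier to notice when someone is probing the form.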
