Untainting safely. (b0iler proofing?)

by BrowserUk (Patriarch)
on Jun 25, 2002 at 19:39 UTC ( [id://177184]=perlquestion )

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Yesterday, while casting around looking for 'the right way' (Yes! I know, TMTOWTDI! So let's say 'a right way') to do some things, I encountered a link here that led me to read this. Now, without getting into the debate about the merits of the author's style or motivations, or even the specific details of the article, it did serve to highlight several weaknesses in my own treatment of 'external input' and my attempts to 'sanitise' and untaint it.

I've looked around PM for a 'standard', tested way of achieving this (what we might call b0iler proofing, at the risk of the collective ire for a bad pun and of conferring undue notoriety).

It seems to me that what is needed (I need) is a subroutine, that takes a string and removes all 'unsafe' (meta) characters and character sequences.

My thoughts on writing this are:

  • Don't use regexes for the parsing - I also recently discovered that even experienced monks can have trouble getting these right.
  • Don't do it in a way that would allow embedded escaping (or nulls etc.) to be processed.
  • Allow as many others as possible to review the code, in the hope of it becoming 'well refined'.

I've had a couple of attempts at doing this. I started by looping over the string and inspecting each char individually using ord(), comparing against a list of 'known values'.

I then thought of unpack()ing the string to ensure that Perl wouldn't do any magical escaping.

But my Perl skills so far are such that I'm reluctant to trust my own code (and even more reluctant to offer it here for public review again :o), so...

My question:

Would you kind people care to share your code to achieve the aims above, or point me at code that will achieve those aims?

Or offer your input on extending those aims.


Edit by dws to fix tags

Replies are listed 'Best First'.
Re: Untainting safely. (b0iler proofing?)
by Fastolfe (Vicar) on Jun 25, 2002 at 19:45 UTC
    It depends entirely on the application. We don't know what "unsafe" is without knowing the context. If you're talking about shell meta-characters, it depends on which shell you're using (and which shell the user will be using), and should be relatively moot if you use the multiple-argument form of calls like system and exec, which wouldn't do any shell expansion anyway. If you're talking about unsafe text in HTML, we have things like HTML::Entities.

    Basically, identify what you're going to be doing with the data, and then figure out how you're going to ensure that this untrusted data is safe.

    And no matter how you approach it, don't think of your algorithm as being built to remove bad things. Build it to permit safe things. If this means doing a tr/a-zA-Z0-9_-//cd, then that's what you have to do.

      I think the last paragraph should be highlighted. Do not remove bad things. Permit safe things.

      A few weeks ago, in a reply to someone on a similar topic, I wrote:

      1. There is NO single list of dangerous characters. What characters are dangerous depends on the action you do with the data.
      2. If you or someone else creates a list of suspicious characters and tests whether the data contain any of them, you are NOT safe. You are sure to forget some character, and there is sure to be something you've never heard of that can go wrong.
      3. Always test whether the data DO CONTAIN ONLY ALLOWED characters. And allow only the characters you must.
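A minimal sketch of the "permit safe things" approach described above. The helper names and the 64-character limit are assumptions made for illustration, not code from the thread. The first helper accepts a value only if every character is in the allowed set; the second is the tr/// form quoted earlier, which silently deletes everything else.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value if it consists solely of allowed
# characters, otherwise undef. Returning the capture ($1) is what untaints
# the value when the script runs under -T.
sub allow_only_word_chars {
    my ($input) = @_;
    return undef unless defined $input;
    # \A and \z anchor the whole string; \z (unlike $) rejects a trailing newline.
    return $input =~ /\A([A-Za-z0-9_-]{1,64})\z/ ? $1 : undef;
}

# The tr/// form: delete every character outside the allowed set.
# Note that tr/// does not untaint its result.
sub strip_to_word_chars {
    my ($input) = @_;
    (my $copy = $input) =~ tr/a-zA-Z0-9_-//cd;
    return $copy;
}

print defined allow_only_word_chars('part-42') ? "accepted\n" : "rejected\n";
print strip_to_word_chars('rm -rf /; echo'), "\n";
```

Rejecting the input outright (the first helper) is usually easier to reason about than silently rewriting it (the second), since the caller learns that something unexpected arrived.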


•Re: Untainting safely. (b0iler proofing?)
by merlyn (Sage) on Jun 25, 2002 at 19:49 UTC
    takes a string and removes all 'unsafe' (meta) characters
    That's too broad. What's unsafe to the shell is not unsafe to an email address, and vice versa.

    And contrary to what I picked up from skimming that long article, the best way to keep the shell from interpreting unsafe characters is to not even use a shell at all! Most child process invocations can use a shell-less invocation (multiple arguments to system or exec), and then there's never a problem with the potential characters in the first place!
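As a rough illustration of the shell-less invocation described above, the sketch below passes a value full of shell metacharacters to a child process using the list forms of open and system. The child here is just the running perl ($^X) echoing its argument back; the variable names and the hostile string are assumptions chosen for the example.

```perl
use strict;
use warnings;

# Untrusted value containing shell metacharacters.
my $arg = 'report.txt; echo INJECTED';

# List form of a piped open: the program is executed directly and each
# argument is handed over as-is, with no shell in between to interpret ';'.
open(my $child, '-|', $^X, '-e', 'print $ARGV[0]', $arg)
    or die "cannot start child: $!";
my $seen = do { local $/; <$child> };
close($child) or die "child failed: $?";

# List form of system(): the same idea when no output needs capturing.
# The child exits 0 only if it received exactly one argument.
my $status = system($^X, '-e', 'exit(@ARGV == 1 ? 0 : 1)', $arg);

print "child saw: $seen\n";
print "system status: $status\n";
```

The child receives the whole string as a single argument, so the embedded "; echo INJECTED" is never run as a command.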

    So, while I understand what you are trying to do, I don't understand why you are even trying to do it. You're starting at the wrong end of the picture.

    -- Randal L. Schwartz, Perl hacker

      I think what BrowserUK might be asking for is a library that can be used to handle the "usual" types of data. I'm interested in such a thing, as time and time again, people are untainting the same kinds of data.

      Admittedly, such a module would never be strictly perfect, but it would be arguably better than the current condition. Positive progress is better than nothing, isn't it?

      What about validators or untainters for such common bits of data as:
      • Dates
      • Times
      • Names
      • Integers
      • Phone Numbers
      • Postal Codes
      • URLs
      • E-mail addresses
      • Free-form comments
      These are, of course, vague concepts at best, but as an example, as wild and wooly as telephone numbers get, they usually don't have ampersands, double-quotes, or backslashes in them. Likewise, e-mail addresses can contain quite a lot of hair, but there are certain characters that are just not valid.

      The idea is not that these things are necessarily valid names, phone numbers or e-mail addresses, but that they are, at least, not dangerous or radioactive. In other words, a name of "Smith'; DROP TABLE foo;" would not be valid.

      Just an idea.
        URLs and email addresses are never "unsafe" when handled safely. So if you're looking for Email::Valid, it's done.

        And why would you be passing a date, time, or name near a shell? I'm still confused. That's still thinking from the wrong end.

        As for your DROP TABLE example, if you are using placeholders correctly, that value wouldn't matter.

        So, I'm still not convinced that there needs to be a standard "untainting" library. When the data is handled properly, we don't need to "match" "safe" data. Period.

        -- Randal L. Schwartz, Perl hacker

Re: Untainting safely. (b0iler proofing?)
by BrowserUk (Patriarch) on Jun 25, 2002 at 22:40 UTC

    My interpretation of the referenced article was that the biggest problems were caused by several factors:

    • Not all programmers are, or could be, 'experts' in understanding all the cracks in the individual or collective armour of Perl's calls to system functions, or of all the systems that Perl runs on.
    • Even experts make mistakes.
    • Re-using (and reusability) are good. CPAN is one of Perl's great strengths. The problem is, when the programmer passes values to modules, he does not know exactly how those values will be used. - Sure, he can read the source, but that negates half the benefit of reuse.
    • Doing sufficient screening (or just enough allowing) for the required use of input is fine, but what happens in 6 months or a year when the functionality needs to be extended? Will the maintenance programmer know or understand what sanitising was done and why? The article had an interesting section that showed how consecutive filtering was order dependent. Programmer 1 gets input, does appropriate filtering for the planned use, and uses it. Programmer 2 comes along a month or 6 later with a requirement for new functionality, goes in, grabs the value already untainted - and uses it.

      I don't think "only employ competent programmers" cuts it here!

    As an example, my current project (it's only a learning exercise at this point, so please don't pollute the thread by critiquing the project design) uses .xml files to describe 'things', and these are used to build the HTML for displaying. The user selects the 'thing' of interest by clicking on a menu. The identifier of the 'thing' is passed as a URL search parameter, and then that identifier is used to build the filename of the .xml file that is opened. The path/filename constructed is then passed to XML::Simple to read and process. The menus that the user clicks on are themselves generated by processing input from readdir().

    The aim is that new things and groups of things can be added to the site by simply dropping a new .xml file in the appropriate directory and creating new directories respectively. Updates to existing things would be done by editing the .xml (and then validating before putting (back) into the production environment). The idea being that you don't need to code new HTML to add/delete/update 'thing' pages - you edit the .xml, validate it against a custom DTD and move/copy it to the appropriate place and all the layout of the HTML is taken care of by an intelligent Perl script. (Once I get up to speed enough to write such a beast.:)

    This simple, small sample of the project implementation raises (at least) the following questions:

    The following are only example questions; I am not seeking answers to them here!

    • What does XML::Simple do with the parameter I pass?
    • How restrictive should I be on allowable values for filename chars? What happens if later on this is seen as "too restrictive" and the rules get modified?
    • If the user embeds a URL-escaped null after the product identifier and I haven't checked for embedded null chars (I hadn't!) and I pass this along to XML::Simple, will it be used in a way that could be vulnerable?
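One way the identifier-to-filename step could be handled is sketched below. The helper name thing_file, the base directory, and the allowed character set are all assumptions for illustration; the real project's identifiers may need a different rule. The point is only that an allow-list applied to the whole string rejects an embedded null, a slash, or a dot before any filename is built.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: map a 'thing' identifier from the query string to an
# .xml path under $base_dir, or return undef if the identifier is not acceptable.
sub thing_file {
    my ($base_dir, $id) = @_;
    return undef unless defined $id;
    # Allow-list: letters, digits, underscore, hyphen. A NUL byte, '/', '.',
    # or newline fails the match, so "widget\0" or "../x" never becomes a path.
    return undef unless $id =~ /\A([A-Za-z0-9_-]{1,64})\z/;
    return File::Spec->catfile($base_dir, "$1.xml");
}

for my $id ('widget', "widget\0.html", '../../etc/passwd') {
    my $file = thing_file('/srv/things', $id);
    print defined $file ? "open $file\n" : "rejected\n";
}
```

Because the menus are generated from readdir(), another option is to accept only identifiers that appear in the list the script itself produced.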

    I know there are more questions.

    In answer to Merlyn's statement:...I don't understand why you are even trying to do it... and other statements about context...

    It struck me that there are only a limited (and relatively few) modes of possible exploitation/failure for external data. However, these points of exploitation/failure can be spread throughout a project. Many of you monks will have already "rolled your own" solutions to some or all of these - perhaps many times.

    It seemed to make sense, from both factorisation and maintenance points of view, to handle the screening and untainting of 'external input' in a centralised manner. That way, if (when?) new failures and exploits are described or occur, the required modifications only need to be done once, in one place.

    To this end, I thought a sub (module?) called (for example) sanitise() (or sanitize() if you prefer:), that "takes care" of this was appropriate.

    The input would be the tainted string, the return as appropriate. Given the discussion thus far regarding context, perhaps a second parameter would be a constant chosen to define what type of sanitisation was required, e.g.

    • PATH - perhaps RELATIVE_PATH | ABSOLUTE_PATH would be necessary?
    • HTML
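A rough sketch of what such a context-driven interface might look like, assuming a dispatch table of per-context allow-list patterns. The context names, the patterns, and the argument order are hypothetical; nothing here is an existing module. An HTML context is left out on purpose, because output destined for HTML is normally encoded at the point of display rather than filtered on input.

```perl
use strict;
use warnings;

# Hypothetical contexts and rules. Each rule captures the accepted value,
# so the returned string is also untainted under -T.
my %RULE = (
    RELATIVE_PATH => qr{\A((?:[A-Za-z0-9_-]+/)*[A-Za-z0-9_-]+(?:\.[A-Za-z0-9]+)?)\z},
    IDENTIFIER    => qr{\A([A-Za-z0-9_-]{1,64})\z},
);

# Return the value if it satisfies the rule for $context, undef otherwise.
# An unknown context is a programming error, so it dies instead of guessing.
sub sanitise {
    my ($context, $input) = @_;
    my $rule = $RULE{$context}
        or die "no rule for context '$context'";
    return undef unless defined $input;
    return $input =~ $rule ? $1 : undef;
}

my $path = sanitise('RELATIVE_PATH', 'products/widget.xml');
print defined $path ? "ok: $path\n" : "rejected\n";
```

Keeping the rules in one table means a tightened pattern takes effect everywhere the context is used, which is the maintenance benefit argued for above; it does not remove the need to know which context a given value is about to be used in.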

    Maybe I am "looking in the wrong places" or "don't understand the problem"? Maybe I am trying to factor something that is either too complicated or too trivial to be factored? Maybe I am reacting in paranoia, or just 'knee jerking' in response to the referenced article, but I felt that the strongest conclusion to draw from the article was that DIY sanitisation (even by security experts) of external input was the biggest source of vulnerabilities and exploits, and I thought that this was an appropriate way of solving that problem.

    Sorry this got so long, but re-reading it, there is nothing that I feel should be left out!
      Again, I've got to emphasize. There is no such thing as "unsafe data". Merely "data used unsafely". So a hypothetical sanitize routine could at best be written as:
      sub sanitize { die "If you had to call me, you've lost already"; }
      You must fix the behavior of your code, not wrestle your data to the floor.

      -- Randal L. Schwartz, Perl hacker

Re: Untainting safely. (b0iler proofing?)
by tadman (Prior) on Jun 26, 2002 at 00:03 UTC
    I'm completely stunned that you'd suggest not using regexes for parsing. They might be hard to get "right", but I assure you, getting the same effect with ord and unpack is going to be a struggle you don't want to pursue.

    Peer review and a huge number of test cases, especially those culled from real-world experience, can help make your validation routine more robust. For example, check through your current database and make sure everything passes before unleashing your validator on the Web site. It's really unpleasant to find out that Really Important Customer XYZ can't post their Really Big Order because their part number has a dash in it, and your validator rejects that as invalid.

      I'm completely stunned that you'd suggest not using regexes for parsing

      Did you read the article? If not, please take the time to scan it (search for "s/" to get to the relevant sections) and perhaps you'll see why this method (ord() and unpack()) seemed appealing. The way that the interpolation regexes perform can be exploited to bypass even the most sophisticated set of multiple regex passes to sanitise a user-supplied path is simply scary.
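To make the kind of bypass being described concrete, here is a small sketch of one well-known failure mode: a filter that deletes every "../" can be fed input that reassembles into "../" once the deletion has happened. The input string and sub names are made up for the example and are not taken from the article. The second sub shows the allow-list alternative, which also uses a regex.

```perl
use strict;
use warnings;

# A deny-list filter: delete every "../" found in the path.
sub strip_dotdot {
    my ($path) = @_;
    $path =~ s{\.\./}{}g;
    return $path;
}

# An allow-list check: one or more plain segments separated by single slashes.
sub is_plain_relative_path {
    my ($path) = @_;
    return $path =~ m{\A[A-Za-z0-9_-]+(?:/[A-Za-z0-9_-]+)*\z} ? 1 : 0;
}

my $hostile = '....//....//etc/passwd';
my $after   = strip_dotdot($hostile);

print "after stripping: $after\n";   # prints "after stripping: ../../etc/passwd"
print "allow-list says: ", is_plain_relative_path($hostile) ? "ok" : "rejected", "\n";
```

The weakness lies in removing bad sequences and trusting what remains, not in regexes as such; the same deny-list logic written with ord() and unpack() would fail in the same way.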

      Peer review and a huge number of test cases, especially those culled from real-world experience

      Exactly why I was suggesting development of such code here! There are very few huge corporations and many millions of small companies in the world. The big ones have the budgets for such in-depth, in-house development, review and expertise. The rest often have one or two developers who are responsible for developing and maintaining the code, with no possibility of enlisting more than their own expertise in reviewing their own work. And when the big ones make mistakes, they have the funds to correct them. When the small ones make mistakes, the financial costs of correction are often too much for their small net worths to bear, and they go under, taking the jobs they provided with them. Permanently.

      Expertise takes either time or money. Those that have invested the time, charge substantially to hire that expertise to others. The big guys have the money to grow that expertise internally or buy it externally. They are still making mistakes. The small guy has neither choice.

      I don't understand why the idea of utilising the collective resources of PM to address and simplify the process of handling security--the one thing that (as I have seen all over PM) is at the top of almost every IT expert's list of major priorities, whatever their flavour--is so shocking?

      BrowserUK (mistakenly posted anonymously)

      Added attribution - dvergin 2002-06-28

        The small guy has neither choice.
        And there's where you're wrong again.

        It's the cost of doing business plain and simple. If you can't include security as part of your budget and still make a profit, you've got a bad business plan.

        It's just as wrong to cheat on security costs as it is to cheat on paying Uncle Sam his share. If you wouldn't even dream of the latter, why are you even debating the former?

        -- Randal L. Schwartz, Perl hacker

        Please accept my apologies. I have just realised that when I made the post to which this is a reply, and this earlier post, I was not logged in. So they have come up as Anonymous Monk, not as me. I don't know how (or if) this can be corrected?

        If not, I offer this post for any that feel the need for --.

Re: Untainting safely. (b0iler proofing?)
by Ryszard (Priest) on Jun 26, 2002 at 18:59 UTC
    I think what merlyn and Fastolfe are getting at is that not all the data passed to your cgi (tm) requires untainting.

    For example, if you accept a $q->param and send it back to the browser as a "Hello world" example, save for a 20Mb input, why would you need to untaint it?

    If, on the other hand, you are calling the shell, you need to check for naughty characters.

    So untainting your data depends completely on the context in which your application is developed.

    I think you *may* be getting confused between untainting and data validation. There is a difference: the former makes sure there are no naughty characters in your input, the latter checks that the data conforms to a standard.

    Anyway, FWIW, I think a centralised method of data validation would be cool; however, I don't think untainting can really be centralised outside of the context of the application for which it's developed.

      Respectfully, I thought that your "Hello world" example, or more classically, the "Hello Ryszard, welcome back" using your name supplied from a form, was benign until I read this - but when you see that embedding HTML and script tags in the name field can, when returned to the browser for display, open up a wealth of possibilities for cross-site scripting and cookie theft, it made me think again.

      Believe me, I am not mixing data validation and untainting up. Data validation is very much an application-specific function. A telephone number or zip code validation routine written for US numbers/ZIPs would have no application here in the UK.

      However, sanitising almost any external input has universal application. The same hacks and cracks that would affect your server will (in most cases) affect my server too.

      As I wrote elsewhere, there are very few uses of external data that are cause for concern - opens, commands, database entry, re-display, passing to other modules - very few more. The hacks that are possible in each of these cases are limited, and the fixes/preventions should be pretty much the same wherever the program is destined to run. It's also much harder, and requires much greater experience, to prevent the "Reverse Directory Traversal" vulnerability than it is to validate a date or a ZIP or telephone number.
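For the re-display case, a minimal sketch of encoding the value at the point of output is shown below. The hand-rolled escape_html sub is an assumption written for illustration so that the example needs no modules; HTML::Entities, mentioned earlier in the thread, is the usual tool for this job.

```perl
use strict;
use warnings;

# Minimal escaper for text placed in HTML element content or in a quoted
# attribute value. It is not suitable for script, style, or URL contexts.
my %ENTITY = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

sub escape_html {
    my ($text) = @_;
    $text =~ s/([&<>"'])/$ENTITY{$1}/g;
    return $text;
}

# A 'name' field carrying a script tag instead of a name.
my $name     = '<script>alert(document.cookie)</script>';
my $greeting = 'Hello ' . escape_html($name) . ', welcome back';

print "$greeting\n";
# prints: Hello &lt;script&gt;alert(document.cookie)&lt;/script&gt;, welcome back
```

The browser then displays the tag as literal text instead of executing it, and no characters had to be stripped from the input to get there.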

      The latter is a fairly standard programming problem.

      The former, as bugtraq proves, is much harder and requires much greater real-world expertise.

      Hence my belief that it is a ripe candidate for standardisation.

      However, it seems that I am in a minority and/or 'nih' syndrome is at play here :(
