http://www.perlmonks.org?node_id=598417

JCHallgren has asked for the wisdom of the Perl Monks concerning the following question:

I'm still new to a lot of this. I have a CGI script that is designed to accept POST data from a form, which works fine, BUT spammers are executing my script and passing junk to it. So far I've been able to trap the junk before it reaches the output file. Is there any easy way to block execution of the script unless it was invoked by MY webpage/form? FYI: Users of the webpage may have various browsers, including some that are quite primitive/limited.
Thanks for any pointers!

Re: newb: Best way to protect CGI from non-form invocation?
by radiantmatrix (Parson) on Feb 05, 2007 at 21:49 UTC

    Well, spammers can easily just visit your page with a fake browser and submit the form, so any security measures you take around making sure your web site originates the request will be imperfect.

    • referer checking is where you use something like CGI::Simple's referer method (no, that's not a misspelling) to make sure that a submitter is actually visiting your site before submitting your form. Referer headers can easily be forged, and the spammer can load your form first with their bot program; either one renders the check ineffective.
    • load-post delays can mitigate some forms of referer-forgery by enforcing a delay between when the form was loaded and when it was submitted. The only way to do this without relying on client-side accuracy is to have the form generated by a script -- you can then use session management to force several seconds to elapse between form load and submit. Good spammer bots defeat this as well.
    • captchas are the common name for a class of tests to verify that a human is posting. The most common implementation is a warped image of a string, which the poster must then type in as text. Many captchas can be easily defeated, so choose wisely when selecting an implementation. Also, captchas that don't alienate visually-impaired users are harder to come by (i.e. expensive), so consider that in making your decision.
    • restricted posting involves requiring a user to sign up for an account before posting. Many bots can figure out simple registration schemes, so be sure to use registration security measures (like the above, e-mail confirmations, etc.) for maximum effect.

    None of these techniques is perfect, but using them (especially the Captcha, if appropriate) will eliminate a large number of spams. Most of them are already implemented in CPAN modules.
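
    For the referer check, a minimal sketch in plain Perl might look like the following. It reads the header straight from the CGI environment rather than going through CGI::Simple; the helper name referer_is_ours and the host www.example.com are made up for illustration. As noted above, the header is trivially forged, so treat the result as a weak signal at best.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the Referer header points at our own host.
# A missing or forged header defeats this, so it is only a first filter.
sub referer_is_ours {
    my ($referer, $our_host) = @_;
    return 0 unless defined $referer && length $referer;

    # Pull the host part out of an http:// or https:// URL.
    my ($host) = $referer =~ m{^https?://([^/:?#]+)}i
        or return 0;
    return lc($host) eq lc($our_host) ? 1 : 0;
}

# In a CGI script the header arrives in the environment.
my $ok = referer_is_ours($ENV{HTTP_REFERER}, 'www.example.com');
print $ok ? "referer looks local\n" : "referer missing or foreign\n";
```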

    <radiant.matrix>
    Ramblings and references
    The Code that can be seen is not the true Code
    I haven't found a problem yet that can't be solved by a well-placed trebuchet

      To add to radiantmatrix's list:

      (most of these are less effective at stopping a persistent abuser, but shouldn't stop any valid postings)

      • robots.txt file to stop the innocent search engines from posting
      • one time keys in hidden inputs to track multiple postings to the form without a refresh
      • timestamps in hidden input to track the length of delay between form request and submission. (for some forms, this isn't appropriate, but you can refill the form and ask them to resubmit and/or do something to confirm they aren't a bot)
      • input validation to ensure that the form hasn't been bypassed (eg, make sure select values are options that were on the form)
      • user-agent filtering as there have in the past been signatures of known misbehaving bots, and you might be able to identify a single abusive system/signature
      • rate limiting on all submissions to your system, rather than just a random per-submission delay. (so the more submissions to the site, the longer the delays introduced ... normally to slow down ballot stuffing so that admins can deal with it)
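
      The rate-limiting item could be sketched roughly as below: the more submissions seen in a recent window, the longer the next one is made to wait. The function name throttle_delay, the 60-second window and the 2-second step are assumptions for illustration, and a real CGI script would have to keep the list of timestamps in a file or database, since each request runs in a fresh process.

```perl
use strict;
use warnings;

# Hypothetical sketch: the delay grows with the number of submissions
# seen in the last $window seconds.
sub throttle_delay {
    my ($times, $now, $window, $step) = @_;   # $times: arrayref of epoch seconds

    # Forget submissions that have fallen out of the window.
    @$times = grep { $now - $_ < $window } @$times;

    my $delay = $step * scalar(@$times);   # seconds to wait before accepting
    push @$times, $now;                    # record this submission
    return $delay;
}

my @recent;   # assumption: stands in for storage that survives between requests
for my $t (100, 101, 102) {
    printf "submission at %d: delay %d s\n", $t, throttle_delay(\@recent, $t, 60, 2);
}
```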

      Oh, and for the original poster: there are plenty of captchas that don't discriminate against the visually impaired, but they may cause problems for some other subset of users. Some simple ones are math problems (arithmetic, not calculus) or 'spot the member that's different' where alt text can work (e.g., 8 bird species and a dog breed). I've even seen 'write 2 in the box'. Of course, CAPTCHAs don't work. See If CAPTCHA isn't the answer. What is? for more details.

      Oh -- and a timestamp hashed against the IP address makes a fairly effective combined one time key and timestamp.
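
      A rough sketch of the 'timestamp hashed against the IP address' idea, assuming a keyed hash (HMAC from the core Digest::SHA module) so that a bot cannot mint its own tokens. The names make_token and check_token, the secret, and the 5-second / 1-hour limits are all illustrative assumptions. The token would go into a hidden input when the form is generated. Note that it can still be replayed inside the time window unless used tokens are also remembered.

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);

# Assumption: $secret is a private string known only to the server.
sub make_token {
    my ($secret, $ip, $now) = @_;
    return $now . ':' . hmac_sha256_hex("$now|$ip", $secret);
}

# Returns a reason string on failure, or an empty string if the token is fine.
sub check_token {
    my ($secret, $ip, $token, $now, $min_age, $max_age) = @_;
    my ($stamp, $mac) = ($token // '') =~ /^(\d+):([0-9a-f]{64})$/
        or return 'malformed';
    return 'forged' if $mac ne hmac_sha256_hex("$stamp|$ip", $secret);

    my $age = $now - $stamp;
    return 'too fast' if $age < $min_age;   # nobody types that quickly
    return 'too old'  if $age > $max_age;   # form was loaded long ago
    return '';
}

my $secret = 'change-me';   # placeholder value
my $token  = make_token($secret, '192.0.2.7', 1_000_000);
my $why    = check_token($secret, '192.0.2.7', $token, 1_000_010, 5, 3600);
my $result = $why eq '' ? 'accepted' : "rejected ($why)";
print "$result\n";
```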

        I would strongly recommend against freely mixing languages (e.g. Perl and PHP) in a single application. If you've already got a Perl application, use it to generate the time stamps and insert them in the output (HTML) that is your form.

        Some information that may be of interest in this matter:

        • HTML::Template is an easy way to use template files for HTML; in this case, you could have a template variable wherever you want the timestamp/etc. to appear in your output.
        • the manual pages for time, sprintf, and the POSIX module are probably useful for dealing with times and conversion. Also, a CPAN search for DateTime is informative for any complicated date math. I would keep complexity down until and unless you need it (that's always true, I think).
        • A refresher on CGI::Simple is a good idea as well
        • Read up on the W3C's WWW Security FAQ
        • There's a section of CGI Programming with Perl titled Security that should be helpful
        <radiant.matrix>
        I DO appreciate your lengthy reply...but I have some follow-up questions:
        1) Any suggested coding to implement 'timestamps' as you described? I'm not that sure how one would generate the timestamp...PHP maybe? (I'm starting to look at how to code PHP also) I presume that almost no delay between form send and reply would indicate a non-human, as nobody would type THAT quick, right?
        2) Same question about 'one time keys', but also..I didn't fully follow how this would work...could you explain just a bit more?
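
        One way to read 'one time keys' (an interpretation, not necessarily what the poster above had in mind): when the form is generated, the server makes up a random key, remembers it, and writes it into a hidden input. When the form comes back, the key must be one the server issued, and it is crossed off so the same page cannot be submitted twice. The names issue_key and redeem_key are hypothetical, and the in-memory hash stands in for a file or database, since a CGI script starts fresh on every request.

```perl
use strict;
use warnings;

my %issued;   # assumption: stand-in for storage that survives between requests

# Called when the form is generated; the key goes into a hidden input.
# rand() is not cryptographically strong; good enough for a sketch.
sub issue_key {
    my $key = join '', map { sprintf '%02x', int rand 256 } 1 .. 16;
    $issued{$key} = time;
    return $key;
}

# Called when the form is submitted; a key is good exactly once.
sub redeem_key {
    my ($key) = @_;
    return 0 unless defined $key && exists $issued{$key};
    delete $issued{$key};
    return 1;
}

my $key = issue_key();
print "first use:  ", (redeem_key($key) ? "accepted" : "rejected"), "\n";
print "second use: ", (redeem_key($key) ? "accepted" : "rejected"), "\n";
```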

        I participated in one of those threads, and as I pointed out in Re: If CAPTCHA isn't the answer. What is?, a CAPTCHA isn't what most people seem to think it is. There are also a few additional suggestions in terms of proving someone is a human.

        I think the most worthy point of the articles you referenced is that nothing is a perfect solution. Spam is a fact of life, and anything you do will (a) be a tradeoff and (b) fail to stop all attacks.

        <radiant.matrix>
Re: newb: Best way to protect CGI from non-form invocation?
by ikegami (Patriarch) on Feb 05, 2007 at 21:59 UTC

    A quick and dirty trick is to add a text field (not a hidden field) named subject to your form. Hide this field from your users using CSS (input[name="subject"] { display: none; }). Most spam bots will fill that field. If that field is set, assume the form was submitted by a bot.

    This trick can be used in conjunction with other methods for defense in depth.
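
    On the server side, the check for the decoy field can be very small. A sketch, assuming the decoded form fields are available as a hash reference (how they get there depends on the CGI module in use); honeypot_tripped is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical check: %$params stands in for the decoded form fields,
# and $decoy names the field that CSS hides from human visitors.
sub honeypot_tripped {
    my ($params, $decoy) = @_;
    my $value = $params->{$decoy};
    return (defined $value && $value =~ /\S/) ? 1 : 0;
}

my %bot   = (title => 'Buy now', text => 'spam spam',  subject => 'cheap pills');
my %human = (title => 'Hello',   text => 'A question', subject => '');
print "bot tripped:   ", honeypot_tripped(\%bot,   'subject'), "\n";
print "human tripped: ", honeypot_tripped(\%human, 'subject'), "\n";
```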

      If that field is set, assume the form was submitted by a bot.

      ... or a real user whose CSS settings differ from your expectations.

        That's no biggie. You can include a warning on the form that would be normally hidden by the same mechanism that hid the input field.

        Title: [_________________________]
        Text:  [_________________________]
               [_________________________]
               [_________________________]
               [_________________________]
               [_________________________]
        LEAVE EMPTY!! [_] <- "subject" field. Normally hidden by CSS.
                             Only non-CSS clients and overriding CSS clients will see.

        And when you receive a form with the field set, you could republish the form (pre-populated) to the client, asking him to resubmit it with the field empty.

        Thanks to those who have SO quickly replied! :)

        Due to old age and vision of my users, the captcha method is pretty much out...and restricted posting is also undesired, so I'm kinda limited to something behind the scenes...

        Assuming the spam is via a bot, how exactly does it find my form on site? And the data I'm getting is much longer than the field size limits on web page, so they either are using their own variant of my page (which I'd need to try and block) or what? If they are humans typing in spam on my site, then it couldn't be as lengthy as I'm seeing..Or?
Re: newb: Best way to protect CGI from non-form invocation?
by Fletch (Bishop) on Feb 05, 2007 at 21:52 UTC

    Short answer: give up now and improve your validation.

    Longer answer: your form specifies an API in the form of a set of parameters that your CGI program expects to receive as an HTTP POST (or GET, depending on how lenient you are) request. At some point this information will be visible in a format that lets a determined programmer write something to submit requests against you. You could try some obfuscation, for example JavaScript that massages and encodes things before submitting. The problem is that you have to give the potential "attacker" that code so that legitimate submissions can be made (to say nothing of making things harder-to-impossible for your "primitive/limited" browser users).

    So yes, you can make a non-trivial-sized speed bump, but you're only going to keep out the kiddies on tricycles, not the determined black hats in four wheel drive vehicles.
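
    What 'improve your validation' might look like in practice, as a sketch: check every field against a length limit and an allowed-character pattern, and reject fields the form never offered. The field names, limits and patterns in %rules are invented for illustration and would need to match the real form.

```perl
use strict;
use warnings;

# Hypothetical field rules: maximum length and an allowed-character pattern.
my %rules = (
    name    => { max => 40,  ok => qr/^[\w .'-]+$/ },
    comment => { max => 500, ok => qr/^[^<>]*$/ },
);

# Returns a list of problems; an empty list means the input passed.
sub validate {
    my ($params, $rules) = @_;
    my @errors;
    for my $field (sort keys %$rules) {
        my $value = $params->{$field};
        my $rule  = $rules->{$field};
        if (!defined $value || $value eq '')    { push @errors, "$field: missing" }
        elsif (length($value) > $rule->{max})   { push @errors, "$field: too long" }
        elsif ($value !~ $rule->{ok})           { push @errors, "$field: bad characters" }
    }
    # Anything the form never offered is suspicious too.
    for my $field (sort keys %$params) {
        push @errors, "$field: unexpected" unless exists $rules->{$field};
    }
    return @errors;
}

my @problems = validate({ name => 'Ann', comment => 'x' x 3000 }, \%rules);
print "$_\n" for @problems;
```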

Re: newb: Best way to protect CGI from non-form invocation?
by zeno (Friar) on Feb 05, 2007 at 22:39 UTC

    I recently read a blog entry on "The Coding Horror" (http://www.codinghorror.com/blog/archives/000712.html) in which the blogger (Jeff Atwood) explained that he had added an extremely low-tech captcha to his submission form-- the same jpg every time. He finds that for his purposes, this works-- it stops 99.9% of his comment spam in his blog, simply because there is a captcha.

    Granted, it may not be the most sophisticated method, but why not try this before you shell out for a high-powered solution?

      Given that I'm using a website host that would seem to be quite flexible in what options I can have...to the point where they are WAY beyond my skills...back to one original point: Is there something that can be set EXTERNAL to my CGI that would prevent its execution when a POST buffer greater than 3K is passed to it? So that my CGI would never have to deal with data and also prevent DOS(?) attacks?
        You should probably take a look at this. It has a lot of helpful tips, along with answering your question in the first entry.
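
        On the 3K question: outside the script, Apache has a LimitRequestBody directive for capping request bodies (assuming the host runs Apache and lets you set it), and CGI.pm honours $CGI::POST_MAX. If neither is available, the script can look at the declared length before reading any input. A minimal sketch of that last option, with made-up helper names:

```perl
use strict;
use warnings;

my $MAX_POST = 3 * 1024;   # the 3K limit mentioned above

# True if the declared request body is small enough to bother reading.
sub post_size_ok {
    my ($content_length, $max) = @_;
    return 0 unless defined $content_length && $content_length =~ /^\d+$/;
    return $content_length <= $max ? 1 : 0;
}

# CGI response to send when the body is refused.
sub too_large_response {
    return "Status: 413 Request Entity Too Large\r\n"
         . "Content-Type: text/plain\r\n\r\n"
         . "Submission too large.\n";
}

# In the script, before anything reads STDIN:
#   unless (post_size_ok($ENV{CONTENT_LENGTH}, $MAX_POST)) {
#       print too_large_response();
#       exit;
#   }
```

        The web server still receives the oversized request, so this saves the script from processing the data rather than stopping a flood.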
Re: newb: Best way to protect CGI from non-form invocation?
by imp (Priest) on Feb 05, 2007 at 21:49 UTC
    The most common tactic is to use a captcha, but like all methods that seek to keep bots out (and humans in), it has weaknesses. The comments in this thread discuss this problem.
Re: newb: Best way to protect CGI from non-form invocation?
by merlyn (Sage) on Feb 06, 2007 at 02:22 UTC
Re: newb: Best way to protect CGI from non-form invocation?
by TedPride (Priest) on Feb 06, 2007 at 04:22 UTC
    Bots are often programmed to defeat the most popular validation methods, such as phpBB's graphical validator, but even a simple custom validation will defeat virtually all of them. I just use a randomly generated 6-character hex string that people have to fill in at the bottom of the form, and since I started doing that, I've gone from hundreds of spams to only a single spam submission - and even that one may have been put through by a human.

    The problem with graphics is that a sufficiently obfuscated graphic is also hard for people to see, and if the graphic doesn't load, people can't submit the form. Text is easier to defeat, but anyone who's spending that much effort to defeat your site security specifically can probably come up with much nastier ways to mess with you: email bombing, loading your most processor-intensive page hundreds of times per second, etc. Your security only needs to be good enough to stop the usual stupid, impersonal spam bot, but not so good that it irritates your users.
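
    The random 6-character hex string is easy to produce in Perl. A sketch under the assumption that the server remembers which string it showed (in a session, a file, or a signed hidden field); the function names are invented, and the comparison forgives case and stray spaces so as not to irritate real users.

```perl
use strict;
use warnings;

# Generate the challenge string shown next to the form.
sub new_challenge {
    return sprintf '%06x', int rand 0x1000000;   # six hex digits
}

# Compare what the visitor typed, ignoring case and whitespace.
sub challenge_matches {
    my ($expected, $typed) = @_;
    return 0 unless defined $typed;
    (my $clean = lc $typed) =~ s/\s+//g;
    return $clean eq lc($expected) ? 1 : 0;
}

my $challenge = new_challenge();
print "Please type this code into the box: $challenge\n";
```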

Re: Validation Validation Validation
by kabeldag (Hermit) on Feb 06, 2007 at 05:03 UTC
    Just a general approach to such a situation. Not a comprehensive solution ...

    1. Attempt to validate properties below the Application Layer of the OSI model.

      1.0 Validate IP address and other Transport and Network Layer properties as required.

      1.1 Log Transport and Network Layer connections and scrutinise.

    2. Attempt to validate Application Layer properties of the OSI model.

      2.0 Validate/Authenticate Application Layer/HTTP access to documents. Set up authentication for
             HTTP server usage.

      2.1 Validate Application Layer HTTP header parameters.

    3. Attempt to validate Application Layer connection/session properties and input data.

      3.0 Validate Application Layer document-specific access. Session id token combined with IP
             address and/or other parameters via some sort of encoding technique. A user/password auth
             combination with perhaps 'CAPTCHA' techniques to validate the
             user login.

      3.1 Validate session form input. Use hidden form values, data length checks, valid characters and/or
             words. Validate input times so that input doesn't come too quickly or too late.

      3.2 Log important events such as logins and form input with appropriate client/session data.

      3.3 Scrutinise event logs ... form input/submissions, authentication etc.

    Logging events is important for validation as well as other reasons. If a determined or experienced bastard
    passes all of the validation checks, you can always check the logs for patterns. Random or not, you
    will notice patterns and can take appropriate action. Beware of blocking/black-listing certain IPs though; you may end up blocking a completely okay network because somebody spoofed an IP or block.
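
    One possible shape for the layered checks plus logging, purely as a sketch: run the checks in order, stop at the first failure, and write one log line per submission so patterns are easy to search for later. The check names, the request hash layout and the log format are assumptions for illustration.

```perl
use strict;
use warnings;

# Run checks in order and stop at the first failure.
# Each check is [ name, coderef ]; the coderef returns true when the request passes.
sub first_failure {
    my ($request, @checks) = @_;
    for my $check (@checks) {
        my ($name, $test) = @$check;
        return $name unless $test->($request);
    }
    return '';
}

# One line per event: timestamp, client address, verdict.
sub log_line {
    my ($when, $request, $verdict) = @_;
    my $outcome = $verdict eq '' ? 'accepted' : "rejected:$verdict";
    return join ' ', $when, $request->{ip}, $outcome;
}

my @checks = (
    [ 'method', sub { $_[0]{method} eq 'POST' } ],
    [ 'size',   sub { length($_[0]{body}) <= 3072 } ],
);
my %req  = (ip => '192.0.2.7', method => 'POST', body => 'x' x 5000);
my $line = log_line(1170738180, \%req, first_failure(\%req, @checks));
print "$line\n";
```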

    It is easy to spoof Network and Transport Layer packet properties, as well as Application Layer properties such
    as the document referrer etc., but if you have no validation, you have no security.

    By no means have I listed every possible validation method nor may I be 100% on target.
    I have just listed a general layered overview. There are suggestions already mentioned in this thread/node,
    but don't stop there. Think about the type of situation you have and apply a security measure to match. Common sense.

    Update (:-s) : Fixed some HTML formatting