PerlMonks  

Another way to get around automated bots

by AssFace (Pilgrim)
on May 05, 2004 at 14:10 UTC (#350776=CUFP)

We have all been to sites that use various techniques to keep bots away from services intended for humans (for example, Yahoo mail doesn't want spammers to set up bots to send out e-mail). Many of those tests use an image - usually of alphanumeric characters - that the user then has to type into a form. Automated bots have started searching the page source for the image, doing what is essentially OCR on it, and then using that data to submit the form.

I have written a way around that which uses only HTML and CSS to represent an image - but if you look at the source of the page, there is no image and no sign of the text that is seen on screen. This is done by treating DIVs like pixels and recreating the image that way.
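A minimal sketch of the DIVs-as-pixels idea, written in Python rather than the Perl of the linked source (which is not reproduced in this thread). The tiny `GLYPHS` font, the function names and the random class name are all assumptions made for illustration, not the actual code behind the proof of concept.

```python
import random
import string

# A tiny 5x5 bitmap font for two glyphs; '#' marks a lit pixel.
# (Hypothetical data, for illustration only.)
GLYPHS = {
    "H": ["#...#", "#...#", "#####", "#...#", "#...#"],
    "I": ["#####", "..#..", "..#..", "..#..", "#####"],
}


def random_class(rng, length=6):
    """Return a random lowercase CSS class name so the markup differs per request."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def text_to_div_html(text, pixel=4, rng=None):
    """Render text as absolutely positioned DIVs, one DIV per lit pixel."""
    rng = rng or random.Random()
    cls = random_class(rng)
    divs = []
    for index, char in enumerate(text):
        x_offset = index * 6 * pixel  # 5 glyph columns plus 1 column of spacing
        for row, line in enumerate(GLYPHS[char]):
            for col, cell in enumerate(line):
                if cell == "#":
                    left = x_offset + col * pixel
                    top = row * pixel
                    divs.append(
                        f'<div class="{cls}" style="left:{left}px;top:{top}px"></div>'
                    )
    rng.shuffle(divs)  # source order no longer follows reading order
    style = (
        f"<style>.{cls}{{position:absolute;width:{pixel}px;"
        f"height:{pixel}px;background:#000}}</style>"
    )
    return style + '<div style="position:relative">' + "".join(divs) + "</div>"


html = text_to_div_html("HI", rng=random.Random(1))
```

The resulting `html` contains no image tag and no literal "HI", only small black squares that a browser lays out into the two letters.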

I have a proof of concept page up here with a static example, a dynamic example (looking at the source code of either of those pages you will see no images and no text that matches what is on screen), and Perl source code for each (static and dynamic).

This technique could also be used on web pages for obfuscating e-mail addresses since the bots can't scan the source to pull out the text of the e-mail address.



-------------------------------------------------------------------
There are some odd things afoot now, in the Villa Straylight.

•Re: Another way to get around automated bots
by merlyn (Sage) on May 05, 2004 at 14:18 UTC

      Yes, dealing with disabled users is tough: for them to make use of the page, the computer needs to be able to see it (and then speak it, in the case of someone who is blind or sight-limited, or whatever the current PC terms are), and if the computer can see it, then any bot can see it.
      I guess I should consider myself fortunate that I don't ever have to program for that - every application I have written has been for an environment where sight is assumed.

      The advantage over using an image is that it can't be scanned by a bot and read - if you have an image on a page, the bot can look for the image and then pull the data from the image (easiest way is using neural net training - well, I guess not "easiest" but most effective for varied image types).

      But if there is no image there, then that particular bot can't find anything. The text is also not on the page, so it can't find anything either. The bot then has to parse the appropriate content on the page (which is easy if it is the only thing on the page, harder as you add more content and dynamically change how you reference the classes) and rebuild it as an image, and then do the analysis on it.
      There are ways of making it much harder for the bot to rebuild it.

      Yeah, it is about 10K to represent the same as what a 1K PNG could have done - certainly not ideal for showing images - but this wouldn't be something that you would do on every page either. That is about a 2 to 3 second download for a 33kbps modem user.
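The download estimate above can be checked with a few lines of arithmetic. This sketch assumes a 33.6 kbps line rate and ignores protocol overhead, latency and any gzip compression, so it is only a rough sanity check.

```python
def download_seconds(size_bytes, link_bits_per_second):
    """Idealised transfer time: payload bits divided by line rate."""
    return size_bytes * 8 / link_bits_per_second


MODEM_BPS = 33_600  # assumed line rate for a "33kbps" modem

css_version = download_seconds(10 * 1024, MODEM_BPS)  # the ~10K DIV version
png_version = download_seconds(1 * 1024, MODEM_BPS)   # the ~1K PNG it replaces

print(f"DIV version: {css_version:.1f}s, PNG version: {png_version:.1f}s")
```

Under those assumptions the DIV version comes out at about 2.4 seconds, inside the 2 to 3 second range quoted.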



      -------------------------------------------------------------------
      There are some odd things afoot now, in the Villa Straylight.
        I guess I should consider myself fortunate that I don't ever have to program for that - every application I have written has been for an environment where sight is assumed.

        I'd be careful about those assumptions (disclaimer: people pay me for accessibility work :-)

        For government or government funded sites in the UK, US and in other countries accessibility is a major issue - contractually or legally depending on locale. For business sites it's becoming a potential legal/PR minefield.

        The advantage over using an image is that it can't be scanned by a bot and read.

        Yes it can. Automating a web browser and a screen grab program isn't hard. With a little more effort they can just parse and interpret the HTML directly.
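As a rough sketch of the "parse and interpret the HTML directly" route, the following Python uses the standard `html.parser` module to collect every DIV that carries inline `left`/`top` pixel offsets and plots them back onto a grid. The markup shape (inline styles, a fixed pixel size of 4) is an assumption about how such a page might be built, not a description of the actual proof of concept.

```python
import re
from html.parser import HTMLParser


class DivPixelParser(HTMLParser):
    """Collect (left, top) offsets from DIVs that have inline left/top styles."""

    def __init__(self):
        super().__init__()
        self.points = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        style = dict(attrs).get("style") or ""
        left = re.search(r"left:\s*(\d+)px", style)
        top = re.search(r"top:\s*(\d+)px", style)
        if left and top:
            self.points.append((int(left.group(1)), int(top.group(1))))


def rebuild_bitmap(html, pixel=4):
    """Return the page's DIV 'pixels' as rows of '#' and '.' characters."""
    parser = DivPixelParser()
    parser.feed(html)
    if not parser.points:
        return []
    width = max(x for x, _ in parser.points) // pixel + 1
    height = max(y for _, y in parser.points) // pixel + 1
    grid = [["."] * width for _ in range(height)]
    for x, y in parser.points:
        grid[y // pixel][x // pixel] = "#"
    return ["".join(row) for row in grid]


# Hypothetical sample markup: three "pixels" listed out of reading order.
sample = (
    '<div style="position:relative">'
    '<div class="q" style="left:8px;top:0px"></div>'
    '<div class="q" style="left:0px;top:0px"></div>'
    '<div class="q" style="left:4px;top:4px"></div>'
    "</div>"
)
rows = rebuild_bitmap(sample)
```

Once the grid is rebuilt, it is an ordinary bitmap again and can be fed to whatever OCR the bot already uses. Styles kept in a separate stylesheet would take more parsing, but the principle is the same.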

        The question is - is it worth the effort for somebody to do this on your site.

        The WAI have a nice working paper on the topic Inaccessibility of Visually-Oriented Anti-Robot Tests for those who are interested in the topic.

        Personally I have found heuristic server-side solutions much more effective. For example:

        • Require a response from the user via email
        • Keep an eye out for registrations coming from the same IP/domain
        • Keep an eye out for registrations with similar data
        • Feedback forms with "random" names and a tracking ID to make them do a lot more work to automate the submission.
        • ... I'm sure you get the idea...
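One way the "random" field names plus tracking ID idea from the list above might look, sketched in Python. `SERVER_SECRET`, `LOGICAL_FIELDS` and the function names are hypothetical, and deriving the names with an HMAC is an assumption of this sketch rather than anything described in the post.

```python
import hashlib
import hmac
import secrets

SERVER_SECRET = b"replace-with-a-real-secret"  # hypothetical; keep it server-side
LOGICAL_FIELDS = ("name", "email", "comment")   # hypothetical form fields


def field_names(tracking_id):
    """Derive the per-form field names from the tracking ID."""
    names = {}
    for field in LOGICAL_FIELDS:
        digest = hmac.new(
            SERVER_SECRET, f"{tracking_id}:{field}".encode(), hashlib.sha256
        ).hexdigest()
        names[field] = "f" + digest[:12]
    return names


def new_form():
    """Create a tracking ID and the field names to render for this form."""
    tracking_id = secrets.token_hex(8)
    return tracking_id, field_names(tracking_id)


def decode_submission(tracking_id, posted):
    """Map posted field names back to logical ones; None if any are missing."""
    names = field_names(tracking_id)
    if any(name not in posted for name in names.values()):
        return None
    return {field: posted[name] for field, name in names.items()}


tracking_id, names = new_form()
posted = {names["name"]: "Ann", names["email"]: "ann@example.com", names["comment"]: "hi"}
decoded = decode_submission(tracking_id, posted)
```

A bot that replays a fixed set of field names gets `None` back, so it has to fetch and parse a fresh form for every submission.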

        Depending on your application it may be worth thinking how much a captured registration is worth in the currency of your choice, and then thinking about how many registrations a minimum wage worker could make on your site in an hour. If the math comes out the wrong way you're going to have to rethink anyway.

      I was amazed the other day when I set up a PayPal account. I found that in order to set up the account I had to repeat back the numbers that I read from a slightly obscured graphic image.

      PayPal is hardly a "toy" site. My immediate thought was, "Isn't this what merlyn's always talking about?" Unbelievable that they would use such a limiting method to authenticate users.


      Dave

Re: Another way to get around automated bots
by kelan (Deacon) on May 05, 2004 at 14:37 UTC

    I've played around with something like this before, and indeed I think it's pretty cool, and can be done with any image. The one big downside is that the "size" of the image, in terms of downloading it, is much larger than the image you're replacing. With a gzip enabled server, that might not be so bad however.

    On the other hand, although this is a cool hack, I really hope it doesn't catch on. I usually browse the web with images off so I can skip the annoying advertisements that represent probably 80% of web images nowadays. With this technique, there's no good way to turn it off except to disable displaying divs. And that would probably be a nightmare. I could see unscrupulous advertisers using this technique to get around image blocking and such.

    PS. For some fun playing around with this for any random image, download bmp2html. You can modify the source to spit out colored divs, like your program does, instead of colored ASCII characters.

        In that article it mentions a sample image, col53-fig.gif. Is that available somewhere? I'm curious to see the final output, but I'm too lazy to install ImageMagick and GD :)

Re: Another way to get around automated bots
by Fletch (Chancellor) on May 05, 2004 at 14:56 UTC

    Scuttlebutt I've heard is that the really determined ones are copying the image, tossing it up on another site and having a hyumon read it (who then gets free pr0n or what not), and sending the result back. This scheme would be just as vulnerable to something similar.

    As a somewhat related aside, I was going to submit something similar to the Obfuscated Perl contest a few years back (but didn't because I used GD and the rules excluded using non-core modules).

Re: Another way to get around automated bots
by Anonymous Monk on May 06, 2004 at 07:59 UTC

    Whatever it is, it doesn't work in Opera 7.23 on Windows... I can't read anything in that, on both your static and dynamic pages. And zooming in doesn't help either. What gives?

      Since I don't have Opera on any of my machines, I didn't test it on that (at the end of the write up that the link points to, I note that I only tested it on a limited set of browsers).

      The fellow that created the CSS Pencils test also noted to me that it doesn't work in Opera. It is some bug in the way I did the CSS, but it doesn't mean that it won't work at all - just takes some tweaking.

      But yes, if your browser won't render the DIVs correctly, then you sure aren't going to see much of anything useful.



      -------------------------------------------------------------------
      There are some odd things afoot now, in the Villa Straylight.
Re: Another way to get around automated bots
by andyf (Pilgrim) on May 17, 2004 at 09:01 UTC
    I think that's jolly inventive, even if it's not entirely practical. Of course plain ASCII art is a similar tactic.

    Interestingly I looked at the complementary problem last year for a rabble of grubby greyhat dotcommers in the next office to me - you guessed it, OCR for noisy .gifs (they actually did perfectly legitimate deeplinking searches).

    I used Image::Magick to read, normalise, greyscale, blur and threshold the image, then take the highest weighted sum of the AND with a test image. Read: nasty brute-force OCR.
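A toy Python sketch of the threshold-then-AND matching described above. It skips the normalise and blur steps, works on plain lists instead of Image::Magick objects, and the weighting (dividing the AND count by the square root of the two pixel counts) is an assumption of this sketch, since the post does not say how the sum was weighted.

```python
import math


def threshold(image, cutoff=128):
    """Turn a greyscale image (rows of 0-255 ints) into a 0/1 bitmap; dark = 1."""
    return [[1 if value < cutoff else 0 for value in row] for row in image]


def weighted_and_score(bitmap, template):
    """Sum of the pixelwise AND, scaled so large templates don't win by default."""
    overlap = sum(b & t for brow, trow in zip(bitmap, template) for b, t in zip(brow, trow))
    set_bitmap = sum(map(sum, bitmap))
    set_template = sum(map(sum, template))
    if set_bitmap == 0 or set_template == 0:
        return 0.0
    return overlap / math.sqrt(set_bitmap * set_template)


def best_match(image, templates):
    """Return the name of the template with the highest weighted AND score."""
    bitmap = threshold(image)
    return max(templates, key=lambda name: weighted_and_score(bitmap, templates[name]))


# Hypothetical 3x3 test images for two letters.
TEMPLATES = {
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}

# A noisy greyscale "T": one stray dark pixel in the bottom-left corner.
noisy = [
    [20, 40, 30],
    [200, 10, 220],
    [90, 60, 250],
]
guess = best_match(noisy, TEMPLATES)
```

Here `guess` is "T": the stray pixel lowers the score a little but not enough for "L" to overtake it.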

    Eventually they replaced my code with a far faster C++ implementation that finds minimum distances between FFTs of the images, which quite frankly laughs at Perl (speedwise).

    However they still get plenty of problems last time I heard. That is to say, done properly, obfuscated images can be computationally VERY hard to OCR, but it can be done.

    Regardless of methodology there is a deeper principle at play here, which connects with what merlyn has to say... eventually you are going to make life so difficult for your end user that any perceptual impairment they have will make reading almost impossible. My (dyslexic) sister has a damn hard time reading those obfuscated .gifs.

    My hypothesis, then: if you are prepared to throw enough cycles at the problem, with a good enough algorithm, the machine will always be able to filter the info from a noisy image _better_ than a human can. Hence the general method is flawed if its sole objective is to defeat bots.

    A better method is to rely on questions from current events news. Make it multiple choice, and make it so that 3 wrong answers out of 5 blocks the IP for an hour.

    Even something like

    Which dictator has no moustache?
    1) Adolf Hitler
    2) Augusto Pinochet
    3) Saddam Hussein
    4) Josef Stalin
    5) George W Bush
    6) George Papadopoulos
    7) Francois "Papa Doc" Duvalier


    would fool pretty much any AI :)

    Andy
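A small Python sketch of the "3 wrong answers out of 5 blocks the IP for an hour" rule. The `QuizGate` class is hypothetical, and reading the rule as "3 wrong among the last 5 answers" is an assumption. The injectable `clock` is only there so the behaviour can be exercised without waiting an hour.

```python
import time
from collections import defaultdict, deque


class QuizGate:
    """Block an IP for a while once too many recent answers were wrong."""

    def __init__(self, window=5, max_wrong=3, block_seconds=3600, clock=time.time):
        self.max_wrong = max_wrong
        self.block_seconds = block_seconds
        self.clock = clock
        # Per-IP record of the most recent `window` answers (True = correct).
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.blocked_until = {}

    def is_blocked(self, ip):
        return self.clock() < self.blocked_until.get(ip, 0)

    def record(self, ip, correct):
        """Record one answer; return True if the IP is (now) blocked."""
        if self.is_blocked(ip):
            return True
        answers = self.history[ip]
        answers.append(bool(correct))
        if list(answers).count(False) >= self.max_wrong:
            self.blocked_until[ip] = self.clock() + self.block_seconds
            answers.clear()
            return True
        return False


now = [0]  # fake clock for the demonstration
gate = QuizGate(clock=lambda: now[0])
results = [gate.record("192.0.2.1", ok) for ok in (False, True, False, False)]
```

In the demonstration, the third wrong answer trips the block, and `gate.is_blocked("192.0.2.1")` stays true until the fake clock passes 3600 seconds.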
      A better method is to rely on questions from current events news. Make it multiple choice, and make it so that 3 wrong answers out of 5 blocks the IP for an hour.

      In these days of proxies, using an IP blocking approach is pretty much a dead end. Blocking IPs will mean that you'll kill off groups of people using proxies, and they're so easy to fake only the technically dull bad people will be affected.

      Without a blocking mechanism it then just comes down to a question of odds.

      I also think you'll be surprised at the high false-negative rate you'll get with real humans getting the questions wrong :-)

        Blocking IPs [...], and they're so easy to fake only the technically dull bad people will be affected.

        Wow. It is easy for you to fake an IP and have the results sent back to you? You'll have to explain that before I believe you.

        If you are using IP for security, then the only risk from faking IPs is that someone can send you data with a forged IP in hopes of getting you to act on it. Simply requiring a minimal dialogue that includes repeating hard-to-predict data is enough to make such extremely unlikely.
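A minimal Python sketch of such a dialogue: the server sends a hard-to-predict value to the claimed address and only acts once it is repeated back. The function names and the HMAC construction are assumptions of this sketch. A real deployment would also want expiry and replay protection, which are left out here.

```python
import hashlib
import hmac
import secrets

SECRET = secrets.token_bytes(32)  # server-side key, never sent to clients


def _tag(ip, nonce):
    """Bind a nonce to the IP address it was issued to."""
    return hmac.new(SECRET, f"{ip}:{nonce}".encode(), hashlib.sha256).hexdigest()


def issue_challenge(ip):
    """Return (nonce, tag); both are sent in the reply to the claimed IP."""
    nonce = secrets.token_hex(16)
    return nonce, _tag(ip, nonce)


def verify_reply(ip, nonce, tag):
    """Accept only if the client repeats back values issued for this IP."""
    return hmac.compare_digest(_tag(ip, nonce), tag)


nonce, tag = issue_challenge("192.0.2.1")
ok = verify_reply("192.0.2.1", nonce, tag)
```

Someone forging the source address never receives the reply carrying `nonce` and `tag`, so they would have to guess them.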

        An attacker having control over a block of IP addresses is a separate issue.

        - tye        

      If you do implement that multiple question thing, let me know so I can avoid the website, OK? From your list of seven dictators, there are three names I don't recognize, and four names that I wouldn't be able to associate a face with...

      Update: Note that this was a flippant response to your flippant example. :)
