in reply to Re: blocking site scrapers
in thread blocking site scrapers

Nice idea, but referrers can be forged as well.


Re^3: blocking site scrapers
by jhourcle (Prior) on Feb 09, 2006 at 15:01 UTC

    Yes, they can -- but if someone's scraping the site, they'd have been referred by the site in question to get to the image.

    Checking HTTP_REFERER is for those cases when someone from another website decides to link directly to an image (and/or page) on your site. Back in the early days of HTTP (i.e., 0.9, before there was such a thing as HTTP_REFERER), it was common for people to link to the imagemap and counter CGIs that ran on the server I maintained -- they didn't care, and there was no real way to stop them.

    Likewise, people would find an image they liked (a bullet, some animated GIF, whatever) and link directly to it, sucking down your bandwidth. (The university where I worked only had a T1 in 1994.)

    These days, however, when people check HTTP_REFERER, it's not to stop bots -- it's to stop other sites from linking directly to the images, so that every visitor to their page pulls the image over your bandwidth instead of theirs. Because the linking site has no control over its visitors' browsers, those browsers send an honest referrer, and checking HTTP_REFERER can be a very effective way to cut down on this abuse. However, not all browsers send HTTP_REFERER, so you have to make sure that the null case (no referrer at all) allows the download.
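    A minimal sketch of that check, in Python rather than anything from the original setup. The host names and the function name allow_image_request are made up for illustration; the point is only that a missing referrer is let through while an offsite one is refused.

```python
from urllib.parse import urlsplit

# Hypothetical list of hosts that are allowed to embed our images.
ALLOWED_HOSTS = {"www.example.com", "example.com"}

def allow_image_request(referer):
    """Return True if the image should be served for this Referer value."""
    # Null case: the header is absent or empty. Some browsers and proxies
    # never send it, so refusing here would break legitimate visitors.
    if not referer:
        return True
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        # Malformed value (e.g. a broken IPv6 literal): treat as offsite.
        return False
    # A referer without a recognizable host yields None, which is not allowed.
    return host in ALLOWED_HOSTS
```

    In a CGI script the value would come from the HTTP_REFERER environment variable. As noted earlier in the thread, this does nothing against a client that forges the header.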


    I'm also surprised that no one's mentioned checking X_FORWARDED_FOR to detect proxies (which should have identified the issue with AOL, as well as Squid and quite a few other proxies). There were also some proposals floating about for changing the robot exclusion standard to specify rate limiting and visiting hours, but it's been a decade, and I've never seen any widespread support for them.
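    One way the proxy check might look, again as a Python sketch under assumptions: a CGI-style environment where the header shows up as HTTP_X_FORWARDED_FOR, and the hypothetical goal of not lumping every user behind one proxy into a single "client" when counting requests. The function names are invented here, and the header is supplied by the client side, so it can be forged just like the referrer.

```python
import ipaddress

def client_chain(environ):
    """Return (remote_addr, forwarded_ips) from a CGI-style environment dict."""
    remote = environ.get("REMOTE_ADDR", "")
    raw = environ.get("HTTP_X_FORWARDED_FOR", "")
    forwarded = []
    for part in raw.split(","):
        candidate = part.strip()
        try:
            forwarded.append(str(ipaddress.ip_address(candidate)))
        except ValueError:
            # Skip "unknown", empty strings, and anything else that is
            # not a valid IPv4 or IPv6 address.
            continue
    return remote, forwarded

def rate_limit_key(environ):
    """Build a per-client key that distinguishes users behind the same proxy."""
    remote, forwarded = client_chain(environ)
    if forwarded:
        # A proxy reported an original client: key on proxy plus client.
        return remote + "|" + forwarded[0]
    return remote
```

    For example, a request with REMOTE_ADDR 192.0.2.10 and a forwarded list starting with 203.0.113.7 would be counted under the key "192.0.2.10|203.0.113.7" instead of under the proxy's address alone.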

      I wouldn't call that scraping a site, though. That's just stealing bandwidth, and any well-built ISP or host would have link protection in place. cPanel even has that built into its console, so adding it may be a bit redundant if that's there already. I'm thinking he didn't mean bandwidth stealers but people who use a program specifically made for taking all the images on a site and downloading them, hammering a site for however long it takes to get all the pics. Programs like that usually forge the referrer to make it look like the site itself referred the program. The trap is the way to go to catch those things, since most never obey robots.txt.
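      For what it's worth, here is a rough Python sketch of such a trap, assuming it means a hidden link to a path that robots.txt disallows, so that only clients ignoring robots.txt ever request it. The path /trap/ and the class name TrapBlocklist are hypothetical, and a real setup would persist the ban list and expire entries.

```python
TRAP_PATH = "/trap/"  # hypothetical path, also listed under Disallow in robots.txt

class TrapBlocklist:
    """In-memory ban list fed by requests for the trap path."""

    def __init__(self):
        self.banned = set()

    def check(self, client_ip, path):
        """Return True if the request should be served, False if blocked."""
        if client_ip in self.banned:
            return False
        if path.startswith(TRAP_PATH):
            # Well-behaved robots never get here, and human visitors
            # cannot see the link, so whoever asks for it gets banned.
            self.banned.add(client_ip)
            return False
        return True
```

      Once a client has touched the trap, every later request from the same address is refused, forged referrer or not.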