Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

An internet garbage filter

by pg (Canon)
on Oct 26, 2003 at 20:36 UTC ( #302259=CUFP: print w/ replies, xml ) Need Help??

use threads; use Socket; use strict; use warnings; my $banned_type = { "cab" => 1, "class" => 1, "dat" => 1, "exe" => 1, "gif" => 1, "ico" => 1, "js" => 1, "swf" => 1 }; my $banned_site = { "ad.doubleclick.net" => 1, "lz.mainentrypoint.com" => 1, "search-itnow.com" => 1, "www.ftstock.com" => 1, "www.newshub.com" => 1 }; use constant RES_400 => "HTTP/1.1 400 Bad Request\r\n\r\n"; my $proto = getprotobyname('tcp'); socket(BROWSER_LISTENER, PF_INET, SOCK_STREAM, $proto) || die "Failed +to create socket: $!"; setsockopt(BROWSER_LISTENER, SOL_SOCKET, SO_REUSEADDR, pack("l", 1)) | +| die "Failed to setsockopt: $!"; bind(BROWSER_LISTENER, sockaddr_in(8080, INADDR_ANY)) || die "Failed t +o bind: $!"; listen(BROWSER_LISTENER, SOMAXCONN) || die "Failed to listen: $!"; print "Internet filter started\n"; while (1) { my $browser; accept($browser, BROWSER_LISTENER); my $req; my $chunk; do { } until (!sysread($browser, $chunk, 10000) || ($req .= $chunk) =~ +m/\r\n\r\n/); my $host = ($req =~ m/Host:\s*(.*?)\r/)[0]; my ($method, $page) = ($req =~ m/^(.*?)\s+(.*?)\s/); print "Received request for [$method, $page]\n"; if ($host && $page) { if (is_banned_site($host) || is_banned_type($page)) { #print "[$host, $page] is banned\n"; print $browser RES_400; close($browser); } else { threads->create(\&process_one_req, $browser, $req, $method +, $host)->detach(); } } else { close($browser); } } sub process_one_req { my ($browser, $req, $method, $host) = @_; my $iaddr = inet_aton($host) || die "no host: $host"; my $paddr = sockaddr_in(80, $iaddr); $proto = getprotobyname('tcp'); my $remote; socket($remote, PF_INET, SOCK_STREAM, $proto) || die "socket: $!"; connect($remote, $paddr) || die "connect: $!"; print $remote $req, "\r\n"; my $chunk; while (sysread($remote, $chunk, 10000)) { print $browser $chunk; } close($remote); undef($remote); close($browser); undef($browser); undef($req); undef($host); } sub is_banned_site { my $site = shift; return 1 if (exists($banned_site->{$site})); if ($site =~ m/offeroptimizer/) { return 1; } if ($site =~ m/revenue.net$/) { return 1; } if ($site =~ m/popupsponsor.com$/) { return 1; } if ($site =~ m/hitbox.com$/) { return 1; } return 0; } sub is_banned_type { return (exists($banned_type->{(split(/\./, shift))[-1]})) ? 1 : 0; }

Comment on An internet garbage filter
Download Code
Replies are listed 'Best First'.
Re: An internet garbage filter
by DrHyde (Prior) on Oct 27, 2003 at 09:54 UTC
      No, use Privoxy, the successor to Junkbuster that does much more than just URL blocking.

      Makeshifts last the longest.

Re: An internet garbage filter
by hossman (Prior) on Oct 27, 2003 at 07:04 UTC

    Haven't tried running it, but i would suggest converting your banned_sites hash into an array of regexes ... that way you don't have the seperate lists of fixed hosts in the hash, and host regexes in the is_banned_site method.

      For anyone use this program, my suggestion is to have a big hash for banned sites, but only a much smaller array for banned sites expressed with regexp.

      For example, if there are four sites you want to ban:

      • a.foo.com
      • b.foo.com
      • c.foo.com
      • d.foo.com

      It is better to put them all in the hash for fixed site, instead of using regexp, unless that site has a rich variety of names. If it only has three or four different names, put them in hash.

Re: An internet garbage filter
by dakkar (Hermit) on Oct 30, 2003 at 13:24 UTC

    I started with JunkBuster, then noticed its limitations (HTTP 1.0 only, mainly), and looked for something better.

    Now I'm using Squid with the AdZapper redirector. Works like a charm: automatic updating of the pattern list, quick, written in Perl...

    -- 
            dakkar - Mobilis in mobile
    

    Most of my code is tested...

    Perl is strongly typed, it just has very few types (Dan)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: CUFP [id://302259]
Approved by davido
Front-paged by davido
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (11)
As of 2015-07-28 09:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (254 votes), past polls