Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

An internet garbage filter

by pg (Canon)
on Oct 26, 2003 at 20:36 UTC ( [id://302259]=CUFP: print w/replies, xml ) Need Help??

use threads; use Socket; use strict; use warnings; my $banned_type = { "cab" => 1, "class" => 1, "dat" => 1, "exe" => 1, "gif" => 1, "ico" => 1, "js" => 1, "swf" => 1 }; my $banned_site = { "ad.doubleclick.net" => 1, "lz.mainentrypoint.com" => 1, "search-itnow.com" => 1, "www.ftstock.com" => 1, "www.newshub.com" => 1 }; use constant RES_400 => "HTTP/1.1 400 Bad Request\r\n\r\n"; my $proto = getprotobyname('tcp'); socket(BROWSER_LISTENER, PF_INET, SOCK_STREAM, $proto) || die "Failed +to create socket: $!"; setsockopt(BROWSER_LISTENER, SOL_SOCKET, SO_REUSEADDR, pack("l", 1)) | +| die "Failed to setsockopt: $!"; bind(BROWSER_LISTENER, sockaddr_in(8080, INADDR_ANY)) || die "Failed t +o bind: $!"; listen(BROWSER_LISTENER, SOMAXCONN) || die "Failed to listen: $!"; print "Internet filter started\n"; while (1) { my $browser; accept($browser, BROWSER_LISTENER); my $req; my $chunk; do { } until (!sysread($browser, $chunk, 10000) || ($req .= $chunk) =~ +m/\r\n\r\n/); my $host = ($req =~ m/Host:\s*(.*?)\r/)[0]; my ($method, $page) = ($req =~ m/^(.*?)\s+(.*?)\s/); print "Received request for [$method, $page]\n"; if ($host && $page) { if (is_banned_site($host) || is_banned_type($page)) { #print "[$host, $page] is banned\n"; print $browser RES_400; close($browser); } else { threads->create(\&process_one_req, $browser, $req, $method +, $host)->detach(); } } else { close($browser); } } sub process_one_req { my ($browser, $req, $method, $host) = @_; my $iaddr = inet_aton($host) || die "no host: $host"; my $paddr = sockaddr_in(80, $iaddr); $proto = getprotobyname('tcp'); my $remote; socket($remote, PF_INET, SOCK_STREAM, $proto) || die "socket: $!"; connect($remote, $paddr) || die "connect: $!"; print $remote $req, "\r\n"; my $chunk; while (sysread($remote, $chunk, 10000)) { print $browser $chunk; } close($remote); undef($remote); close($browser); undef($browser); undef($req); undef($host); } sub is_banned_site { my $site = shift; return 1 if (exists($banned_site->{$site})); if ($site =~ m/offeroptimizer/) { return 1; } if ($site =~ m/revenue.net$/) { return 1; } if ($site =~ m/popupsponsor.com$/) { return 1; } if ($site =~ m/hitbox.com$/) { return 1; } return 0; } sub is_banned_type { return (exists($banned_type->{(split(/\./, shift))[-1]})) ? 1 : 0; }

Replies are listed 'Best First'.
Re: An internet garbage filter
by DrHyde (Prior) on Oct 27, 2003 at 09:54 UTC
      No, use Privoxy, the successor to Junkbuster that does much more than just URL blocking.

      Makeshifts last the longest.

Re: An internet garbage filter
by hossman (Prior) on Oct 27, 2003 at 07:04 UTC

    Haven't tried running it, but i would suggest converting your banned_sites hash into an array of regexes ... that way you don't have the seperate lists of fixed hosts in the hash, and host regexes in the is_banned_site method.

      For anyone use this program, my suggestion is to have a big hash for banned sites, but only a much smaller array for banned sites expressed with regexp.

      For example, if there are four sites you want to ban:

      • a.foo.com
      • b.foo.com
      • c.foo.com
      • d.foo.com

      It is better to put them all in the hash for fixed site, instead of using regexp, unless that site has a rich variety of names. If it only has three or four different names, put them in hash.

Re: An internet garbage filter
by dakkar (Hermit) on Oct 30, 2003 at 13:24 UTC

    I started with JunkBuster, then noticed its limitations (HTTP 1.0 only, mainly), and looked for something better.

    Now I'm using Squid with the AdZapper redirector. Works like a charm: automatic updating of the pattern list, quick, written in Perl...

    -- 
            dakkar - Mobilis in mobile
    

    Most of my code is tested...

    Perl is strongly typed, it just has very few types (Dan)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://302259]
Approved by davido
Front-paged by davido
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (5)
As of 2024-04-19 07:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found