norobotlog

by quartertone (Initiate)
on Sep 06, 2004 at 01:42 UTC ( [id://388697] )
Category: Utility Scripts/Text Processing/Miscellaneous
Author/Contact Info Gary C. Wang (gary at quartertone.net)
www.quartertone.net
Description: I always look at my Apache server log files from the command line. It always bothered me to see "GET /robots.txt" contaminating the logs, and it was frustrating trying to visually determine which requests came from crawlers and which from actual users. So I wrote this little utility, which filters out requests made from IP addresses that grab "robots.txt". I suspect there are GUI log parsers that provide the same functionality, but 1) I don't need something that heavy, 2) I like to code, 3) imageekwhaddyawant.
#!/usr/bin/perl
use strict;
use warnings;
# Apache logs robots filter-outer
# Author: Gary C. Wang
# Contact: gary@quartertone.net
# Website: www.quartertone.net
# Filename: norobotlog
#
# Usage: norobotlog [logfile_name]
#
# This script parses Apache log files and 
#   filters out entries from IP addresses 
#   that request "robots.txt" file, commonly
#   associated with webcrawlers and site indexers.
# Prior to usage, check the regexp to make sure it matches your log format.
# My log format is something like:
#   192.168.0.xx - - [11/Jul/2004:22:25:22 -0400] "GET /robots.txt HTTP/1.0" 200 78

my %robots;
my $ip_ptn = '((\d{1,3}\.){3}\d{1,3})'; # this regexp matches dotted-quad IP addresses
my @file = <>; # slurp the log from named file(s) or STDIN

# First, find out which IPs are associated with crawlers
foreach (@file) {
    # ----- Adjust this pattern to match your log file -----
    $robots{$1}++ if m/^$ip_ptn .+?robots\.txt/;
}

# Then weed those out, printing only lines from IPs that never requested robots.txt
foreach (@file) {
    if (m/^$ip_ptn /) {  # anchored, so $1 is the leading client IP
        print if ! defined $robots{$1};
    }
}
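
Slurping the whole log into @file keeps the script short, but it holds the entire file in memory, which can hurt on large Apache logs. A minimal two-pass sketch that reads a named file twice instead (an alternative, not the posted script; it gives up STDIN support, since a pipe cannot be rewound):

#!/usr/bin/perl
use strict;
use warnings;

my $ip_ptn = '((\d{1,3}\.){3}\d{1,3})';
my %robots;
my $log = shift or die "Usage: norobotlog logfile_name\n";

# Pass 1: collect IPs that fetched robots.txt
open my $fh, '<', $log or die "Cannot open $log: $!\n";
while (<$fh>) {
    $robots{$1}++ if m/^$ip_ptn .+?robots\.txt/;
}

# Pass 2: rewind, then print lines from IPs not seen in pass 1
seek $fh, 0, 0 or die "Cannot rewind $log: $!\n";
while (<$fh>) {
    print if m/^$ip_ptn / and ! defined $robots{$1};
}
close $fh;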
Replies are listed 'Best First'.
Re: norobotlog
by sintadil (Pilgrim) on Sep 11, 2004 at 13:46 UTC

    It may be a good idea to include other bot patterns, like the Googlebot and other search engine bots. Otherwise, this can be simplified to an egrep command, which is what I'd use anyway.
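
    One way to sketch that, reusing $ip_ptn, %robots, and @file from the script above. This assumes the combined log format (which records the User-Agent), and the bot names below are illustrative, not a complete list:

    my $bot_ua = qr/Googlebot|Slurp|msnbot/i;  # illustrative crawler UAs
    foreach (@file) {
        next unless m/^$ip_ptn /;  # capture the client IP
        my $ip = $1;
        # Flag the IP if it fetched robots.txt or identifies as a crawler
        $robots{$ip}++ if m/robots\.txt/ or m/$bot_ua/;
    }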
