PerlMonks  

Resolve addresses in web access logs

by ZZamboni (Curate)
on Apr 29, 2000 at 01:20 UTC (#9650, sourcecode)

Category: WWW scripts
Author/Contact Info Diego Zamboni
Description: Where I work, Apache is configured not to resolve IP addresses into host names in its access logs. So that I can process the logs for my pages with a log-processing program (I use webalizer), I wrote the following script to resolve the addresses. It reads log lines from the files named on the command line (or from standard input) and prints them with the leading IP address replaced by its host name wherever the reverse lookup succeeds. Note that $localdomain must be changed to your own domain: it is appended when the resolved name is local (machine name only, without a domain). This sometimes happens when, for example, the abbreviated name comes before the full name in /etc/hosts.

Updated: as suggested by kudra, added a comment to the code about double-checking the name obtained, and why we don't do it in this case.

#!/usr/local/perl/bin/perl -w
#
# Resolve IP addresses in web logs.
# Diego Zamboni, Feb 7, 2000

use Socket;

# Local domain name, appended to resolved names that contain no dot
$localdomain=".your.local.domain";

while (<>) {
  @f=split;
  if ($f[0] =~ /^[\d.]+$/) {
    if ($cache{$f[0]}) {
      $f[0]=$cache{$f[0]};
    }
    else {
      $addr=inet_aton($f[0]);
      if ($addr) {
        $name=gethostbyaddr($addr, AF_INET);
        if ($name) {
          # NOTE: To ensure the veracity of $name, we really
          # would need to do a gethostbyname on it and compare
          # the result with the original $f[0], to prevent
          # someone spoofing us with false DNS information.
          # See the comments below. For this application,
          # we don't care too much, so we don't do this.
          # Fix local names
          if ($name !~ /\./) {
            $name.=$localdomain;
          }
          $cache{$f[0]}=$name;
          $f[0]=$name;
        }
      }
    }
    print join(" ", @f)."\n";
  }
  else {
    print $_;
  }
}
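
A side note on the cache in the loop above: only successful lookups are stored in %cache, so an address with no reverse entry is looked up again on every line where it appears. The following is a minimal sketch of caching both outcomes, not a change the author made. The names cached_lookup and default_resolver, and the optional $resolver argument (there so the lookup can be swapped out when experimenting), are hypothetical. It is IPv4-only, like the original.

```perl
use strict;
use warnings;
use Socket qw(inet_aton AF_INET);

# Hypothetical helpers, not part of the original script.
my %cache;    # dotted-quad string => resolved name, or undef if none was found

# Real reverse lookup: returns the host name, or undef on failure.
sub default_resolver {
    my ($ip) = @_;
    my $addr = inet_aton($ip) or return;
    return scalar gethostbyaddr($addr, AF_INET);
}

# Look up $ip at most once. $resolver is an optional code ref that maps
# an address string to a name (or undef); it defaults to a real lookup.
sub cached_lookup {
    my ($ip, $resolver) = @_;
    $resolver ||= \&default_resolver;

    # exists() separates "never looked up" from "looked up, no name".
    $cache{$ip} = $resolver->($ip) unless exists $cache{$ip};

    # Fall back to the original address when no name is known.
    return defined $cache{$ip} ? $cache{$ip} : $ip;
}
```

In the main loop this would replace the cache test and the lookup with something like $f[0] = cached_lookup($f[0]); the local-domain fix would still have to be applied to the result.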

RE: Resolve addresses in web access logs (risk of gethostbyaddr)
by kudra (Vicar) on May 10, 2000 at 15:27 UTC
    Maybe I'm wrong about this, but it looks like you don't verify the name returned by gethostbyaddr. You probably don't need to if it's just for web statistics, but if you, like me, are in the habit of looking back over old code to remember how to do something, it might be a good idea to put that in, or at least put in a comment about it, in case you need a more certain resolution of the IP in the future.

    There's a discussion of this in Perl Cookbook, section 17.7 ('Identifying the Other End of a Socket'). It basically says that because a name lookup goes to the name owner's DNS server, there's the possibility that the machine could give false information. Calling gethostbyname on the returned name and comparing the answer with the original IP checks that. It also mentions that it's still not 100% secure.

    I wish I'd checked the code catacombs yesterday before I wrote my own version of this for exactly the same purpose. Bleh. Bad Kudra.

      Maybe you are wrong, or maybe I am, but:
      gethostbyaddr returns the names matching the IP. Reverse name entries are as secure as DNS gets. If the IP has a reverse name, the IP for that name will match the IP.
      The discussion in the Cookbook is about checking whether the IP address you got when looking up by name matches the original name, which it will not unless you have a reverse entry for the same name.
        I don't think there's any looking up by name: in the example the IP was grabbed with getpeername and the name isn't known. If you were to get the IP with gethostbyname and then use gethostbyaddr on the result, you would be verifying it as they suggest, just in reverse.

        Quoting extensively from the Cookbook:
        "...If you want the name of the remote end, call gethostbyaddr to look up the name of the machine in the DNS tables, right?

        "Not really. That's only half the solution. Because a name lookup goes to the name's owner's DNS server and a lookup of an IP address goes to the address's owner's DNS server, you have to contend with the possibility that the machine that connected to you is giving incorrect names. For instance, the machine evil.crackers.org could belong to malevolent cyberpirates who tell their DNS server that its IP address (1.2.3.4) should be identified as trusted.dod.gov. If your program trusts trusted.dod.gov, a connection from evil.crackers.org will cause getpeername to return the right IP address (1.2.3.4), but gethostbyaddr will return the duplicitous name (my italics).

        "To avoid this problem, we take the (possibly deceitful) name returned by gethostbyaddr and look it up again with gethostbyname..."

        I'm just repeating, but it looks to me as if this is talking about gethostbyaddr having the potential to give incorrect information.

      You are correct. Reverse DNS can easily give wrong information (if the bad guy controls his DNS server, he also controls the reverse table). I know about this, but I don't care too much about it for web access statistics.

      --ZZamboni

        I wouldn't care either with web stats. I was just suggesting that as an example piece of code you might want to add a comment about that problem/feature so that if someone adapted it for an application which did require checking, s/he would know about it.
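
For anyone adapting the script to a case where the name does matter, the double check discussed in this thread (reverse lookup, then a forward lookup of the returned name, then a comparison with the original address) could be sketched roughly as below. This is an illustration under assumptions, not the Cookbook's code: verified_name, reverse_lookup and forward_lookup are hypothetical names, the optional code-ref arguments exist only so the lookups can be replaced when experimenting, the address is assumed to be a canonical IPv4 dotted quad, and, as noted above, even this check is not 100% secure.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa AF_INET);

# Hypothetical helpers, not part of the original script.

# Reverse lookup: dotted-quad string -> host name, or undef.
sub reverse_lookup {
    my ($ip) = @_;
    my $addr = inet_aton($ip) or return undef;
    return scalar gethostbyaddr($addr, AF_INET);
}

# Forward lookup: host name -> list of dotted-quad strings (may be empty).
sub forward_lookup {
    my ($name) = @_;
    # In list context gethostbyname returns
    # (name, aliases, addrtype, length, packed addresses...).
    my @info = gethostbyname($name) or return;
    return map { inet_ntoa($_) } @info[4 .. $#info];
}

# Return the name for $ip only if a forward lookup of that name
# yields $ip again; otherwise return undef.
sub verified_name {
    my ($ip, $reverse, $forward) = @_;
    $reverse ||= \&reverse_lookup;
    $forward ||= \&forward_lookup;

    my $name = $reverse->($ip);
    return undef unless defined $name;

    for my $candidate ($forward->($name)) {
        return $name if $candidate eq $ip;
    }
    return undef;    # the name does not map back to the address
}
```

A caller would then keep the raw address whenever verified_name returns undef, at the cost of one extra lookup per new name.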
