#! /usr/bin/perl -w
#
# david landgren 14-may-2001

use strict;

my $first  = shift or die "No first (current) file specified on command line.\n";
my $second = shift or die "No second (previous) file specified on command line.\n";

# Index the previous (older) snapshot by FQDN, which is the last field
# on each line. The rank is the first field.
my %site;

open IN, $second or die "Cannot open $second for input: $!\n";
while( <IN> ) {
    chomp;
    my @fields = split;
    $site{ $fields[-1] } = \@fields;
}
close IN;

# Walk the current (newer) snapshot and report, for each site, its
# current rank, its previous rank and the change between the two.
open IN, $first or die "Cannot open $first for input: $!\n";
while( <IN> ) {
    chomp;
    my ($rank, @fields) = split;
    local $" = "\t";    # join interpolated arrays with tabs
    if( defined $site{$fields[-1]} ) {
        my $prev = $site{ $fields[-1] }->[0];
        my $diff = $prev - $rank;    # positive when the site has moved up
        my $desc = 0 == $diff ? '=' : $diff < 0 ? $diff : "+$diff";
        print "$rank\t$prev\t$desc\t@fields\n";
    }
    else {
        # the site does not appear in the previous snapshot
        print "$rank\t-\tnew\t@fields\n";
    }
}
close IN;

=head1 NAME

topwebdiff -- analyse the output of successive runs of topweb

=head1 SYNOPSIS

B<topwebdiff> filespec.recent filespec.older

=head1 DESCRIPTION

Take the output of two runs of topweb, and create a report that shows how
sites have evolved between the two snapshots. This helps pinpoint sites
that suddenly suck up a dramatic amount of bandwidth.

=head1 EXAMPLES

C<topwebdiff tw.yyyymmd1 tw.yyyymmd2>

The output is equivalent to the output of C<topweb> with the addition of
two columns in the second and third place:

=over 4

=item *

rank 2 -- the rank of the same FQDN from the file tw.yyyymmd2, or '-' if
the FQDN does not appear in the second file.

=item *

delta -- the change in rank from the second file (the older snapshot) in
comparison with the first file (the newer snapshot): '=' if the rank is
unchanged, or 'new' if the FQDN does not appear in the second file.

=back

An excerpt of the output from a sample data set is as follows. In this
example we see a site has jumped from 55th most visited site (in terms of
bytes transferred) to 27th.

  20  21   +1   5671  29919621  0.483%  25.064%  www.voyages-sncf.com
  21  20   -1   3532  27930698  0.451%  25.514%  www.jobpilot.fr
  22  24   +2  11842  27849740  0.449%  25.964%  www.societe.com
  23  22   -1   1807  25851714  0.417%  26.381%  pub21.ezboard.com
  24  23   -1   4560  24280781  0.392%  26.773%  www.google.fr
  25  26   +1   5326  24055482  0.388%  27.161%  www.wanadoo.fr
  26  27   +1   3075  23879164  0.385%  27.546%  perso.wanadoo.fr
  27  55  +28   3943  199970
  28  30   +2   2313  19803044  0.320%  28.188%  webmail.libertysurf.fr
  29  25   -4   1446  19699499  0.318%  28.506%  www.geocities.com
  30  28   -2    998  19288520  0.311%  28.817%  lw10fd.law10.hotmail.msn.com

Just how important this jump is has to be weighed against the number of
files used in generating the snapshot. In this instance, Squid is
configured to roll its logs over every 24 hours, and 10 logs are kept.
This means that the output from topweb (if run on all log files) will be
a rolling 10-day average.

=head1 COPYRIGHT

Copyright (c) 2001 David Landgren.

This script is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=head1 AUTHOR

David "grinder" Landgren

grinder on perlmonks (http://www.perlmonks.org/)

eval {join chr(64) => qw[landgren bpinet.com]}

=cut