topwebdiff - analyse the output of topweb

Category:	web stuff
Author/Contact Info:	grinder on perlmonks
Description:	To make the best use of topweb snapshots, generate a snapshot file each day, then run topwebdiff on two of them to pinpoint the changes in ranking. See also topweb - Squid access.log analyser.
#! /usr/bin/perl -w
#
# david landgren 14-may-2001

use strict;

my $first  = shift or die "No first (current) file specified on command line.\n";
my $second = shift or die "No second (previous) file specified on command line.\n";

my %site;

# Index the previous (older) snapshot by FQDN, the last field on each line.
open my $prev_fh, '<', $second or die "Cannot open $second for input: $!\n";
while( <$prev_fh> ) {
    chomp;
    my @fields = split;
    next unless @fields;    # skip blank lines
    $site{ $fields[-1] } = \@fields;
}
close $prev_fh;

# Walk the current (recent) snapshot and report the rank change of each site.
open my $cur_fh, '<', $first or die "Cannot open $first for input: $!\n";
while( <$cur_fh> ) {
    chomp;
    my ($rank, @fields) = split;
    next unless @fields;    # skip blank lines
    local $" = "\t";        # join interpolated @fields with tabs
    if( defined $site{$fields[-1]} ) {
        my $prev = $site{ $fields[-1] }->[0];
        my $diff = $prev - $rank;   # positive: the site has moved up
        my $desc = 0 == $diff ? '=' : $diff < 0 ? $diff : "+$diff";
        print "$rank\t$prev\t$desc\t@fields\n";
    }
    else {
        print "$rank\t-\tnew\t@fields\n";
    }
}
close $cur_fh;

=head1 NAME

topwebdiff -- analyse the output of successive runs of topweb

=head1 SYNOPSIS

B<topwebdiff> filespec.recent filespec.older

=head1 DESCRIPTION

Take the output of two runs of topweb, and create a report that shows
how sites have evolved between the two snapshots. This helps pinpoint
sites that suddenly suck up a dramatic amount of bandwidth.

=head1 EXAMPLES

C<topwebdiff tw.yyyymmd1 tw.yyyymmd2>

The output is equivalent to the output of C<topweb tw.yyyymmd1> with
the addition of two columns in the second and third place:

=over 4

=item *

rank 2 -- the rank of the same FQDN from the file tw.yyyymmd2, or '-'
if the FQDN does not appear in the second file.

=item *

delta -- the change in rank from the second file (the older snapshot)
in comparison with the first file (the newer snapshot): '=' if the rank
is unchanged, or 'new' if the FQDN does not appear in the second file.

=back

An excerpt of the output from a sample data set is as follows. In this
example we see a site has jumped from 55th most visited site (in terms
of bytes transferred) to 27th.

  20  21   +1   5671  29919621  0.483%  25.064%  www.voyages-sncf.com
  21  20   -1   3532  27930698  0.451%  25.514%  www.jobpilot.fr
  22  24   +2  11842  27849740  0.449%  25.964%  www.societe.com
  23  22   -1   1807  25851714  0.417%  26.381%  pub21.ezboard.com
  24  23   -1   4560  24280781  0.392%  26.773%  www.google.fr
  25  26   +1   5326  24055482  0.388%  27.161%  www.wanadoo.fr
  26  27   +1   3075  23879164  0.385%  27.546%  perso.wanadoo.fr
  27  55  +28   3943  199970
  28  30   +2   2313  19803044  0.320%  28.188%  webmail.libertysurf.fr
  29  25   -4   1446  19699499  0.318%  28.506%  www.geocities.com
  30  28   -2    998  19288520  0.311%  28.817%  lw10fd.law10.hotmail.msn.com

Just how important this jump is has to be weighed against the number of
files used in generating the snapshot. In this instance, Squid is
configured to roll its logs over every 24 hours, and 10 logs are kept.
This means that the output from topweb (if run on all log files) will
be a rolling 10-day average.

=head1 COPYRIGHT

Copyright (c) 2001 David Landgren.

This script is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=head1 AUTHOR

David "grinder" Landgren

grinder on perlmonks (http://www.perlmonks.org/)

eval {join chr(64) => qw[landgren bpinet.com]}

=cut
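For readers who want to try the rank comparison without any topweb files at hand, here is a minimal sketch of the same idea on in-memory data. The helper name rank_changes and the sample hostnames and figures are made up for illustration and are not part of topwebdiff; the sketch only assumes, as the script does, that the rank is the first field of a line and the FQDN is the last.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of topwebdiff): compare two snapshots held
# as array refs of "rank ... fqdn" lines and return the report rows.
sub rank_changes {
    my ($recent, $older) = @_;

    # Index the older snapshot: FQDN (last field) => rank (first field).
    my %old_rank;
    for my $line (@$older) {
        my @f = split ' ', $line;
        next unless @f;
        $old_rank{ $f[-1] } = $f[0];
    }

    # Walk the recent snapshot and describe each site's change in rank.
    my @rows;
    for my $line (@$recent) {
        my ($rank, @rest) = split ' ', $line;
        next unless @rest;
        my $prev = $old_rank{ $rest[-1] };
        if (defined $prev) {
            my $diff = $prev - $rank;    # positive: the site has moved up
            my $desc = $diff == 0 ? '=' : $diff < 0 ? $diff : "+$diff";
            push @rows, join "\t", $rank, $prev, $desc, $rest[-1];
        }
        else {
            push @rows, join "\t", $rank, '-', 'new', $rest[-1];
        }
    }
    return @rows;
}

# Made-up sample data, for illustration only.
my @older  = ('1 900 a.example', '2 800 b.example', '3 700 c.example');
my @recent = ('1 950 b.example', '2 900 a.example', '3 100 d.example');

print "$_\n" for rank_changes(\@recent, \@older);
```

With this sample data, b.example is reported as having moved up one place (+1), a.example as having dropped one place (-1), and d.example as new. A site such as c.example that only appears in the older snapshot is not reported at all, since the report follows the recent snapshot line by line.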
