<?xml version="1.0" encoding="windows-1252"?>
<node id="112378" title="topwebdiff - analyse the output of topweb" created="2001-09-14 08:26:23" updated="2005-08-11 08:13:10">
<type id="1748">
sourcecode</type>
<author id="29008">
grinder</author>
<data>
<field name="doctext">
&lt;code&gt;#! /usr/bin/perl -w
#
# david landgren  14-may-2001

use strict;

my $first  = shift or die "No first (current) file specified on command line.\n";
my $second = shift or die "No second (previous) file specified on command line.\n";

my %site;

# Index the older snapshot by site name (the last field on each line).
open my $old, '&lt;', $second or die "Cannot open $second for input: $!\n";
while( &lt;$old&gt; ) {
        chomp;
        my @fields = split;
        next unless @fields;    # skip blank lines
        $site{ $fields[-1] } = \@fields;
}
close $old;

# Walk the recent snapshot, reporting each site's previous rank and movement.
open my $new, '&lt;', $first or die "Cannot open $first for input: $!\n";
local $" = "\t";
while( &lt;$new&gt; ) {
        chomp;
        my ($rank, @fields) = split;
        next unless @fields;    # skip blank lines
        if( defined $site{$fields[-1]} ) {
                my $prev = $site{ $fields[-1] }-&gt;[0];
                my $diff = $prev - $rank;
                my $desc = 0 == $diff ? '=' : $diff &lt; 0 ? $diff : "+$diff";
                print "$rank\t$prev\t$desc\t@fields\n";
        }
        else {
                print "$rank\t-\tnew\t@fields\n";
        }
}
close $new;

=head1 NAME

topwebdiff -- analyse the output of successive runs of topweb

=head1 SYNOPSIS

B&lt;topwebdiff&gt; filespec.recent filespec.older

=head1 DESCRIPTION

Take the output of two runs of topweb, and create a report that shows how
sites have evolved between the two snapshots. This helps pinpoint sites
that suddenly suck up a dramatic amount of bandwidth.

=head1 EXAMPLES

C&lt;topwebdiff tw.yyyymmd1 tw.yyyymmd2&gt;

The output is equivalent to the output of C&lt;topweb tw.yyyymmd1&gt; with the
addition of two columns in the second and third place:

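=over 4

=item *

rank 2 -- the rank of the same FQDN in the file tw.yyyymmd2, or '-' if
the FQDN does not appear in the second file.

=item *

delta -- the change in rank between the second file (the older snapshot)
and the first file (the newer snapshot): '+n' if the site has climbed n
places, '-n' if it has dropped n places, '=' if its rank is unchanged,
and 'new' if the site does not appear in the second file.

=back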

An excerpt of the output from a sample data set is as follows. In this
example we see a site has jumped from 55th most visited site (in terms of
bytes transferred) to 27th.

 20 21 +1  5671  29919621  0.483%  25.064% www.voyages-sncf.com
 21 20 -1  3532  27930698  0.451%  25.514% www.jobpilot.fr
 22 24 +2  11842 27849740  0.449%  25.964% www.societe.com
 23 22 -1  1807  25851714  0.417%  26.381% pub21.ezboard.com
 24 23 -1  4560  24280781  0.392%  26.773% www.google.fr
 25 26 +1  5326  24055482  0.388%  27.161% www.wanadoo.fr
 26 27 +1  3075  23879164  0.385%  27.546% perso.wanadoo.fr
 27 55 +28 3943  199970
 28 30 +2  2313  19803044  0.320%  28.188% webmail.libertysurf.fr
 29 25 -4  1446  19699499  0.318%  28.506% www.geocities.com
 30 28 -2  998   19288520  0.311%  28.817% lw10fd.law10.hotmail.msn.com

Just how important this jump is has to be weighed against the number of log files used
in generating the snapshot. In this instance, Squid is configured to roll its
logs over every 24 hours, and 10 logs are kept. This means that the output from
topweb (if run on all log files) will be a rolling 10-day average.

=head1 COPYRIGHT

Copyright (c) 2001 David Landgren.

This script is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=head1 AUTHOR

     David "grinder" Landgren
     grinder on perlmonks (http://www.perlmonks.org/)
     eval {join chr(64) =&gt; qw[landgren bpinet.com]}

=cut
&lt;/code&gt;</field>
<field name="codedescription">
To make the best use of topweb snapshots, generate a file each day, then run topwebdiff on successive files to pinpoint the ranking changes.
&lt;br&gt;&lt;br&gt;See also [id://112377].</field>
<field name="codecategory">
web stuff</field>
<field name="codeauthor">
grinder on perlmonks</field>
</data>
</node>
