topwebdiff - analyse the output of topweb

by grinder (Bishop)
on Sep 14, 2001 at 12:26 UTC (#112378, sourcecode)

Category: web stuff
Author/Contact Info grinder on perlmonks
Description: To make the best use of topweb snapshots, the idea is to generate the files day by day, and then run topwebdiff to pinpoint the ranking changes.

See also topweb - Squid access.log analyser.
#! /usr/bin/perl -w
#
# david landgren  14-may-2001

use strict;

my $first  = shift or die "No first (current) file specified on command line.\n";
my $second = shift or die "No second (previous) file specified on command line.\n";

my %site;

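# Index the previous (older) snapshot by site name, the last field on each line.
open my $old, '<', $second or die "Cannot open $second for input: $!\n";
while( <$old> ) {
        chomp;
        my @fields = split;
        next unless @fields;    # skip blank lines
        $site{ $fields[-1] } = \@fields;
}
close $old;

# Walk the current (newer) snapshot, adding each site's previous rank and movement.
open my $new, '<', $first or die "Cannot open $first for input: $!\n";
while( <$new> ) {
        chomp;
        my ($rank, @fields) = split;
        next unless @fields;    # skip blank lines
        local $" = "\t";
        if( defined $site{$fields[-1]} ) {
                my $prev = $site{ $fields[-1] }->[0];
                my $diff = $prev - $rank;    # positive when the site has moved up
                my $desc = 0 == $diff ? '=' : $diff < 0 ? $diff : "+$diff";
                print "$rank\t$prev\t$desc\t@fields\n";
        }
        else {
                print "$rank\t-\tnew\t@fields\n";
        }
}
close $new;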

=head1 NAME

topwebdiff -- analyse the output of successive runs of topweb

=head1 SYNOPSIS

B<topwebdiff> filespec.recent filespec.older

=head1 DESCRIPTION

Take the output of two runs of topweb, and create a report that shows how
sites have evolved between the two snapshots. This helps pinpoint sites
that suddenly suck up a dramatic amount of bandwidth.

=head1 EXAMPLES

C<topwebdiff tw.yyyymmd1 tw.yyyymmd2>

The output is equivalent to the output of C<topweb tw.yyyymmd1> with the
addition of two columns in the second and third place:

=over 4

=item *

rank 2 -- the rank of the same FQDN from the file tw.yyyymmd2, or '-' if
the FQDN does not appear in the second file.

=item *

delta -- the change in rank from the second file (the older snapshot) in
comparison with the first file (the newer snapshot): '+n' if the site has
moved up n places, '-n' if it has moved down n places, '=' if its rank is
unchanged, and 'new' if it does not appear in the second file.

=back

An excerpt of the output from a sample data set is as follows. In this
example we see a site has jumped from 55th most visited site (in terms of
bytes transferred) to 27th.

 20 21 +1   5671   29919621  0.483%  25.064% www.voyages-sncf.com
 21 20 -1   3532   27930698  0.451%  25.514% www.jobpilot.fr
 22 24 +2   11842  27849740  0.449%  25.964% www.societe.com
 23 22 -1   1807   25851714  0.417%  26.381% pub21.ezboard.com
 24 23 -1   4560   24280781  0.392%  26.773% www.google.fr
 25 26 +1   5326   24055482  0.388%  27.161% www.wanadoo.fr
 26 27 +1   3075   23879164  0.385%  27.546% perso.wanadoo.fr
 27 55 +28  3943   199970
 28 30 +2   2313   19803044  0.320%  28.188% webmail.libertysurf.fr
 29 25 -4   1446   19699499  0.318%  28.506% www.geocities.com
 30 28 -2   998    19288520  0.311%  28.817% lw10fd.law10.hotmail.msn.com

Just how important this jump is has to be weighed against the number of log
files used in generating the snapshot. In this instance, Squid is configured to
roll its logs over every 24 hours, and 10 logs are kept. This means that the
output from topweb (if run on all log files) will be a rolling 10-day average.

=head1 COPYRIGHT

Copyright (c) 2001 David Landgren.

This script is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=head1 AUTHOR

     David "grinder" Landgren
     grinder on perlmonks (http://www.perlmonks.org/)
     eval {join chr(64) => qw[landgren bpinet.com]}

=cut
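
The delta column described in the POD can be sketched in isolation. The following is a minimal illustration, not part of the script itself: the helper name rank_delta, the %older and %newer hashes, and the example.com host names are all made up for the example, and it assumes the first column of each snapshot is the rank and that a positive delta means the site has moved up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of topwebdiff): describe the change in rank
# between an older and a newer snapshot in the style of the delta column.
sub rank_delta {
    my ($new_rank, $old_rank) = @_;
    return 'new' unless defined $old_rank;    # site absent from the older snapshot
    my $diff = $old_rank - $new_rank;         # positive: the site has moved up
    return $diff == 0 ? '='
         : $diff <  0 ? "$diff"
         :              "+$diff";
}

# Made-up ranks keyed by FQDN; the host names are placeholders.
my %older = ('a.example.com' => 1, 'b.example.com' => 2, 'c.example.com' => 3);
my %newer = ('b.example.com' => 1, 'a.example.com' => 2,
             'c.example.com' => 3, 'd.example.com' => 4);

# Report in order of the newer ranking, as topwebdiff is documented to do.
for my $site (sort { $newer{$a} <=> $newer{$b} } keys %newer) {
    my $old = $older{$site};
    print join("\t",
        $newer{$site},
        defined($old) ? $old : '-',
        rank_delta($newer{$site}, $old),
        $site), "\n";
}
```

With this sample data, b.example.com is reported as +1, a.example.com as -1, c.example.com as = and d.example.com as new. Keeping the comparison in a small function like this makes it easy to try other conventions, for example flagging only sites whose rank changes by more than some threshold.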
