
Re^3: Help update the Phalanx 100

by MarkusLaker (Beadle)
on Dec 22, 2004 at 23:46 UTC


in reply to Re^2: Help update the Phalanx 100
in thread Help update the Phalanx 100

I knocked up a quick program after reading Andy's comments on use.perl, and before finding this thread. It uses three ideas to eliminate the noise: it ignores all clients that download more than a hundred modules in a day; it ignores certain user agents that look like spiders; and it looks at all versions of a module downloaded each day, and ignores all but the most popular version. This last check is meant to filter out clients that download every version of a module within a few minutes.

What the program doesn't do yet is to mark standard modules.
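One way that marking might be done, as a rough sketch that is separate from the script itself: look each name up in Module::CoreList. The helper `is_standard` is hypothetical, and it assumes that turning `-` into `::` in a distribution name gives the main module name, which is only a heuristic (it fails for names such as libwww-perl or CGI.pm).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Module::CoreList;

# Hypothetical helper, not part of rank-modules: guess whether a
# distribution name from the logs corresponds to a module that has
# shipped with perl.
# Assumption: replacing "-" with "::" yields the main module name.
sub is_standard {
    my ($dist) = @_;
    (my $module = $dist) =~ s/-/::/g;
    # first_release returns undef for modules that never shipped with
    # any perl release; it says nothing about the perl running this.
    return defined Module::CoreList->first_release($module) ? 1 : 0;
}

printf "%-12s %s\n", $_, is_standard($_) ? 'standard' : 'CPAN only'
    for qw(DBI Digest-MD5 Storable XML-Parser);
```

A real version would probably need a proper distribution-to-module mapping rather than the string substitution used here.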

Here are the code and results.

[~/perl/P100]$ cat rank-modules
#!/usr/bin/perl

use warnings;
use strict;

# Read CPAN module-download logs; find the most popular modules.

###

# Number of modules to list:
sub NrToPrint() {100}

# Any address that pulls more than MaxDownloadsPerDay modules in any one day
# has all its traffic ignored:
sub MaxDownloadsPerDay() {100}

# Exclude downloads from agents matching this regex, because they seem to be
# related to mirroring or crawling rather than genuine downloads:
my $rx_agent_ignore = qr/
    \. google \.            |
    \. yahoo  \.            |
#   \b LWP::Simple \b       |
    \b MS\ Search \b        |
    \b Webmin \b            |
    \b Wget \b              |
    \b teoma \b
/x;

# First pass: build a hash of all client addresses that have downloaded more
# than MaxDownloadsPerDay modules in any one day:

my %bigusers;

sub find_big_users($) {
    my $fh = $_[0];
    seek $fh, 0, 0 or die "Can't rewind the input file:\n$!\n";
    print STDERR "Finding heavy users...\n";
    my %hpd;    # hits per day: $hpd{client}{date} = number of hits

    while (<$fh>) {
        my ($client, $date) = m/ ^ ( \d+ ) \s+ ( [^:]+ ) /x or next;
        # $hpd{$client}{$date} ||= 0;
        ++ $hpd{$client}{$date};
    }

    CLIENT:
    while (my ($client, $rdatehash) = each %hpd) {
        while (my ($date, $count) = each %$rdatehash) {
            undef $bigusers{$client}, next CLIENT if $count > MaxDownloadsPerDay;
        }
    }
}

# Second pass: ignoring traffic from heavy clients and robotic user agents,
# build a hash indexed by date, module and version and yielding a count of
# downloads:

my $rx_parse = qr!
    ^
    ( \d+ )                         # Get client ID
    \s ( [^:]+ )                    # Get date
    \S+ \s                          # Skip time
    / \S+ /                         # Skip directory
    ( \w \S*? )                     # Get module name
    -                               # Skip delimiter
    ( (?: (?> \d [^.]* ) \.? )+ )   # Get version number
    \. \S+ \s                       # Skip file-type suffix
    " ( .* ) "                      # Get user agent
!x;

my $rawdownloads = 0;
my $igbig = 0;
my $igagent = 0;
my $nrlines;

sub count_downloads($) {
    my $fh = $_[0];
    seek $fh, 0, 0 or die "Can't rewind the input file:\n$!\n";
    print STDERR "Counting downloads...\n";
    my %details;

    while (<$fh>) {
        my ($client, $date, $module, $version, $agent) = /$rx_parse/o or next;
        # print;
        # print "Mod $module, ver $version\n";
        ++$rawdownloads;
        ++$igbig, next if exists $bigusers{$client};
        ++$igagent, next if $agent =~ $rx_agent_ignore;
        ++ $details{$date}{$module}{$version};
    }

    $nrlines = $.;
    \%details;
}

# Third pass: if multiple versions of the same module have been requested on the
# same day, ignore all but the most popular version for that day.  This avoids
# giving extra weight to modules with many historical versions if a client
# downloads all of them.  Produce a hash

my $filtereddownloads = 0;

sub condense_multiple_versions($) {
    my $rdetails = $_[0];
    print STDERR "Analysing...\n";
    my %grosscounts;

    while (my ($date, $rmodhash) = each %$rdetails) {
        while (my ($module, $rverhash) = each %$rmodhash) {
            my @vercounts = sort {$a <=> $b} values %$rverhash;
            $grosscounts{$module} += $vercounts[-1];
            $filtereddownloads += $vercounts[-1];
        }
    }

    \%grosscounts;
}

# Print the module counts and names in descending order of popularity:

sub print_results($) {
    print STDERR "Using $filtereddownloads out of $rawdownloads downloads on $nrlines lines.\n",
                 "Skipped $igbig from heavy users and a further $igagent apparently from robots.\n\n";
    my $rcounts = $_[0];
    my @sorted = sort {$rcounts->{$b} <=> $rcounts->{$a}} keys %$rcounts;
    print map {sprintf "%-8d%s\n", $rcounts->{$_}, $_} @sorted[0 .. NrToPrint - 1];
}

sub main() {
    die "$0 <filename>\n" unless @ARGV == 1;
    my $infile = shift @ARGV;
    open my $fh, "<$infile" or die "Can't open $infile:\n$!\n";
    find_big_users $fh;
    print_results condense_multiple_versions count_downloads $fh;
}

main;

[~/perl/P100]$ ./rank-modules cpan-gets
Finding heavy users...
Counting downloads...
Analysing...
Using 104411 out of 1067155 downloads on 2328070 lines.
Skipped 767228 downloads from heavy users and a further 177523 apparently from robots.

2745    DBI
2312    File-Scan
1703    DBD-mysql
1219    XML-Parser
1202    HTML-Parser
1034    libwww-perl
984     GD
944     Gtk-Perl
880     Net_SSLeay.pm
859     Tk
827     DBD-Oracle
793     MIME-Base64
756     URI
751     Apache-ASP
746     Compress-Zlib
654     dmake
643     HTML-Template
640     Digest-MD5
602     Time-HiRes
592     Digest-SHA1
587     Archive-Tar
584     Net-Telnet
577     Template-Toolkit
548     Parallel-Pvm
540     XML-Writer
477     Archive-Zip
467     HTML-Tagset
464     libnet
437     Digest
406     AppConfig
401     MIME-tools
385     MailTools
359     Storable
356     Date-Calc
346     Msql-Mysql-modules
339     Test-Simple
338     CGI.pm
324     Module-Build
320     Spreadsheet-WriteExcel
318     SiePerl
317     perl-ldap
316     Net-DNS
314     DB_File
312     PAR
310     CPAN
310     TermReadKey
297     XML-Simple
297     IO-String
292     TimeDate
291     GDGraph
289     MIME-Lite
287     IO-stringy
287     Crypt-SSLeay
284     Curses
282     DBD-DB2
278     calendar
278     DateManip
277     Net-SNMP
274     Zanas
271     IMAP-Admin
270     MD5
268     ssltunnel
258     sms
257     Digest-HMAC
255     GDTextUtil
252     DBD-ODBC
252     DBD-Pg
245     gmailarchiver
245     IO-Socket-SSL
240     Data-Dumper
239     Mail-Sendmail
232     IOC
225     OLE-Storage_Lite
223     keywordsearch
217     ExtUtils-MakeMaker
206     XML-SAX
205     reboot
200     chres
199     Convert-ASN1
196     App-Info
196     Event
194     CGIscriptor
189     linkcheck
187     Test-Harness
184     glynx
184     Verilog-Perl
181     XLinks
180     Bit-Vector
179     mod_perl
178     SOAP-Lite
176     Expect
174     XML-DOM
174     MARC-Detrans
174     DBD-Sybase
173     Mail-SpamAssassin
172     Excel-Template
172     check_ftp
172     Compress-Zlib-Perl
171     Parse-RecDescent
171     Carp-Clan
[~/perl/P100]$
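To make the third pass concrete, here is a toy run of the same idea on made-up data. The module names, dates and counts are invented for illustration; only the "keep the most popular version per day" rule comes from the script.

```perl
use strict;
use warnings;

# Made-up counts: $details{date}{module}{version} = downloads that day.
my %details = (
    '22/Dec/2004' => {
        'Foo-Bar' => { '0.01' => 1, '0.02' => 1, '1.00' => 7 },
        'Baz'     => { '2.3'  => 4 },
    },
    '23/Dec/2004' => {
        'Foo-Bar' => { '1.00' => 5, '0.02' => 1 },
    },
);

# For each day and module, add only the count of that day's most
# popular version to the module's running total.
my %grosscounts;
for my $date (keys %details) {
    for my $module (keys %{ $details{$date} }) {
        my ($best) = sort { $b <=> $a } values %{ $details{$date}{$module} };
        $grosscounts{$module} += $best;
    }
}

# Print in descending order of popularity, in the script's format.
printf "%-8d%s\n", $grosscounts{$_}, $_
    for sort { $grosscounts{$b} <=> $grosscounts{$a} } keys %grosscounts;
```

Foo-Bar has 15 raw downloads here but is credited with 12 (7 on the first day plus 5 on the second), because the stray requests for old versions are dropped.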

Update 23 Dec 2004:

I have:

  • removed LWP::Simple from the list of ignorable user agents at stvn's suggestion,
  • updated the results listing, and
  • removed a fantastically noisy debugging statement that I inadvertently left in. (Apologies to anyone who ran the script and got barraged with raw data.)

Markus

Replies are listed 'Best First'.
Re^4: Help update the Phalanx 100
by stvn (Monsignor) on Dec 23, 2004 at 13:49 UTC
    # Exclude downloads from agents matching this regex, because they seem to be
    # related to mirroring or crawling rather than genuine downloads:
    my $rx_agent_ignore = qr/
        \. google \.            |
        \. yahoo  \.            |
        \b LWP::Simple \b       |
        \b MS\ Search \b        |
        \b Webmin \b            |
        \b Wget \b              |
        \b teoma \b
    /x;

    Markus, I may be wrong, but I think that CPAN.pm sometimes uses LWP::Simple to download modules, so excluding it would not be a good idea, even though there is a good chance it could also be a spider.

    -stvn
      Thanks, stvn! I've updated the code and results accordingly.

      Markus

Re^4: Help update the Phalanx 100
by petdance (Parson) on Jan 09, 2005 at 05:30 UTC
    Wget is absolutely a valid agent. It's what I use to download stuff to the command line so I can install the module.

    And Webmin is a package that people use for web-based maintenance. It's not a bot. Its downloads need to stay in the counts, too.

    xoxo,
    Andy
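If the suggestions in these replies were all applied, the ignore list might look something like the sketch below. This is not Markus's updated code, just the posted regex with LWP::Simple, Webmin and Wget left out; the user-agent strings are made up for illustration.

```perl
use strict;
use warnings;

# Sketch only: the posted ignore list minus LWP::Simple, Webmin and Wget.
my $rx_agent_ignore = qr/
    \. google \.            |
    \. yahoo  \.            |
    \b MS\ Search \b        |
    \b teoma \b
/x;

# Made-up user-agent strings, not taken from the real logs:
my @agents = (
    'Wget/1.9.1',
    'LWP::Simple/5.803',
    'Webmin',
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
);

# Show which agents the filter would drop and which it would count.
for my $agent (@agents) {
    printf "%-8s%s\n", ($agent =~ $rx_agent_ignore ? 'ignore' : 'count'), $agent;
}
```

Note that the match is case-sensitive, as in the original, so an agent string would have to contain the names exactly as written to be ignored.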
