Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Help update the Phalanx 100

by petdance (Parson)
on Dec 21, 2004 at 00:58 UTC ( #416363=perlmeditation: print w/replies, xml ) Need Help??

The Phalanx 100 is a list of the "top 100" modules on CPAN, and by extension, those that should have the most attention paid to them by the Phalanx project.

The first time I generated the P100 was over a year ago, and things are old and stale. Distributions have changed names (CGI::Kwiki is now Kwiki, for example). Some distros have come and some have gone. It's time to be updated.

This time, YOU can help determine the P100. The source data, generated from logs from the main CPAN mirror at pair.com, is available for download at http://petdance.com/random/cpan-gets.gz. Write code that analyzes the data, and generates the top 100 modules.

What should your code do? It's up to you! Publish the code somewhere (use.perl.org, perlmonks, whatever) and let me see it. I'm not sure if I'll take someone's decisions directly, or use ideas, or how I'll do it, but the more working code I have to pick from, the better.

Also, the last time I created a P100, I omitted any modules that were in the core distribution. This time, I do want to include core modules, although I do want to have them noted somehow. Richard Clamp's Module::CoreList will be a great help with this.

Whatever you do, however you do it, I need to know about your code no later than January 10th, 2005. Email me at andy at petdance.com. There's going to be an article about the Phalanx project going up on perl.com soon after that, and I need to have an updated version of the P100 up (replacing http://qa.perl.org/phalanx/distros.html) by then.

About the data

I used the following code to analyze data from the Apache logs for the main CPAN mirror at Pair.com from November 1 to December 15th, 2004.

#!/usr/bin/perl use strict; use warnings; my %id; my $next_id = 10000; while (<>) { next unless m!^\S+ (\S+) .+ "GET ([^"]+) HTTP/\d\.\d" 200!; my ($ip,$path) = ($1,$2); study $path; # Skip directories next if $path =~ /\/$/; # Directory next if $path =~ /\/\?/; # Directory with sort parms # Skip certain directories next if $path =~ /^\/(icons|misc|ports|src)\//; # Skip certain file extensions next if $path =~ /\.(rss|html|meta|readme)$/; # Skip CPAN & distro maintenance stuff next if $path =~ /CHECKSUMS$/; next if $path =~ /MIRRORING/; # Module list stuff next if $path =~ /\Q00whois./; next if $path =~ /\Q01mailrc./; next if $path =~ /\Q02packages.details/; next if $path =~ /\Q03modlist./; my $id = ($id{$ip} ||= ++$next_id); print "$id $path\n"; }
This gives lines like this:
16395 /authors/id/K/KE/KESTER/WWW-Yahoo-DrivingDirections-0.07.tar +.gz 10001 /authors/id/K/KW/KWOOLERY/Buzznet-API-0.01.tar.gz 85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.01.tar.gz 85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.02.tar.gz 85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.03.tar.gz
The 5-digit number is an ID number for a given IP address. I found that some IPs were routinely slurping down entire histories of modules, which probably will skew statistics to those with a lot of revisions.

How should these be accounted for in the analysis? I don't know. That's one of the reasons that I put this out for all to work on.

I welcome your comments, suggestions and help on this.

xoxo,
Andy

Replies are listed 'Best First'.
Re: Help update the Phalanx 100
by xdg (Monsignor) on Dec 21, 2004 at 03:48 UTC

    Would it be possible to regenerate this data including a datestamp? That would facilitate filtering out mass revision downloads, mirroring, etc. As a rough algorithm, I'd guess that counting each distribution (regardless of revision) once per IP per day would be a reasonable first-approximation.

    -xdg

    Code posted by xdg on PerlMonks is public domain. It has no warranties, express or implied. Posted code may not have been tested. Use at your own risk.

      As a rough algorithm, I'd guess that counting each distribution (regardless of revision) once per IP per day would be a reasonable first-approximation.

      Part of me agrees that revisions should probably be ignored, but then the counts will likely skew upwards if there are many revisions. However that skew could be viewed an indicator that the modules development is being followed regularly and not just downloaded as a curiosity. It's a difficult one.

      As for the "once per IP" idea, I have found with log analysis that IPs can be deceiving because of proxy servers. So I think that we should be wary of weighting the IPs too much.

      -stvn
      OK, you've now got the timestamp and the user agent. Lotsa Googlebot action there.

      xoxo,
      Andy

Re: Help update the Phalanx 100
by stvn (Monsignor) on Dec 21, 2004 at 04:47 UTC

    So, I am giving some thought to this,... but in the meantime I thought it would be interesting to see just some very basic raw download counts. In particular to see raw download counts on my own modules. (After all, CPAN doesn't automatically give this type of info and its not everyday you get access to nicely cleaned up logs like this) So I whipped this up, just grep the raw log for a PAUSE id and pipe the output into this script (passing in 'module' or 'distro' or nothing for both).

    BTW - The reg-exp for extracting the module name is horrible and could use a lot of help (*hint* *hint*).

    -stvn
Re: Help update the Phalanx 100
by xdg (Monsignor) on Dec 21, 2004 at 13:05 UTC

    This process strikes me as one that is a mix of art and science. An additional thought is that you might wish to cross-check the "top 100" from CPAN logs with the modules that are packaged for inclusion in various *nix distributions, which indicates some degree of demand. Likewise, if you can get access to ActiveState PPM logs, that would be another valuable source of data.

    -xdg

    Code posted by xdg on PerlMonks is public domain. It has no warranties, express or implied. Posted code may not have been tested. Use at your own risk.

Re: Help update the Phalanx 100
by xdg (Monsignor) on Dec 21, 2004 at 16:11 UTC

    Procrastination is a joy in the week before Christmas...

    I've taken a quick and dirty cut at it (please, no deductions for bad style or inefficiency). Like stvn said, the regex parts could use some work, but, frankly, there are so many special cases of how people name and number stuff that it's hard to cover everything. (Can anyone post what PAUSE and CPAN use internally?) The basic logic I use is this:

    • After seeing the dates and UA's and running some basic counts on the number of hits by IP, I decided to simply drop as outliers any IPs in the top 1% of IPs by hitcount. The cutoff winds up being about 64 hits over the period studied.
    • Only process things ending in .tar.gz or .tgz to avoid lots of non distribution stuff (scripts, readme, .sig, etc.)
    • Strip off the .tar.gz or .gz; Strip off the distribution version. (Ugly hack of a regex, given all the varations people use but s/(-|_)[0-9][0-9a-z._]+$// seems to work decently.)
    • Strip a trailing .pm, more for style than anything else
    • Find the 1% cutoff of IP's
    • Count the number of distribution downloads (but only if the downloading IP isn't past the cutoff.)
    • Print the top 100

    There's probably some additional, case-by-case, cleanup that could be done. (E.g., "perl" itself is in the top 100), but I think this is a decent start. Code and results follow.

    -xdg

    Code posted by xdg on PerlMonks is public domain. It has no warranties, express or implied. Posted code may not have been tested. Use at your own risk.

      So I was giving this some thought on my commute this morning as I was stuck in traffic, and the list produced by your code actually has helped convince me even more. I think that there is a problem with ignoring the revision number.

      As I looked over your list, I noticed at the bottom was a module of mine IOC. Now, rather then be flattered by this, I know that it's height on the list is quite artificial. I decided to adopt the XP idea of "release early and release often" with this module. The first released version (0.01) was on Oct. 15th of this year, and there have been 18 subsequent versions released, the last one on the 15th of Dec. and at least 10 versions have been released within the range of this log file (Nov. 1 - Dec. 15th). When I ran my script (see above) like this:

      grep 'IOC' ~/Desktop/cpan-gets | perl test.pl
      I got the following output:
      +--------------------------------------- | Total Downloads by Module +------+-------------------------------- | 609 | IOC +--------------------------------------- | Total Downloads by Distro +------+-------------------------------- | 10 | IOC-0.06.tar.gz | 10 | IOC-0.01.tar.gz | 10 | IOC-0.17.tar.gz | 10 | IOC-0.03.tar.gz | 10 | IOC-0.04.tar.gz | 10 | IOC-0.05.tar.gz | 11 | IOC-0.02.tar.gz | 18 | IOC-0.07.tar.gz | 44 | IOC-0.09.tar.gz | 46 | IOC-0.13.tar.gz | 50 | IOC-0.12.tar.gz | 52 | IOC-0.14.tar.gz | 54 | IOC-0.10.tar.gz | 59 | IOC-0.08.tar.gz | 64 | IOC-0.15.tar.gz | 66 | IOC-0.11.tar.gz | 85 | IOC-0.16.tar.gz +------+--------------------------------
      Clearly this module is not one of the top 100 on CPAN.

      I think we need to give some thought as to how to include revisions into the analysis. My first thought is to maybe take the number of revisions found on the list, and to use that to somehow weight the results. The more revisions the less weight basically. Another thought is to somehow account for the number of downloads per-revision. As I mentioned above, the fact each revision is being downloaded shows that someone is following the development of the module, and so that should taken into account.

      In the end I agree, this is going to be a mixture of art and science to come up with these top 100.

      -stvn

        That's a very good point and a great example. I tried two more cuts. One is top 100 by average number of downloads per revision. The second is based on the vector sum (sqrt(x**2 + y**2)) of total downloads and average per revision. (Technically, I took the log of the total to flatten the skew, and normalized both metrics to a maximum of 100 before taking the vector sum). That latter one is probably pretty good -- it accounts for both criteria. Depending on one's bias, one could weight the two factors differently in the sum.

        Results follow. Code for each of my three variations is available from my subversion repository.

        -xdg

        Code posted by xdg on PerlMonks is public domain. It has no warranties, express or implied. Posted code may not have been tested. Use at your own risk.

        I knocked up a quick program after reading Andy's comments on use.perl, and before finding this thread. It uses three ideas to eliminate the noise: it ignores all clients that download more than a hundred modules in a day; it ignores certain user agents that look like spiders; and it looks at all versions of a module downloaded each day, and ignores all but the most popular version. This last check is meant to filter out clients that download every version of a module within a few minutes.

        What the program doesn't do yet is to mark standard modules.

        Here are the code and results.

        [~/perl/P100]$ cat rank-modules #!/usr/bin/perl use warnings; use strict; # Read CPAN module-download logs; find the most popular modules. ### # Number of modules to list: sub NrToPrint() {100} # Any address that pulls more than MaxDownloadsPerDay modules in any o +ne day # has all its traffic ignored: sub MaxDownloadsPerDay() {100} # Exclude downloads from agents matching this regex, because they seem + to be # related to mirroring or crawling rather than genuine downloads: my $rx_agent_ignore = qr/ \. google \. | \. yahoo \. | # \b LWP::Simple \b | \b MS\ Search \b | \b Webmin \b | \b Wget \b | \b teoma \b /x; # First pass: build a hash of all client addresses that have downloade +d more # than MaxDownloadsPerDay modules in any one day: my %bigusers; sub find_big_users($) { my $fh = $_[0]; seek $fh, 0, 0 or die "Can't rewind the input file:\n$!\n"; print STDERR "Finding heavy users...\n"; my %hpd; # hits per day: $hpd{client}{date} = number of hits while (<$fh>) { my ($client, $date) = m/ ^ ( \d+ ) \s+ ( [^:]+ ) /x or next; # $hpd{$client}{$date} ||= 0; ++ $hpd{$client}{$date}; } CLIENT: while (my ($client, $rdatehash) = each %hpd) { while (my ($date, $count) = each %$rdatehash) { undef $bigusers{$client}, next CLIENT if $count > MaxDownl +oadsPerDay; } } } # Second pass: ignoring traffic from heavy clients and robotic user ag +ents, # build a hash indexed by date, module and version and yielding a coun +t of # downloads: my $rx_parse = qr! ^ ( \d+ ) # Get client ID \s ( [^:]+ ) # Get date \S+ \s # Skip time / \S+ / # Skip directory ( \w \S*? ) # Get module name - # Skip delimiter ( (?: (?> \d [^.]* ) \.? )+ ) # Get version number \. \S+ \s # Skip file-type suffix " ( .* ) " # Get user agent !x; my $rawdownloads = 0; my $igbig = 0; my $igagent = 0; my $nrlines; sub count_downloads($) { my $fh = $_[0]; seek $fh, 0, 0 or die "Can't rewind the input file:\n$!\n"; print STDERR "Counting downloads...\n"; my %details; while (<$fh>) { my ($client, $date, $module, $version, $agent) = /$rx_parse/o or next; # print; # print "Mod $module, ver $version\n"; ++$rawdownloads; ++$igbig, next if exists $bigusers{$client}; ++$igagent, next if $agent =~ $rx_agent_ignore; ++ $details{$date}{$module}{$version}; } $nrlines = $.; \%details; } # Third pass: if multiple versions of the same module have been reques +ted on the # same day, ignore all but the most popular version for that day. Thi +s avoids # giving extra weight to modules with many historical versions if a cl +ient # downloads all of them. Produce a hash my $filtereddownloads = 0; sub condense_multiple_versions($) { my $rdetails = $_[0]; print STDERR "Analysing...\n"; my %grosscounts; while (my ($date, $rmodhash) = each %$rdetails) { while (my ($module, $rverhash) = each %$rmodhash) { my @vercounts = sort {$a <=> $b} values %$rverhash; $grosscounts{$module} += $vercounts[-1]; $filtereddownloads += $vercounts[-1]; } } \%grosscounts; } # Print the module counts and names in descending order of popularity: sub print_results($) { print STDERR "Using $filtereddownloads out of $rawdownloads downlo +ads on $nrlines lines.\n", "Skipped $igbig from heavy users and a further $igage +nt apparently from robots.\n\n"; my $rcounts = $_[0]; my @sorted = sort {$rcounts->{$b} <=> $rcounts->{$a}} keys %$rcoun +ts; print map {sprintf "%-8d%s\n", $rcounts->{$_}, $_} @sorted[0 .. NrToPrint - 1]; } sub main() { die "$0 <filename>\n" unless @ARGV == 1; my $infile = shift @ARGV; open my $fh, "<$infile" or die "Can't open $infile:\n$!\n"; find_big_users $fh; print_results condense_multiple_versions count_downloads $fh; } main; [~/perl/P100]$ ./rank-modules cpan-gets Finding heavy users... Counting downloads... Analysing... Using 104411 out of 1067155 downloads on 2328070 lines. Skipped 767228 downloads from heavy users and a further 177523 apparen +tly from robots. 2745 DBI 2312 File-Scan 1703 DBD-mysql 1219 XML-Parser 1202 HTML-Parser 1034 libwww-perl 984 GD 944 Gtk-Perl 880 Net_SSLeay.pm 859 Tk 827 DBD-Oracle 793 MIME-Base64 756 URI 751 Apache-ASP 746 Compress-Zlib 654 dmake 643 HTML-Template 640 Digest-MD5 602 Time-HiRes 592 Digest-SHA1 587 Archive-Tar 584 Net-Telnet 577 Template-Toolkit 548 Parallel-Pvm 540 XML-Writer 477 Archive-Zip 467 HTML-Tagset 464 libnet 437 Digest 406 AppConfig 401 MIME-tools 385 MailTools 359 Storable 356 Date-Calc 346 Msql-Mysql-modules 339 Test-Simple 338 CGI.pm 324 Module-Build 320 Spreadsheet-WriteExcel 318 SiePerl 317 perl-ldap 316 Net-DNS 314 DB_File 312 PAR 310 CPAN 310 TermReadKey 297 XML-Simple 297 IO-String 292 TimeDate 291 GDGraph 289 MIME-Lite 287 IO-stringy 287 Crypt-SSLeay 284 Curses 282 DBD-DB2 278 calendar 278 DateManip 277 Net-SNMP 274 Zanas 271 IMAP-Admin 270 MD5 268 ssltunnel 258 sms 257 Digest-HMAC 255 GDTextUtil 252 DBD-ODBC 252 DBD-Pg 245 gmailarchiver 245 IO-Socket-SSL 240 Data-Dumper 239 Mail-Sendmail 232 IOC 225 OLE-Storage_Lite 223 keywordsearch 217 ExtUtils-MakeMaker 206 XML-SAX 205 reboot 200 chres 199 Convert-ASN1 196 App-Info 196 Event 194 CGIscriptor 189 linkcheck 187 Test-Harness 184 glynx 184 Verilog-Perl 181 XLinks 180 Bit-Vector 179 mod_perl 178 SOAP-Lite 176 Expect 174 XML-DOM 174 MARC-Detrans 174 DBD-Sybase 173 Mail-SpamAssassin 172 Excel-Template 172 check_ftp 172 Compress-Zlib-Perl 171 Parse-RecDescent 171 Carp-Clan [~/perl/P100]$

        Update 23 Dec 2004:

        I have:

        • removed LWP::Simple from the list of ignorable user agents at stvn's suggestion,
        • updated the results listing, and
        • removed a fantastically noisy debugging statement that I inadvertently left in. (Apologies to anyone who ran the script and got barraged with raw data.)

        Markus

      Wow, excluding the top 1% of downloaders. Brilliant.

      Something else to think about: I'm not concerned about absolute rankings so much as developing strata, as in http://qa.perl.org/phalanx/distros.html. I see this sort of like those "Greatest Albums Of All Time" lists. Maybe you can argue about whether Let It Bleed should come before or after Abbey Road, but both belong in the top 10, well before Pleased To Meet Me or Sign O' The Times.

      xoxo,
      Andy

        In my occasional noodlings on this topic, I've wondered what the dependency graph looks like. Which modules are most frequently used in other modules? (Including the recursion -- if A uses B and B uses C, D, and E, then the existence of A should increment the dependency count of C, D, and E, too.) Presumably, core modules would have the most links, but there are likely a second strata of heavily used utility modules, and so on out to narrow, single-purpose modules for particular applications. (Though those could also be very popular and worth of inclusion in a top 100 list.)

        -xdg

        Code posted by xdg on PerlMonks is public domain. It has no warranties, express or implied. Posted code may not have been tested. Use at your own risk.

Re: Help update the Phalanx 100
by petdance (Parson) on Dec 21, 2004 at 05:56 UTC
Re: Help update the Phalanx 100
by DrHyde (Prior) on Dec 21, 2004 at 10:21 UTC
    This sounds like a job for ... AWK!

    As an aside, it would be interesting to see similar data for a non-US mirror, or indeed for several mirrors from different countries. Several modules are US-specific (eg Geo::Postalcode) and others are of little interest to USians.

Re: Help update the Phalanx 100
by stvn (Monsignor) on Dec 21, 2004 at 17:12 UTC

    I was talking to dragonchild and he brought up a good point, that many of these downloads could be just CPAN smoke testers. Then talking to Andy, he was unsure if there was any way to tell if they were CPAN smoke testers. Anyone out there have any ideas on how we might be able to tell?

    -stvn
      Don't SMOKE testers send an automated result back to CPAN? If they do, one could decrement for every SMOKE result for a given module ...

      Being right, does not endow the right to be rude; politeness costs nothing.
      Being unknowing, is not the same as being stupid.
      Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
      Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

        Don't SMOKE testers send an automated result back to CPAN? If they do, one could decrement for every SMOKE result for a given module ...
        Ehmm.. I send a CPAN testers result for every module I install with CPANPLUS, 90 % of which is for actual use (the rest is just to test some modules that people are having problems with on this site), so these should probably count.

        Massive smoke-testing shouldn't really influence the top-100 much anyway, since it would presumably make for about the same amount of downloads for each version of each module.

        Golly, then you'd end up with negative numbers. Remember you are given the downloads logs from one site only. CPAN testers download from one of the many CPAN mirrors.

        Maybe those logs could be used to write annoying pop-ups that will display "73% of the people who downloaded <the-module-you-just-downloaded> also like <some-other-module>".

Re: Help update the Phalanx 100
by Maddingue (Sexton) on Dec 26, 2004 at 02:30 UTC
    Here is a first list of the modules that are the most frequently listed as prereqs in META.yml (which is about a third of the distributions present on the CPAN). Dirty scripts which produced this are available on demand.
    87 DBI 63 URI 63 Kwiki 58 Params::Validate 56 Template 56 DateTime 55 LWP::UserAgent 52 POE 46 YAML 41 WWW::Mechanize 40 Class::Accessor 34 LWP 33 XML::LibXML 32 Class::DBI 32 URI::Escape 28 Error 28 HTML::Parser 27 Compress::Zlib 25 mod_perl 25 Parse::RecDescent 24 LWP::Simple 24 Test::Exception 24 Date::Parse 24 XML::Parser 24 Time::Piece 23 Tk 22 Email::Simple 22 HTTP::Request 22 HTML::Entities 22 HTML::Mason 21 File::Find::Rule 21 Test::Pod 20 IO::Scalar 20 Apache::Request 20 Digest::SHA1 20 Clone 19 CGI::Application 18 IO::Zlib 18 File::Slurp 17 XML::Simple 17 UNIVERSAL::require 17 Exception::Class 17 GD 16 DBD::SQLite 16 XML::RSS 15 Class::MethodMaker 15 Cache::Cache 15 Pod::Simple 15 Spiffy 15 Class::Data::Inheritable 15 XML::SAX 14 Image::Size 14 Maypole 14 Carp::Assert 14 HTML::TokeParser 14 Date::Calc 14 Test::Builder::Tester 13 Class::Accessor::Fast 13 Log::Log4perl 13 Object::props 12 HTML::Template 12 CGI::Session 12 DBD::mysql 12 Inline 12 Log::Dispatch 12 XML::XPath 12 HTTP::Cookies 11 SOAP::Lite 11 HTML::TreeBuilder 11 Email::Address 11 XML::LibXSLT 11 Net::LDAP 11 Tie::IxHash 11 Regexp::Common 11 Win32::OLE 11 Term::ReadKey 11 Archive::Tar 10 Net::DNS 10 IO::String 10 HTTP::Date 10 Module::Build 10 CGI::Builder 10 Apache 10 Date::Manip 10 load 10 CGI::Wiki 10 Archive::Zip 10 Wx 10 Apache::Cookie 10 Image::Magick 9 DateTime::Set 9 Gtk2 9 Class::Accessor::Chained 9 Apache::Test 9 IO::All 9 Algorithm::Diff 9 Mail::Sendmail 9 Test::Pod::Coverage 9 Email::MIME 9 Date::Format 9 Mail::Address 9 Math::Symbolic 8 Class::props 8 Email::Valid 8 HTML::TokeParser::Simple 8 Exporter::Lite 8 Apache::Session 8 MP3::Info 8 Locale::Maketext::Simple 7 Object::groups 7 File::Type 7 GraphViz 7 BerkeleyDB 7 Mail::Internet 7 Config::General 7 URI::URL 7 HTTP::Response 7 HTTP::Status 7 Module::Pluggable 7 Text::Template
Re: Help update the Phalanx 100
by gmpassos (Priest) on Dec 23, 2004 at 15:28 UTC
    I think that this list of 100 modules should be generated by voting and not by a log. Important is not to help the modules that are already in use, but get modules that can bring new resources and possibilities for Perl.

    For example, the module wxPerl brings a new world to Perl and really should be in the list, but the log won't show that because the main point where the users get this module is from binaries at SourceForge, since is a module that need to much things to compile our won binaries.

    Also we have a lot of new modules that are very interesting and should be in the list, or CORE modules that are not downloaded, like threads, that are not 100% yet.

    Graciliano M. P.
    "Creativity is the expression of liberty".

      The goal of Phalanx is to have the most commonly used modules on CPAN have a good, solid test suite so that these 100 modules can be used as the test bed for Ponie when it comes around. Ponie's going to need a lot of test material to run against. If DBI doesn't run on Ponie, then Ponie is broken.

      Your reasons make sense, but don't match the purpose behind Phalanx.

      xoxo,
      Andy

Re: Help update the Phalanx 100
by cosimo (Hermit) on Jan 04, 2005 at 13:20 UTC

    Hi Andy,

    I put the log entries in a postgres db, and now I'm running some sql queries with a "run-and-check-result" approach. What I'm trying to do is applying my personal own "common sense" interpretation to that data and see what comes out. Tonight I should see some interesting result.

Put it in a database...
by Anonymous Monk on Jan 10, 2005 at 19:11 UTC

    I worked on this over the holidays. I got something kludged together that generated results in less than 4 minutes using only the ip id's that had been used for 100 or fewer gets. I achieved this by munging the cpan-gets logfile and turning it into a database dump file suitable for import into SQLite or PostgreSQL. Since the problem is fundamentally about munging out the significant data and querying it, using a structured querying language seemed a natural fit. The issue of counting core modules and recursive dependencies is outside the domain of what I chose to focus on.

    I've attached my script. It takes the name of the log file as a parameter and generates a phalanx.dump file in the current working directory. You can then use a command like:

      cat phalanx.dump|sqlite3 phalanx.db
    

    To create the database. It isn't particularly elegant, but it gets the job done.

    Do you realize that 277,373 of the cpan-gets records are duplicates?

    I broke each record out into separate vars for timestamp, ipaddr id, module, module version, and useragent. While doing this for each record, I generated id's for each unique module, module/version, and useragent string. Finally, I generated a unique get id using all of the above as the hash key. I.e. something like:

    while (...) {
      ...
      if ($g{$ts, $ip_id, $mod_id, $modver_id, $agent_id}) {
        $duplicates++;
      } else {
        $g{$ts, $ip_id, $mod_id, $modver_id, $agent_id} = ++$gid;
      }
    }
    

    After processing cpan-gets, printing the value of $duplicates gives: 277373. I highly suspect cpan-gets has duplicate data. It isn't likely after all that the exact same version of a module was downloaded by the exact same IP and UserAgent at the exact same time...

    Here's an SQL query to get the top 100:

    SELECT Module.Name, COUNT(Gets.Module_id)
    FROM Module
    INNER JOIN Gets
       ON Module.ID = Gets.Module_id
    INNER JOIN (
      SELECT IPAddr_id, COUNT(*) as Tally FROM Gets
      GROUP BY IPAddr_id HAVING Tally <= 100
    ) as gip
       ON Gets.IPAddr_id = gip.IPAddr_id
    GROUP BY Module.Name
    ORDER BY COUNT(Gets.Module_id) DESC;
    

    I'd never used SQLite directly before, so I wasn't sure how best to twiddle the indexes to give it reasonable performance. I was very disappointed with the performance of queries using simple joins. Without indexes, the performance is awful. Using the sub-select makes it churn long enough for the powersaving setting on my laptop to kick into hibernation. Probably yanking the sub-select out into a temporary table and putting indexes on that would give better results, but I ran out of time. So, if you're using SQLite, you might want to run the simpified version that follows.

    SELECT Module.Name, COUNT(Gets.Module_id)
    FROM Module
    INNER JOIN Gets
       ON Module.ID = Gets.Module_id
    GROUP BY Module.Name
    ORDER BY COUNT(Gets.Module_id) DESC;
    

    The nice thing about putting the data into a database, is that it then becomes simple for people to tinker with and generate their own queries...

    Code follows...

    #!/usr/bin/perl -s -w
    
    our (%month);
    @month{qw{Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec}} = (1..12);
    my ($id, $ts, $path, $agent, $ext, $module, $version);
    
    my (%ip, %a, %m, %mv, %g);
    my ($aid, $mid, $mvid, $gid);
    $aid=$mid=$mvid=$gid=0;
    
    my $count=0;
    my $dups=0;
    while (<>) {
        next unless m/^(\S+) (\S+) (\S+) "(.+)"$/;
        ($id,$ts,$path,$agent) = ($1,$2,$3,$4);
    
        $agent=~s/'/''/g;
        next unless $path =~ m/\.(tar\.gz|par|tgz|zip|pm|pm\.gz)$/;
        $ext = $1;
    
        # DD/MMM/YYYY:HH:mm:SS => YYYY/MM/DD HH:mm:SS
        $ts =~ s/(\d{2})\/(\w{3})\/(\d{4}):(.+)$/$3\/$month{$2}\/$1 $4/;
    
        # /authors/id/S/SP/SPACEMAN/WebJay.pm
        ($module) = $path =~ m/(^\/+)\Q.$ext\E$/;
        ($version) = $module =~ m/(?:-(^-+))$/;
        $module =~ s/(?:-(^-+))$//;
        $module =~ s/\.pm$//;
        $version ||= 'undef';
    
        $ip{$id}||=$id;
        $a{$agent}||=++$aid;
        $m{$module}||=++$mid;
        $mv{$m{$module},$version}||=++$mvid;
        if ($g{$ts,$id,$mid,$mvid,$aid}) {
           $dups++;
        } else {
          $g{$ts,$id,$mid,$mvid,$aid}=++$gid;
        }
    
        if ($count++>=1000) {
    	$count=0;
    	print '.';
        }
    }
           print "$dups duplicates exist\n";
    
    open OUT, '>phalanx.db';
    
    print OUT <<EOF;
    BEGIN TRANSACTION;
    CREATE TABLE IPAddr (ID INTEGER PRIMARY KEY);
    EOF
    print OUT <<"EOF"  for sort {$a<=>$b} keys %ip;
    INSERT INTO "IPAddr" VALUES($_);
    EOF
    
    print OUT "CREATE TABLE Agent (ID INTEGER PRIMARY KEY, Desc VARCHAR(256));\n";
    print OUT <<"EOF"  for sort {$a{$a}<=>$a{$b}} keys %a;
    INSERT INTO "Agent" VALUES($a{$_}, '$_');
    EOF
    
    print OUT "CREATE TABLE Module (ID INTEGER PRIMARY KEY, Name VARCHAR(256));\n";
    print OUT <<"EOF"  for sort {$m{$a}<=>$m{$b}} keys %m;
    INSERT INTO "Module" VALUES($m{$_}, '$_');
    EOF
    
    my $chr28=chr(28);
    
    print OUT "CREATE TABLE ModVer (ID INTEGER PRIMARY KEY, Module_id INTEGER,\n     Version VARCHAR(16));\n";
    for my $key (sort {$mv{$a}<=>$mv{$b}} keys %mv) {
        my ($mid,$version) = split /$chr28/, $key;
        my ($mvid) = $mv{$key};
        print OUT <<"EOF";
    INSERT INTO "ModVer" VALUES($mvid, $mid, '$version');
    EOF
    }
    
    #    $g{$ts,$id,$mid,$mvid,$aid}||=++$gid;
    print OUT "CREATE TABLE Gets (ID INTEGER PRIMARY KEY, ts TIMESTAMP, IPAddr_id INTEGER,\n   Module_id INTEGER, ModVer_id INTEGER, Agent_id INTEGER);\n";
    for my $key (sort {$g{$a}<=>$g{$b}} keys %g) {
        my ($ts,$id,$mid,$mvid,$aid) = split /$chr28/, $key;
        my ($gid) = $g{$key};
        print OUT <<"EOF";
    INSERT INTO "Gets" VALUES($gid, '$ts', $id, $mid, $mvid, $aid);
    EOF
    }
    
    print OUT <<EOF;
    CREATE INDEX fk_module_id ON Gets (Module_id);
    COMMIT;
    EOF
    
    __END__
    
    SELECT Module.Name, COUNT(Gets.Module_ID)
    FROM Module
    INNER JOIN Gets
       ON Module.ID = Gets.Module_id
    INNER JOIN (
      SELECT IPAddr_id, COUNT(*) as Tally FROM Gets
      GROUP BY IPAddr_id HAVING Tally <= 100
    ) as gip
       ON Gets.IPAddr_id = gip.IPAddr_id
    GROUP BY Module.Name
    ORDER BY COUNT(Gets.Module_ID) ASC;
    
    SELECT IPAddr_ID, COUNT(*) FROM Gets GROUP BY IPAddr_ID ORDER BY IPAddr_id ASC;
    

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://416363]
Approved by Old_Gray_Bear
Front-paged by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (3)
As of 2020-07-05 14:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?