Re (tilly) 1: An exercise in "That's not the way I'd do it" (TNTWIDI)

First of all I really like the idea of this thread. I think that the process of reviewing code is very instructive. This is particuarly true when experienced programmers review each other's code.

I hope you don't mind, but I picked a simpler piece of code, xstatswhore.pl to work with. It is at least related...

You have documentation aimed at users embedded in the code using #. If you put the same documentation in pod format it can be extracted with the perldoc utility.
You have key variables like $pm_site which are scoped for the whole program basically so that you can issue a request in one function and then do error handling in another. I prefer to put the error handling where the action happens, and if I can move that variable to a tighter scope, that is fine by me.
You have a known problem in your data - the home node is considered an article and you don't want it to be. Again there is action at a distance with the fix for the issue appearing nowhere near where the problem is. Were vroom to fix the bug on his end, you would also now be correcting for a non-existent bug and you would now get the wrong answer. I would try to find the home node and delete it.
The indenting on your XML parsing goes very deep. If you are going to leave the logic inline, using a standard small indent for each level would make it more readable for me. I keep to 80 columns.
On a related note, I know mirod will dislike me for saying this, but every time I look at the XML approach to a problem I have to wonder why it is so hard to work with. I am seriously tempted to handroll a solution to kill that dependency. I have had bad luck in the past on XML::* upgrades, and many have (as you note) trouble installing them. Plus I bet it would be faster.
In get_article_list, why do you pull the node in and then call it through @_ rather than using the variable that you have in scope?
I agree with you that max and min would be nice to have as native functions. But you can get them from List::Util, and that is good enough for me. Plus the way that it implements them is able to take a list as an argument.
I don't like the hard-coded constants in your summarize function. Also Perl is list-oriented. If you take the approach of first extracting an array of reps and then taking summary statistics on that, then you can simplify that code without running into bugs on users with only negative reps.
The histogram code is buggy in at least 2 ways. First of all the maximum rep has nothing to do with the widest bin. Secondly the home node is an extra article with a rep of 0. The second one shows, again, the dangers of fixing a bug at a distance from where it enters your code. Then you have to always remember it and fix it at a distance in multiple places.
In show_histogram I don't like the naming of the hash. On page 8 of Camel 2 there is the advice that a hash lookup is linguistically "of". So you say that $wife{Adam} = "Eve";. Given that the hash is has the size of the bin for its value, it deserves a name like %bin_count.
I do not like the way that show_histogram loops. In my experience looping by explicit incrementing values (particularly if you are incrementing several in parallel) is easier to get wrong than looping over a range.
A matter of personal taste. I tend to leave out tests on numbers of arguments. Then again I tend to like APIs with variable number of arguments. And if I want to be sure that the caller knows my API, I am likely to hack up a named argument approach - in my experience I am less likely to forget how many arguments I have than I am to forget what order they go in. So I omitted them, but that is very much a YMMV item.

And I guess that if I am going to say all of that, I should show what the revised version looks like:

#!/usr/local/bin/perl -w

use LWP::Simple;
use XML::Twig;
use URI::Escape;
use HTML::Entities;
use List::Util qw(min max sum);
use Carp;
use Getopt::Std;
use POSIX qw(ceil floor);
use strict;

{
  my %args;
  getopts('u:p:b:', \%args);
  # Replace these dies with username/password
  my $user = $args{u} ||  # You can put your name here
    die("You need to specify the user with -u");
  my $pass = $args{p} ||  # You can put your password here
    die("You need to specify the password with -p");
  my $bin_size = $args{b} || 5;
  my $articles = get_article_list($user, $pass);
  my $summary = summarize ($user, $articles);
  show_reps($summary);
  if (0 < $bin_size) {
    show_histogram($bin_size, $articles, $summary);
  }
}

# Takes user name/password, returns a hash of articles
# Sorry mirod, every time I look at this kind of thing,
# I have to wonder *why* programming for XML always
# seems so much harder than it needs to be.
sub get_article_list {
  my ($user, $pass) = @_;
  my %node_info;
  my $twig = new XML::Twig(
    TwigRoots => {
      NODE => sub {
        my ($t, $node) = @_;
        my $title = decode_entities($node->text());
        $title =~ s/&apos;/'/g; # Why is this missed?
        my %rec = (
          id => $node->att('id'),
          rep => $node->att('reputation'),
          title => $title,
          date => $node->att('createtime'),
        );
        if (exists $node_info{$rec{id}}) {
          confess("Duplicated id '$rec{id}'");
        }
        else {
          $node_info{$rec{id}} = \%rec;
        }
        $t->purge();
      },
    }
  );

  $twig->parse(
    get_pm_node("User nodes info xml generator", $user, $pass)
  );
  
  unless (%node_info) {
    confess("Couldn't fetch article info for $user with pass $pass");
  }
  
  # Vroom includes your home node as an article. :-(
  my $home_id = min (keys %node_info);
  if ($node_info{$home_id}{title} eq $user) {
    delete $node_info{$home_id};
  }
  else {
    warn("Is $home_id your home id?  Presuming not.\n");
  }

  return \%node_info;
}

# Takes a node name, plus optional name/password, and fetches the page
sub get_pm_node {
  my $node = uri_escape(shift);
  my $url = "http://www.perlmonks.org/index.pl?node=$node";
  if (@_) {
    my $user = uri_escape(shift);
    my $passwd = uri_escape(shift);
    $url .= "&op=login&user=$user&passwd=$passwd";
  }
  # Why doesn't LWP::Simple preserve the error?
  return get($url) or confess("Cannot fetch '$url'");
}

sub show_histogram {
  my $bin_size = shift;
  my $articles = shift;
  my %bin_count;
  foreach my $rep (map {$_->{rep}} values %$articles) {
    my $bin = floor( ($rep + 0.5) / $bin_size);
    ++$bin_count{$bin};
  }
  
  # Try to keep on one page
  my $width = 50;
  my $max_count = max (values %bin_count);
  my $scale = ceil($max_count / $width);
  
  print "  Reputation   Article Count\n";
  print "------------- -------", "-" x 50, "\n";

  my @bins = sort {$a <=> $b} keys %bin_count;
  foreach my $bin (min(@bins)..max(@bins)) {
    my $count = $bin_count{$bin} || 0;
    my $extra = ($count % $scale) ? '.' : '';
    my $start = $bin * $bin_size;
    my $end = $start + $bin_size - 1;
    printf "%4d .. %4d  \[%4d\] %s$extra\n",
      $start, $end, $count,
      '#' x floor ($count / $scale);
  }
  print "\n  Scale: #=$scale\n\n" if $scale > 1;
}

# This is essentially untouched.
sub show_reps {
   my $hsummary = shift;

   print  "\n";
   printf ("              User: %s\n",    $hsummary->{username});
   printf ("    Total articles: %d\n",    $hsummary->{articles});
   printf ("  Total reputation: %d\n",    $hsummary->{reputation});
   printf ("    Min reputation: %d\n",    $hsummary->{repmin});
   printf ("    Max reputation: %d\n",    $hsummary->{repmax});
   printf ("Average reputation: %3.2f\n", $hsummary->{average});
   print  "\n";
}

sub summarize {
  my $user = shift;
  my $articles = shift;
  my @reps = map {$_->{rep}} values %$articles;
  unless (@reps) {
    confess("No articles found.");
  }
  my %summary = (
    articles => scalar(@reps),
    repmax => max(@reps),
    repmin => min(@reps),
    reputation => sum(@reps),
    username => $user,
  );
  $summary{average} = $summary{reputation} / $summary{articles};
  return \%summary;
}

__END__

=head1 NAME

t_statswhore.pl - get summary statistics

=head1 SYNOPSIS

t_statswhore.pl C<-u> <user> C<-p> <password> [C<-b> <bin size>]

=head1 DESCRIPTION

This program goes to www.perlmonks.org and generates
summary statistics about your posts.  This only works
for your own account since reps are proprietary.

If you wish you can embed your name and password into the
code instead of passing them on the command line.

=head1 DEPENDENCIES

Quite a few.  LWP::Simple, XML::Twig, List::Util,
URI::Escape, and HTML::Entities.

=head1 TIPS

An interested developer should be able to fairly easily
remove the dependency on XML::Twig.  URI::Escape is
largely gratuitous, and the main purpose of
HTML::Entities is to make sure that people with strange
names can have their home node identified and removed
from the article list.

=head1 AUTHOR AND COPYRIGHT

Written by Ben Tilly based on the original by J.C.Wren.

Copyright 2000, 2001 J.C.Wren  jcwren@jcwren.com
Copyright 2001 by Ben Tilly

No rights reserved.  But jcwren likes hearing about
people using it.
[download]

Comment on Re (tilly) 1: An exercise in "That's not the way I'd do it" (TNTWIDI) Download Code


There's more than one way to do things
	PerlMonks