The data set I'd love to get is the number of nodes and the sum of node reputations, split into initial posts and replies, for each category of PerlMonks. If I had that by user, plus user XP and maybe even the date each user joined, that would be a fantastic data set.
The reason "by user" helps is that it makes it easy to clear out outliers like the nodereaper and zombies. For anonymity, the data set doesn't even need to include the user name or home-node id -- though that doesn't really protect the anonymity of the Saints in our book. If by-user data (even masked) isn't sufficiently anonymous, then the same stats summarized by monk level would be sufficient, as long as the vroom/antivroom/nodereaper/zombie accounts were stripped out first.
Does that address the anonymity concern?
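To make the fallback concrete, here is a rough sketch of the "summarized by monk level" version. It is illustration only and assumes a lot: the row layout, the field names (`user`, `level`, `category`, `kind`, `nodes`, `rep_sum`), the sample values, and the exclusion list are all made up, not anything PerlMonks actually exports. Zombie accounts would have to be added to the exclusion set by whoever has the real list.

```python
from collections import defaultdict

# Hypothetical exclusion list (lowercased); zombie accounts would be added here.
EXCLUDED = {"vroom", "antivroom", "nodereaper"}


def summarize_by_level(rows, excluded=EXCLUDED):
    """Sum node counts and reputations per (level, category, kind).

    Rows belonging to excluded accounts are dropped before summing, so the
    result carries no per-user information at all.
    """
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        if row["user"].lower() in excluded:
            continue  # strip outlier accounts first
        key = (row["level"], row["category"], row["kind"])
        totals[key][0] += row["nodes"]
        totals[key][1] += row["rep_sum"]
    return {key: tuple(value) for key, value in totals.items()}


# Made-up sample rows: one row per user, category, and post kind.
rows = [
    {"user": "monk_a", "level": "Friar", "category": "SoPW",
     "kind": "reply", "nodes": 10, "rep_sum": 120},
    {"user": "monk_b", "level": "Friar", "category": "SoPW",
     "kind": "reply", "nodes": 5, "rep_sum": 40},
    {"user": "NodeReaper", "level": "Saint", "category": "SoPW",
     "kind": "reply", "nodes": 900, "rep_sum": -500},
]

summary = summarize_by_level(rows)
```

The point of doing the exclusion before the aggregation is that once the numbers are rolled up by level, the outliers can no longer be separated out.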
Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.