Not everything that can be counted counts, and not everything that counts can be counted
What's measured improves
Three recent events got me thinking about software metrics again:
I'm interested to learn:
I've done a bit of basic research on these topics, which I present below.
Software Metric Gaming
Key performance indicators can also lead to perverse incentives and unintended consequences as a result of employees working to the specific measurements at the expense of the actual quality or value of their work. For example, measuring the productivity of a software development team in terms of source lines of code encourages copy and paste code and over-engineered design, leading to bloated code bases that are particularly difficult to maintain, understand and modify.
-- Performance Indicator (wikipedia)
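To make the lines-of-code gaming concrete, here is a minimal sketch (the function and code snippets are hypothetical, purely for illustration) of a naive SLOC "productivity" score, showing how copy-and-paste scores higher than a clean call to existing code:

```python
# A naive lines-of-code "productivity" score, and how copy-paste games it.
# All function names and code snippets here are hypothetical, for illustration only.

def sloc(source: str) -> int:
    """Count non-blank, non-comment lines (a crude SLOC measure)."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )

original = """
def total(prices):
    return sum(prices)
"""

# Reuse the existing function: better code, smaller SLOC "contribution".
reuse = original + """
def total_with_tax(prices, rate):
    return total(prices) * (1 + rate)
"""

# Copy, paste, and tweak: duplicated logic, but a bigger SLOC "contribution".
copy_paste = original + """
def total_with_tax(prices, rate):
    subtotal = sum(prices)
    return subtotal * (1 + rate)
"""

print("SLOC when reusing code: ", sloc(reuse))       # scores lower
print("SLOC when copy-pasting: ", sloc(copy_paste))  # scores higher
```

Rewarding the bigger number rewards the worse code; the metric measures typing, not value.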
"Thank you for calling Amazon.com, may I help you?" Then -- Click! You're cut off. That's annoying. You just waited 10 minutes to get through to a human and you mysteriously got disconnected right away. Or is it mysterious? According to Mike Daisey, Amazon rated their customer service representatives based on the number of calls taken per hour. The best way to get your performance rating up was to hang up on customers, thus increasing the number of calls you can take every hour.
Software organizations tend to reward programmers who (a) write lots of code and (b) fix lots of bugs. The best way to get ahead in an organization like this is to check in lots of buggy code and fix it all, rather than taking the extra time to get it right in the first place. When you try to fix this problem by penalizing programmers for creating bugs, you create a perverse incentive for them to hide their bugs or not tell the testers about new code they wrote in hopes that fewer bugs will be found. You can't win.
Don't take my word for it: read Austin's book and you'll understand why this measurement dysfunction is inevitable whenever you can't completely supervise workers (which is almost always the case).
The anecdotes above are just the tip of the iceberg. I've heard many stories over the years of harmful gaming of metrics. It is clear that you should not introduce metrics lightly. It seems best to either:
Performance Appraisals
At a recent Agile metrics panel discussion, I was a bit surprised that everyone agreed that their teams had some "rock stars" and some "bad apples". And that "everyone knew who they were". And that you didn't need metrics to know!
That's been my experience too. I've found that, as an active member of the team, you don't need to rely on numbers; you can simply observe how people perform day to day. Combine that with regular one-on-ones and 360-degree reviews from peers and customers, and it is obvious who the high performers are and who needs improvement.
Though I personally feel confident with this process, I admit that it is subjective. I have seen cases where two different team leads gave markedly different scores to the same individual. Of course, those scores were given at different times and for different projects. Still, personality compatibility (or conflict) between the team lead and team member can make a significant difference to the review score, which does seem unfair. Can metrics be used to make the performance appraisal process more objective? My feeling is that it would do more harm than good, as indicated in the "Software Metric Gaming" section above. What do you think?
Software Development Process Metrics
Lean-Agile City runs on folklore, intuition, and anecdotes
-- Larry Maccherone (slide 2 of "The Impact of Agile Quantified")
It's exceptionally difficult to measure software developer productivity, for all sorts of famous reasons. And it's even harder to perform anything resembling a valid scientific experiment in software development. You can't have the same team do the same project twice; a bunch of stuff changes the second time around. You can't have two teams do the same project; it's too hard to control all the variables, and it's prohibitively expensive to try it in any case. The same team doing two different projects in a row isn't an experiment either. About the best you can do is gather statistical data across a lot of teams doing a lot of projects, and try to identify similarities, and perform some regressions, and hope you find some meaningful correlations.
But where does the data come from? Companies aren't going to give you their internal data, if they even keep that kind of thing around. Most don't; they cover up their schedule failures and they move on, ever optimistic.
-- Good Agile, Bad Agile by Steve Yegge
As pointed out by Yegge above, software metrics are indeed a slippery problem. Getting your hands on a high-quality, statistically significant data set is especially problematic.
The findings in this document were extracted by looking at non-attributable data from 9,629 teams
-- The Impact of Agile Quantified by Larry Maccherone
Larry Maccherone was able to solve Yegge's dataset problem by mining non-attributable data from many different teams, in many different organisations, across many different countries. While I found Larry's results interesting and useful, this remains a slippery problem because each team and its context is unique.
Each project's ecosystem is unique. In principle, it should be impossible to say anything concrete and substantive about all teams' ecosystems. It is. Only the people on the team can deduce and decide what will work in that particular environment and tune the environment to support them.
-- Communicating, cooperating teams by Alistair Cockburn
By all means learn from Maccherone's overall results. But also think for yourself. Reason about whether each statistical correlation applies to your team's specific context. And Larry strongly cautions against leaping to conclusions about root causes.
Correlation does not necessarily mean Causation
The findings in this document are extracted by looking for correlation between “decisions” or behaviors (keeping teams stable, setting your team sizes to between 5 and 9, keeping your Work in Process (WiP) low, etc.) and outcomes as measured by the dimensions of the SDPI. As long as the correlations meet certain statistical requirements we report them here. However, correlation does not necessarily mean causation. For example, just because we show that teams with low average WiP have 1/4 as many defects as teams with high WiP, doesn’t necessarily mean that if you lower your WiP, you’ll reduce your defect density to 1/4 of what it is now. The effect may be partially or wholly related to some other underlying mechanism.
-- The Impact of Agile Quantified by Larry Maccherone
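To illustrate the caveat, here is a minimal simulation sketch (all numbers are invented for illustration, not taken from Maccherone's data set) in which a hidden factor, say team maturity, drives both WiP and defect density. The two measures end up strongly correlated even though, in the model, changing WiP alone would not change defects at all:

```python
# A hidden confounder can produce a strong WiP/defect correlation with no causal
# link between them. All numbers below are invented, purely for illustration.
import random

random.seed(42)

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

wip, defects = [], []
for _ in range(1000):                      # 1000 simulated teams
    maturity = random.random()             # hidden cause: 0 = ad hoc, 1 = disciplined
    wip.append(20 - 15 * maturity + random.gauss(0, 2))     # mature teams keep WiP low
    defects.append(8 - 6 * maturity + random.gauss(0, 1))   # mature teams also ship fewer defects

print(f"correlation(WiP, defect density) = {pearson(wip, defects):.2f}")
# Prints a strong positive correlation, yet in this model forcing WiP down
# would not touch defects: maturity is the underlying mechanism for both.
```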
"Best Practices"
There are no best practices. Only good practices in context.
-- Seven Deadly Sins of Agile Measurement by Larry Maccherone
I've long found the "Best Practice" meme puzzling. After all, it is impossible to prove that you have truly found the "best" practice. So I welcomed Maccherone's opening piece of advice: in a complex, empirical process such as software development, the best you can hope for is a good practice for a given context, one you should always be seeking to improve.
A common example of "context" is the set of business and economic drivers. If your business demands very high quality, your "best practice" may well be four-week iterations, whereas if higher productivity matters more than quality, your "best practice" may be one-week sprints instead (see the "Impact of Agile Quantified Summary of Results" section below for iteration-length metrics).
Team vs Individual Metrics
From the blog cited by Athanasius:
(From US baseball): In short, players play to the metrics their management values, even at the cost of the team.

Yes. Larry Maccherone mentioned a similar anecdote from US basketball, where a star player had a very high individual scoring percentage ... yet the statistics showed that the team actually won more often when the star player was not playing! Larry felt this was because the star often took low-percentage shots to boost his individual score rather than pass to a teammate in a better position to score.
Finding the Right Metrics
More interesting quotes from this blog:
The same happens in workplaces. Measure YouTube views? Your employees will strive for more and more views. Measure downloads of a product? You’ll get more of that. But if your actual goal is to boost sales or acquire members, better measures might be return-on-investment (ROI), on-site conversion, or retention. Do people who download the product keep using it, or share it with others? If not, all the downloads in the world won’t help your business.
In the business world, we talk about the difference between vanity metrics and meaningful metrics. Vanity metrics are like dandelions – they might look pretty, but to most of us, they're weeds, using up resources, and doing nothing for your property value. Vanity metrics for your organization might include website visitors per month, Twitter followers, Facebook fans, and media impressions. Here's the thing: if these numbers go up, it might drive up sales of your product. But can you prove it? If yes, great. Measure away. But if you can't, they aren't valuable.
Good metrics have three key attributes: their data are consistent, cheap, and quick to collect. A simple rule of thumb: if you can't measure results within a week for free (and if you can't replicate the process), then you’re prioritizing the wrong ones.
Good data scientists know that analyzing the data is the easy part. The hard part is deciding what data matters.
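To make the downloads-versus-retention point concrete, here is a minimal sketch (the event format and the 7-day window are my own assumptions, not from the quoted blog) contrasting a vanity count with a simple retention rate:

```python
# A vanity metric (downloads) versus a more meaningful one (7-day retention).
# The event format and the 7-day window are illustrative assumptions.
from datetime import date, timedelta

# (user_id, event, day) -- hypothetical product analytics events
events = [
    ("u1", "download", date(2014, 7, 1)),
    ("u1", "use",      date(2014, 7, 6)),
    ("u2", "download", date(2014, 7, 1)),
    ("u3", "download", date(2014, 7, 2)),
    ("u3", "use",      date(2014, 7, 3)),
]

downloads = {(user, day) for user, event, day in events if event == "download"}
uses = [(user, day) for user, event, day in events if event == "use"]

# Vanity metric: the raw download count looks healthy on its own.
print("downloads:", len(downloads))

# Meaningful metric: how many downloaders actually come back within a week?
retained = {
    user
    for user, day in downloads
    if any(u == user and day < d <= day + timedelta(days=7) for u, d in uses)
}
print("7-day retention:", len(retained) / len(downloads))
```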
Schwaber recommends measuring:
Agile Measurement Checklists
Larry Maccherone's Seven Deadly Sins (and Heavenly Virtues) of Agile Measurement:
Hank Marquis's Seven Dirty Little Truths About Metrics cautions that metrics must derive from, and align with, business goals and strategies, and that metrics should be selected only after understanding the needs each metric addresses.
Further advice from Hank:
Impact of Agile Quantified Summary of Results
Maccherone's results were reported with regard to the following four dimensions of performance:
Three further "fuzzier" metrics (often measured via lightweight surveys) are currently under development, namely:
Stable teams resulted in:
Recommendations:
Estimating:
Recommendations:
Work in Process (WiP) is the number of work items that are in process at the same time.
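As a minimal sketch of how average WiP might be computed from a team's work-item log (the field names and sample data are hypothetical), count the items whose in-process interval covers each day and average over the reporting window:

```python
# Average WiP computed from a work-item log: count the items whose in-process
# interval covers each day, then average over the reporting window.
# Field names and sample data are hypothetical, for illustration only.
from datetime import date, timedelta

# (started, finished) for each work item; finished=None means still in process.
items = [
    (date(2014, 7, 1), date(2014, 7, 3)),
    (date(2014, 7, 2), date(2014, 7, 8)),
    (date(2014, 7, 4), None),
    (date(2014, 7, 7), date(2014, 7, 9)),
]

def wip_on(day):
    """Number of items in process on the given day."""
    return sum(1 for start, end in items if start <= day and (end is None or day < end))

window = [date(2014, 7, 1) + timedelta(days=offset) for offset in range(10)]
print("daily WiP:  ", [wip_on(day) for day in window])
print("average WiP:", sum(wip_on(day) for day in window) / len(window))
```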
Teams that aggressively control WiP:
Recommendations:
Small teams (of 1-3 people) have:
Recommendations:
Iteration Length:
Testers:
Motivation:
Co-location:
Other Articles in This Series
References
Updates:
Added new sub-sections to the "Summary of Results" section: Iteration Length, Testers, Motivation, Co-location.
23-July-2014: Added new sections: Team vs Individual Metrics, Finding the Right Metrics.
23-Nov-2014: Added Schwaber-recommended metrics.