The Joy of Legacy Code
Have you ever played a game called Jenga? The idea behind Jenga is that you start by making a tower of blocks. Each player removes a block from somewhere in the tower, and moves it to the top of the tower. The top of the tower looks tidy, but it's very heavy and the bottom of the tower is growing more and more unstable. Eventually, someone's going to take away a block from the bottom and it'll all fall down.
I came into Perl development quite late, and I saw a very intricate, delicate interplay of ideas inside the Perl sources. It amazed me how people could create a structure so complex and so clever, but which worked so well. It was only much later that I realised that what I was seeing was not a delicate and intricate structure but the bottom end of a tower of Jenga. For example, fields in structures that ostensibly meant one thing were reused for completely unrelated purposes, the equivalent of taking blocks from the bottom and putting them on the top.
-- The Tower of Perl by Simon Cozens
The perl5 internals are a complete mess. It's like Jenga - to get the perl5 tower taller and do something new you select a block somewhere in the middle, with trepidation pull it out slowly, and then carefully balance it somewhere new, hoping the whole edifice won't collapse as a result.
The Joy of Legacy Code. I'm sure you've all got your own war stories of legacy code that has grown and grown until it resembles the delicate and fragile Jenga tower lightheartedly described by Cozens and Clark above. Not even Perl Monks has been spared:
The problem isn't an infrastructure issue, however -- and speaking as one of the handful of people who've had a hand in developing the site's software: It's our own gosh-darn fault. Perlmonks is WAY more complex than when it originally launched. It does a crapload of perl evals and sql queries per page. It's vulnerable to resource hogs. Searching can cripple the database. And right now, I don't think we're gonna fix these problems any time soon. ... It's not a matter of computer resources, as much as human engineering resources.
Will Rewriting Help?
Netscape 6.0 is finally going into its first public beta. There never was a version 5.0. The last major release, version 4.0, was released almost three years ago. Three years is an awfully long time in the Internet world. During this time, Netscape sat by, helplessly, as their market share plummeted. It's a bit smarmy of me to criticize them for waiting so long between releases. They didn't do it on purpose, now, did they? Well, yes. They did. They did it by making the single worst strategic mistake that any software company can make: They decided to rewrite the code from scratch.
It's important to remember that when you start from scratch there is absolutely no reason to believe that you are going to do a better job than you did the first time. First of all, you probably don't even have the same programming team that worked on version one, so you don't actually have "more experience". You're just going to make most of the old mistakes again, and introduce some new problems that weren't in the original version.
Now the two teams are in a race. The tiger team must build a new system that does everything that the old system does. Not only that, they have to keep up with the changes that are continuously being made to the old system. Management will not replace the old system until the new system can do everything that the old system does. This race can go on for a very long time. I've seen it take 10 years. And by the time it's done, the original members of the tiger team are long gone, and the current members are demanding that the new system be redesigned because it's such a mess.
As indicated above, a grand rewrite is not necessarily the answer. Indeed, I've seen nothing but disaster whenever companies attempt complete rewrites of large working systems.
Apart from the daunting technical difficulties of performing the large rewrite, there's often substantial cultural resistance to replacing established software, even when the rewrite goes smoothly and introduces significant improvements. Examples that spring to mind here are: GNU Hurd replacing Linux; CPANPLUS replacing CPAN; Module::Build replacing ExtUtils::MakeMaker; Python 3 replacing Python 2; and Perl 6 replacing Perl 5 ... though, admittedly, Subversion and git seem to have faced little resistance from diehard CVS users.
That's not to say it can't be done though. The great Netscape rewrite (ridiculed by Spolsky above) -- though a commercial disaster -- metamorphosed into an open source success story. Another example of a successful rewrite, pointed out by tilly below, is the Perl 5 rewrite of Perl 4.
What About Refactoring?
Well, if rewriting won't help, what are we supposed to do? We surely need to provide a glimmer of hope for those poor souls condemned, day after day, to anxiously poking at a terrifying Jenga tower. Maintaining such a tangled mess is cruel, inefficient, and ultimately unsustainable for any business.
The only humane option left that I can see is to relentlessly refactor legacy code, subsystem by subsystem, continuously and forever. To always keep it clean. To prevent it becoming a tangled tower in the first place. Though such an approach seems sensible to me, it can be politically problematic to gain funding for such an endeavor. Apart from the difficulty of justifying the return on investment of such work, you further incur an opportunity cost in that time spent refactoring old code is time not spent developing new products and new features.
Habitability is the characteristic of source code that enables programmers, coders, bug-fixers, and people coming to the code later in its life to understand its construction and intentions and to change it comfortably and confidently. Habitability makes a place livable, like home. And this is what we want in software -- that developers feel at home, can place their hands on any item without having to think deeply about where it is. It's something like clarity, but clarity is too hard to come by.
Like Richard Gabriel, I prefer to aim for the more pragmatic "habitable code" rather than some perfectly abstracted ideal. And I admire Robert C Martin's homespun advice of "follow the boy scout rule and always leave the campground cleaner than you found it" because this simple rule gives hope to the maintenance programmer that things will improve in the future. I'm interested to hear of any tips you may have to motivate and make life more enjoyable for the maintainer of awful old legacy code.
Unit Testing Legacy Code
For many years, I've argued passionately for the many benefits of Test Driven Development:
- Improved interfaces and design. Writing a test first forces you to focus on the interface. Hard-to-test code is often hard to use. Simpler interfaces are easier to test. Functions that are encapsulated and easy to test are easy to reuse. Components that are easy to mock are usually more flexible and extensible. Testing components in isolation ensures they can be understood in isolation and promotes low coupling and high cohesion.
- Easier Maintenance. Regression tests are a safety net when making bug fixes. No tested component can break accidentally. No fixed bugs can recur. Essential when refactoring.
- Improved Technical Documentation. Well-written tests are a precise, up-to-date form of technical documentation.
- Debugging. Spend less time in crack-pipe debugging sessions.
- Automation. Easy to test code is easy to script.
- Improved Reliability and Security. How does the code handle bad input?
- Easier to verify the component with memory checking and other tools (e.g. valgrind).
- Improved Estimation. You've finished when all your tests pass. Your true rate of progress is more visible to others.
- Improved Bug Reports. When a bug comes in, write a new test for it and refer to the test from the bug report.
- Reduce time spent in System Testing.
- Improved test coverage. If tests aren't written early, they tend never to get written. Without the discipline of TDD, developers tend to move on to the next task before completing the tests for the current one.
- Psychological. Instant and positive feedback; especially important during long development projects.
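The test-first rhythm behind these benefits can be shown in miniature. Below is a sketch, in Python for the sake of a small runnable example (Test::More gives you the same flow in Perl), using a hypothetical slugify helper: the tests pin down the interface before the implementation exists.

```python
import re

def slugify(title):
    """Hypothetical helper: lower-case a title and join its words with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)

# These tests were (notionally) written first -- they define the interface
# and double as precise, up-to-date documentation of its behaviour.
def test_slugify():
    assert slugify("The Joy of Legacy Code") == "the-joy-of-legacy-code"
    assert slugify("Rewrite, or Refactor?") == "rewrite-or-refactor"
    assert slugify("") == ""
```

Note how the edge cases (punctuation, the empty title) were decided at the interface level, before any regex was written.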
So I was at first enthusiastic about the approach recommended by Michael Feathers in Working Effectively with Legacy Code, namely to (carefully) break dependencies and write a unit test each time you need to change legacy code, thus gradually improving the code quality while organically growing a valuable set of regression tests.
The Legacy Code Dilemma: When we change code, we should have tests in place. To put tests in place, we often have to change code.
-- Michael Feathers (p.16)
Feathers further catalogues a variety of dependency breaking techniques to minimize the risk of making the initial legacy code changes required to unit test.
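To make one such technique concrete, here is a small sketch (in Python, with hypothetical class and method names) of Feathers' "Subclass and Override Method": a legacy class whose method is welded to a real database is made testable by overriding just the offending method in a test-only subclass, leaving the logic under test untouched.

```python
class BillingReport:
    """Legacy class: fetch_rows() is hard-wired to a real database."""

    def fetch_rows(self):
        # In the real system this opens a live DB connection,
        # which makes the class untestable in isolation.
        raise RuntimeError("no database available")

    def total(self):
        # The logic we actually want to put under test.
        rows = self.fetch_rows()
        return sum(amount for _, amount in rows)

class TestableBillingReport(BillingReport):
    """Subclass and Override Method: sever the database dependency."""

    def fetch_rows(self):
        return [("widget", 10), ("gadget", 5)]
```

With the dependency broken, `TestableBillingReport().total()` exercises the summing logic directly, and a regression test can be grown around it before any real change is made.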
Though I've had modest success with this approach, there's one glaring omission in Feathers' book: how to deal with concurrency-related bugs in large, complex event-driven or multi-threaded legacy systems. Unit testing, by its nature, is not helpful in this all-too-common scenario. Overcoming this well-known limitation of unit testing ain't easy.
Unit Testing Concurrent Code
Test-driven development, a practice enabling developers to detect bugs early by incorporating unit testing into the development process, has become widespread, but it has only been effective for programs with a single thread of control. The order of operations in different threads is essentially non-deterministic, making it more complicated to reason about program properties in concurrent programs than in single-threaded programs.
See the "Testing Concurrent Software References" section below for more references in this active area of research. Though I haven't used any of these tools yet, I'd be interested to hear from folks who have, or who have general advice and tips on how to troubleshoot and fix complex concurrency-related bugs. In particular, I'm not aware of any Perl-based concurrent testing frameworks.
In practice, the most effective, if crude, method I've found for dealing with nasty concurrency bugs is good tracing code at just the right places, combined with understanding and reasoning about the legacy code, performing experiments, and "thinking like a detective".
One especially useful experiment (mentioned in Clean Code) is to add "jiggle points" at critical places in your concurrent code and have the jiggle point either do nothing, yield, or sleep for a short interval. There are more sophisticated tools available, for example IBM's ConTest, that use this approach to flush out bugs in concurrent code.
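A minimal sketch of the jiggle-point idea, in Python with hypothetical names: the jiggle point randomly does nothing, yields, or sleeps, perturbing thread scheduling at a critical read/write window in a deliberately racy counter. Because the jiggle is injectable, a test can also swap in a blocking jiggle to expose the lost-update bug deterministically rather than probabilistically.

```python
import random
import threading
import time

def random_jiggle():
    """A 'jiggle point': randomly do nothing, yield, or sleep briefly,
    perturbing thread scheduling to flush out latent races."""
    r = random.random()
    if r < 1 / 3:
        pass                                   # do nothing
    elif r < 2 / 3:
        time.sleep(0)                          # hint the scheduler to switch threads
    else:
        time.sleep(random.uniform(0, 0.002))   # short random sleep

class Counter:
    """A deliberately racy counter: the read and the write are not atomic."""

    def __init__(self):
        self.n = 0

    def increment(self, jiggle=lambda: None):
        v = self.n      # read
        jiggle()        # jiggle point in the critical read/write window
        self.n = v + 1  # write -- may clobber a concurrent update
```

Hammering `increment(jiggle=random_jiggle)` from a burst of threads makes lost updates far more likely to surface than running the bare code; tools like ConTest industrialise exactly this trick.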
Agile Architecture
In our ongoing "debate" on TDD, Bob and I have discovered that we agree that software architecture has an important place in development, though we likely have different visions of exactly what that means. Such quibbles are relatively unimportant, however, because we can take for granted that responsible professionals give some time to thinking and planning at the outset of a project. The late-1990s notions of design driven only by the tests and the code are long gone.
While Kent Beck's four rules of simple design, namely:
- Runs all the tests.
- Contains no duplication.
- Expresses all the design ideas that are in the system.
- Minimizes the number of entities such as classes, methods, functions, and the like.
are an excellent starting point, I find the following broader design guidelines (adapted from On Coding Standards and Code Reviews, linked below) equally valuable when refactoring:
- Learn from prior art. Use models and design patterns. Most designs should not be done from scratch. It's usually better to find an existing working system and use it as a starting model for a new design.
- Define sound conceptual models and domain abstractions. Unearth the key concepts/classes and their most fundamental relationships.
- Aim for balance. Avoid over-simplistic, brittle and inflexible designs. Avoid over-complicated bloated designs with too much flexibility and unneeded features. Be sufficient, not complete; it is easier to add a new feature than to remove a mis-feature.
- Plan to evolve the design over time.
- Design iteratively. Some experimentation is essential. Look for ways to eliminate ungainly parts of the design.
- Use a combination of bottom-up and top-down approaches.
- Apply Separation of Concerns and the Law of Demeter.
- Systems should be designed as a set of cohesive modules as loosely coupled as is reasonably feasible.
- Systems should be designed so that each component can be easily tested in isolation.
- When in doubt, or when the choice is arbitrary, follow the common standard practice or idiom.
- Avoid duplication (DRY).
- Declarative trumps imperative.
- Use descriptive, explanatory, consistent and regular names.
- Hide implementation details. Reflect the user mental model, not the implementation model.
- Reserve the best shortcuts for commonly used features (Huffman coding).
- Establish a rational error handling policy and follow it strictly. Document all errors in the user's dialect.
- Interfaces matter. Once an interface becomes widely used, changing it becomes practically impossible (just about anything else can be fixed in a later release).
- Design interfaces that are: consistent; easy to use correctly; hard to use incorrectly; easy to read, maintain and extend; clearly documented; appropriate to your audience.
- Apply the principle of least astonishment.
- Consider the design from the perspectives of: usability, simplicity, declarativeness, expressiveness, regularity, learnability, extensibility, customizability, testability, supportability, portability, efficiency, scalability, maintainability, interoperability, robustness, concurrency, error handling, security. Resolve any conflicts between perspectives based on requirements.
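As a tiny sketch of the "easy to use correctly, hard to use incorrectly" guideline above (in Python, with hypothetical names): replacing an ambiguous positional boolean with an explicit enumeration makes every call site self-explanatory and makes invalid arguments fail fast.

```python
from enum import Enum

class Overwrite(Enum):
    NEVER = "never"
    ALWAYS = "always"

# Error-prone interface: what does copy_file_v1(src, dst, True) mean
# at the call site? The reader must go look up the signature.
def copy_file_v1(src, dst, overwrite):
    return f"copy {src} -> {dst} (overwrite={overwrite})"

# Harder to misuse: the intent is spelled out at the call site,
# and a wrong argument type fails immediately and loudly.
def copy_file(src, dst, overwrite=Overwrite.NEVER):
    if not isinstance(overwrite, Overwrite):
        raise TypeError("overwrite must be an Overwrite member")
    return f"copy {src} -> {dst} (overwrite={overwrite.value})"
```

A call now reads `copy_file("a.txt", "b.txt", Overwrite.ALWAYS)`, which is unambiguous to both the writer and every later reader, applying the principle of least astonishment at the interface level.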
A project has many stakeholders, each making an investment (time, money, effort) into the project. Each will have different goals for the solution, and they may measure value differently. The Agile Architect's goal is to deliver a solution which best meets the needs and aspirations of all the stakeholders, recognising that this may sometimes mean a trade-off. The Agile Architect must work in a way that makes the best use of the various resources invested in the project.
The solution must be seen as part of a whole, which includes other systems and projects. It must be robust enough to be changed and extended over time. You must support further work, whether it is to change the solution or simply to operate it efficiently.
The cost of change is significant in any major real-world system, so the Agile Architect must balance planning for change against other goals. The Agile Architect must also seek to manage and minimise complexity, which helps to maximise stakeholder value. The aim is a solution which is neither simplistic and brittle, nor over-complicated by over-building for flexibility.
Schwaber's Legacy Core/Infrastructure Catastrophe
In a 2006 Google Tech Talk, Ken Schwaber stated that a chronic legacy core or infrastructure problem existed in every single organisation he had helped to implement Scrum.
Unfortunately as I've been helping organisations implement Scrum, I've run into a very common problem with every organisation. What these organizations have is a problem called Core or Infrastructure software. This core functionality has three characteristics:
- Fragile; if I changed one thing in that core piece of functionality, it tended to break other things.
- No good test harnesses around it. So if you went in and broke something, you tended not to know about it until it was up on all the servers and then your customers would let you know about it. That's not good.
- Only a few engineers know how to work on it. There were only a few suckers left in the entire company who still know how to and were willing to work on the infrastructure. Everyone else had fled to newer stuff.
Ken continued with a specific anecdote highlighting the strain this core architecture constraint puts on a Scrum cross-functional team:
I remember one company that has about 120 engineers, developers of all kinds of whom 10 are still able to work on the core functionality. The other 110 are working on new stuff. We brought all the engineers into the room. We said, okay, the product manager for the first area and the lead engineer for the first area come on up here. Now select the people you need to do this work over the next month, including, of course, the core engineers. And they did and we said, okay, now leave, get out of here and start working. ... when we got to the fifth product manager and the lead engineer and they said we can't do anything. There's no core engineers left. We looked around the room and there were 60 engineers left. They were thoroughly constrained by the core piece of functionality.
If you have enough money, you rebuild your core. If you don't have enough money and the competition is breathing down your neck you shift into another market or you sell your company. Venture capitalists are into this now, buying dead companies. Design-dead software.
This anecdote rings true with my experience; I've worked at many companies where the original authors of critical core software had long since left the company, few folks understood it, and no one dared touch it.
How Does it Happen?
Say you've got a velocity of 20. But product management want more stuff. And so, that's going to require, because that's more stuff, that's going to require that you have a velocity of 22 to do it. Well, gees, how are you going to get a velocity of 22? Are you going to be smarter when you wake up? Are you going to put in new engineering tools? No, none of that will work. So, what you'll actually do to get the increased velocity is of course cut quality, because if you remove quality, you can do more crap, right?
Now if you do this and that release goes out on time, some grumbles from the customers you know, whatever. But customers always grumble and the product manager is promoted, you know, drives a new BMW, parks in one of the fancy spots.
The next release that you start because you're working from a slightly worse code base with clever tricks in it, unrefactored code, no tests -- the best velocity you can really do is 18. Well, that's no good and no one's going to get promoted for that. So the product management team comes down and says, guys you just gotta do it. So you cut quality again but this time when you cut quality, the best you can do is 20 because you're starting from a worse code base. Now it takes about five years, release by release, for you right here to build your own design-dead product.
It's got two aspects to it. One is, when we are told to do more, we cut quality without telling a soul. It's just second nature. I have trained over 5500 people and put them through an exercise like this, but very subtle, very sneaky, where push comes to shove and they have a choice of saying, well, we can't do it, or saying we'll do it and cutting quality. Only 120 of the 5500 said no. All of the others just cut quality automatically. It's in our bones. The other part of this habit is product management, them believing in magic, that all they have to do is tell us to do something and, this is the illusion we support, by cutting quality, it'll get done.
And these are what's called good short-term tactics. These are horrible long-term strategies because it's a back-your-company-into-a-corner strategy.
While Ken's plausible explanation of how this happens spookily reminds me of some of my commercial experiences, there are doubtless other ways it can happen. After all, to the best of my knowledge, no Perl 5 pumpking has ever been offered a BMW as an inducement to get a release out early.
A Mythical Perl-based Commercial Company
For fun, and to better understand why this sort of thing happens, let's consider what might transpire if Perl 5 or Perl 6 formed the crucial core software of a commercial closed-source company writing customer-facing software in cross-functional Scrum teams. In this scenario, Perl is an internal tool; the customer doesn't know or care about it, they just want a system that satisfies their needs.
I speculate that most developers and product managers in such a mythical Perl 5-based company would go for the BMW by working on new pure Perl 5 products because their velocity would likely be an order of magnitude higher when writing new Perl 5 components than when changing the underlying Perl 5 C core. Not only that, but hiring expert C programmers with sufficient skill, intelligence, and tenacity to change the Perl core would likely prove to be a significant constraint. So I predict that in such a mythical commercial company, development of the Perl 5 C core would slow down, with only critical bug fixes applied.
Despite Ken Schwaber's dire predictions of "design-dead companies" rapidly going out of business, I see this company as commercially viable for quite a few years (though not indefinitely) because the Perl 5 C core is stable and proven, with very few critical bugs, and, most importantly, is well decoupled. That is, you can write new Perl 5 code without needing to understand anything of the Perl 5 implementation. And teams writing in Perl 5 are likely to be very competitive in the commercial marketplace when competing against companies writing in C, for instance. Such an approach, however, cannot be sustained in the long term, and sooner or later you'll need to untangle your legacy code or rewrite it.
Because Perl 6 is less mature and still evolving, the velocity of teams using it to deliver customer-focused software is likely to be much lower than for Perl 5 teams. That is, the team may be happily and productively writing new Perl 6 code ... then hit an impediment that requires them to switch context and add a new feature or make a bug fix to the Perl 6 core. Team context switches like this are very harmful to team velocity in my experience. This Perl 6 scenario is much closer to most commercial organizations today because their core software is typically incomplete and still evolving. Indeed, agile proponents encourage you to avoid the waste of writing customer software that is never used with slogans like Do the simplest thing that can possibly work and YAGNI.
In summary then, to circumvent Spolsky's "Netscape Rewrite Disaster" and sidestep Schwaber's "Legacy Core/Infrastructure Catastrophe", companies must continuously refactor to keep their core software in a clean and maintainable state. Such unrelenting and diligent work requires formidable discipline, however, and few companies have the long-term perspective and the will to do it.
Other Articles in This Series
- Nobody Expects the Agile Imposition (Part I): Meta Process
- Nobody Expects the Agile Imposition (Part II): The Office
- Nobody Expects the Agile Imposition (Part III): People
- Nobody Expects the Agile Imposition (Part IV): Teamwork
- Nobody Expects the Agile Imposition (Part V): Meetings
- Nobody Expects the Agile Imposition (Part VII): Metrics
- Nobody Expects the Agile Imposition (Part VIII): Software Craftsmanship
- Nobody Expects the Agile Imposition (Part IX): Culture
- Ken Schwaber, Google tech talk on Scrum, Sep 5, 2006
- The Tower of Perl blog by Simon Cozens
- Nicholas Clark comparing perl's internals to Jenga
- Joel Spolsky on not rewriting
- The early history of Perlmonks
- Site seems slow. Is this normal?
- Site facelift?
- Perl 5 interpreter
- On Coding Standards and Code Reviews
- On Interfaces and APIs
Agile Architecture References
- Lean architecture by James Coplien and Gertrud Bjornvig
- Agile Architecture by Scott Ambler
- The Agile Architect
- Agile Modeling
- Agile Architecture podcast by Grady Booch
Legacy Code References
- Swallowing an elephant in 10 easy steps
- Dealing with sloppy code
- Becoming familiar with a too-big codebase?
- Analyzing large Perl code base.
- Understanding Chaos
- OT: Rewrite or Refactor?
- Strategies for maintenance of horrible code?
- What is the best way to add tests to existing code?
- characterization tests
- Working Effectively with Legacy Code by Michael Feathers
- Clean Code by Robert C Martin et al
- Perl Medic: Transforming Legacy Code by Peter J. Scott
- Object-oriented Reengineering Patterns book now available as a free download
- Software archaeology
- Legacy system
Testing Concurrent Software References
- Testing Concurrent Programs
- Chess: Tool for finding bugs in concurrent programs
- stack overflow question on testing concurrent software
- Intel Parallel Inspector
- Java Concurrency in Practice
Updated 23-jan-2011: Removed reference to Windows NT rewrite plus minor wording improvements.