http://www.perlmonks.org?node_id=822315

EDIT: Added additional example of why "good idea" upgrades may lead to bad results (thanks, crashtest).

EDIT: Added section on blind upgrading as bad practice (thanks, chromatic).

EDIT: Final clarification on making it usable vs. compensating for negligence, and on trust.

Introduction

This started as a reply to WWW::MECHANIZE Get Errors Ending Program Early? but I realized that it was better moved here. First, this is not a criticism of WWW::Mechanize, nor of its many contributors, nor of Andy Lester, its original author. WWW::Mechanize is a great module and truly valuable. This is a meditation on good programming support practices for introducing necessary yet incompatible changes.

The story

I love WWW::Mechanize to bits. It's immensely useful. It makes things so easy to do. But it has one wart in its release history, and that is the change that made autocheck on by default.

For those not familiar with the history of this module, the original version required you to check the status of your calls: you would need to say

    my $mech = WWW::Mechanize->new();
    $mech->get("http://my.server.com/my/path");
    if (! $mech->success()) {
        # attempt to recover
    }
This was the default for many releases. It was in keeping with many of Perl's APIs: if you do something that might fail, it's your job to check it.

As time went on there was a good cultural shift in the Perl community: from "here's the gun, try to aim away from your foot" to "programmers shouldn't have to remember to keep themselves out of trouble if we can do it automatically". If there is trouble and the programmer isn't checking, you need to do something blatantly obvious to either signal the failure or encourage the programmer to catch the error. This is a great idea, and a great goal. Programmers need all the assistance their support code can give them. Letting me know I have a problem is a good thing.

However, there are two parts to this. One is providing tools which help out this way - this is perhaps the simpler part of the problem. The other, and more difficult, part is getting these tools into the hands of programmers in a non-disruptive manner. The 1.49_01 release of WWW::Mechanize is an example of doing the first part very well, but not the second.

For release 1.49_01, the Changes file says

THINGS THAT MAY BREAK YOUR CODE
The autocheck argument to the constructor is now ON by default, unless WWW::Mechanize is being subclassed. There are so many new programmers whose ->get() calls fail unchecked that I'm now putting on the seat belts for them.
An excellent idea, and a step in a good direction. Yes, it is classed under "THINGS THAT MAY BREAK YOUR CODE". However, and this is important: 1.49_01 was a point release. Generally speaking, point releases introduce small improvements and backward-compatible changes; developers expect to be able to install them without worrying about things breaking.

This was a good change, in that it worked right, and it added significant value ... for new WWW::Mechanize users. For people already using WWW::Mechanize at $PREVIOUS_WORKPLACE, it was a serious problem.

What happened?

Code that had been working properly for several years prior to this change suddenly broke. Remember that WWW::Mechanize first shipped in 2002, and this change happened in 2008. This was the first incompatible change that did not announce itself with a real Perl error if you ignored it. Worse, the way in which it broke did not point out what the problem was. It was diagnosed accurately (as in, what happened was shown properly):
Error GETing http://your-uri: reason
but not completely: there was no indication of why the failure was happening now, or of how it needed to be fixed. It required careful reading of the man page to spot that you had to change your call to new() to turn off autocheck (unless you were subclassing, in which case things worked the way they had before). Many folks using this module had it installed by someone else and never saw the Changes file that pointed out the potential problem.
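
For reference, the fix itself was a one-line change to the constructor call; the hard part was discovering that you needed it. Something like this (using the same placeholder URL as above) restores the pre-1.49_01 behavior:

    my $mech = WWW::Mechanize->new( autocheck => 0 );  # do not die on failed requests
    $mech->get("http://my.server.com/my/path");
    if (! $mech->success()) {
        # attempt to recover, as before
    }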

For us, there was a period of many months in which we'd get yet another phone call saying "my XXX script that gets stuff from the Web is broken" and we'd have to explain again about autocheck and how to fix it.

Why was this a surprise?

Perhaps we'd become complacent. Previous changes to WWW::Mechanize that might break code generally either made mandatory something you should have been doing anyway, or removed deprecated features that were already generating deprecation warnings.

One change, in handling headers, was made in the switch from the 0.xx versions to 1.00. It required code fixes unrelated to deprecations, but because of the major version bump it was not unexpected.

All of these produced errors that made diagnosing the failure pretty simple: no such method, or no such variable (in the case of the header change).

Automatic autocheck was a surprise and a problem because

  1. there were no deprecation warnings leading up to the change: warn "a future version of WWW::Mechanize will require you to turn off automatic status checks in new()", for example.
  2. there was no major version bump: the last release without autocheck on was 1.34, and the first release with it on was 1.49_01. Note that the last API-breaking change of similar scope was the 1.00 release in 2004. This led people to expect that this would be just another release with no significant functional changes.
  3. the resulting error message did not make it clear what the problem was, nor what was needed to fix it.
  4. the documentation did not emphasize that there was a change that would cause a common usage of the module to fail, and that a specific change was needed to prevent this.
  5. there was no easy way to globally restore the old behavior; the only option was "scan your code, find the calls to new(), and fix them".

How could it have been better?

There are several possible ways this could have worked better, any one of which would have made the change less disruptive. Doing more than one, or even all, of them would have been better still: the more ways you have of preventing a problem, the better.

So why wasn't one of these chosen? There are a number of possible reasons. Sometimes people just make mistakes. Sometimes they forget that others don't have the benefit of their point of view - if you're developing code that you know has had a particular change, you tend to forget that other people weren't there when you made it. It's very difficult sometimes to remember that other people who'll be using your code don't work in the same context that you do, and that you need to make extra efforts to ensure that everyone's starting at the same point when thinking about a problem.

What to learn

If you're working closely with code, it's easy to forget that you know that there's been an incompatible change, that you know how to spot it and fix it, and that you know where you documented it. For someone who's not familiar with the change, all this may be mysterious, inexplicable, and hard to spot. You need to think ahead about your user base and what they'll need in order to get up to speed with your change as effortlessly as possible.

Make sure that all of your documentation emphasizes the change: changelog, manpage, checkin comments, error messages (yes, error messages are documentation too!).
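
For instance (the exact wording here is invented, not what the module actually printed), the failure message itself could have carried the documentation:

    Error GETing http://your-uri: reason
    (autocheck is now on by default as of 1.49_01; pass autocheck => 0
    to new() to restore the old behavior)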

Put the upcoming change out on your blog, on Twitter, on Perlmonks, on any relevant mailing list. Do it more than once.

Do everything you can to spot incompatible code in your API and diagnose the error as early as possible (it's better to die in new() than to wait until an error condition arises much later because something wasn't done up front).
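
A minimal sketch of that idea in Perl; the module and option names here are invented purely for illustration:

    package My::Module;
    use strict;
    use warnings;

    sub new {
        my ($class, %args) = @_;

        # Diagnose the incompatible usage here, at construction time, where
        # the fix is obvious, instead of letting it surface much later as a
        # puzzling runtime failure.
        if (exists $args{old_option}) {
            die "My::Module: 'old_option' was replaced by 'new_option' in 2.00;"
              . " see the Changes file for how to update your call to new()\n";
        }

        return bless { %args }, $class;
    }

    1;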

Provide a backdoor such as an environment variable to re-enable the old behavior - it's perfectly reasonable to make this something that emphasizes that the code needs to be fixed rather than worked around:

if ($ENV{WWW_MECH_IGNORING_AUTOCHECK_REQUIREMENT}) { ... }
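
One hypothetical way the constructor could honor that variable (the warning text is illustrative; the point is that the fallback nags you toward fixing the code rather than silently papering over it):

    # Inside new(), before applying the new default (sketch only):
    if ($ENV{WWW_MECH_IGNORING_AUTOCHECK_REQUIREMENT}) {
        warn "autocheck left off because WWW_MECH_IGNORING_AUTOCHECK_REQUIREMENT"
           . " is set; please pass autocheck => 0 to new() instead\n";
        $args{autocheck} = 0 unless exists $args{autocheck};
    }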

In short, make it obvious that there's a problem and what it is, and make it obvious what needs to be done to fix it both short-term and long-term; if you can, give people a way to fall back to old behavior temporarily while they get changes in place for the new.

In the long run, making the extra effort to help your users avoid problems will generate a lot of good will, save time, and do a lot for your reputation as a careful and just plain good programmer.

Remember: think before you unconditionally make "good" old code break. If there's no way around it (you've found a really bad bug, or there's a security issue associated with "the standard way" of using your module), you may have to do it, but you absolutely have to communicate that old code will unconditionally break, and show how to fix it.

"Well, just upgrading is stupid."

Testing your code and being sure you haven't broken it by upgrading something - absolutely the right thing to do. But this doesn't always happen: what if your ISP decides they need feature X in the new WWW::Mechanize so they upgrade? What if it's late, or you're in a hurry, or distracted, or pressured ... or yeah, just not as smart as you might be? Does this mean you should be on your own if your software ends up broken?

No. As a developer, I have a responsibility to make it easy for a reasonable person to extricate him- or herself from a problem if I can. Everyone has their own version of "reasonable", but mine is the programmer called out of bed at 2AM. Most people are at about half their normal competence when they've just woken up. I prefer to engineer my stuff to be safe in this situation so that I can't do something stupid to myself. This maybe isn't so much altruism as self-interest. As a developer who wants to write code that is more usable (in the Don Norman sense), this is the difference between "You got this error and here's why and what will fix it" vs. "You got this error" (good luck finding out why and how to fix it) or, worse, "You got an error" (good luck figuring out what, why, and how to fix it).

It's also about trust. If it's obvious that effort went in to making the code usable, then the code feels, and is, more trustworthy. The person using the code thinks, "If I make an error, I can trust this code to tell me what it is and at least give me an idea how to fix it." This is what you want: trustable and usable code. Not code which cossets people to the point that it tries to fix every possible thing that could be wrong, but code that makes the small extra effort to be usable.

Final thoughts

A final example to consider: let's say that your favorite Linux distro decided that there was a fancy new tape driver that was loads more dependable and faster, but which made all your old backups unreadable. Would you want the newest version of the OS to install that without telling you in no uncertain terms that upgrading would make all your tapes unreadable? Would you feel there was a problem with the usability of the code? Would you trust it less?

If, on the other hand, there was a compatibility mode that would let you read your old tapes, but would only let you write new ones in the improved and upgraded format, you'd feel you could trust the person who made the decision, because he or she chose not to make valuable data inaccessible short of a big investment in time and effort.

Plaudits

I love WWW::Mechanize. I have nothing but the greatest respect for all of its contributors. You've saved me a lot of time and work and I'm grateful.

I hope that you'll excuse my using the consequences of this one choice as an opportunity to talk about good code support practices.