When you deploy a production change, you usually have a rollback procedure – documented, even!
But sometimes, after a deploy, things don’t work exactly as expected. That’s when you have to decide which is better: rollback, or fail forward?
If it’s a small, isolated code change and you can operate with the old version, it’s often an easy decision to roll back.
But in more complex situations, it’s sometimes better to accept that the change wasn’t perfect, but:
- is still an improvement overall, and gets you closer to your goal
- or is no worse than the previous situation, but is a step in the right direction
- or moves you toward a new target architecture that can’t be simulated in QA or staging for reasons of time, cost or complexity
- or serves as a commonly understood stake in the ground, or anchor point: “Now that we’re here, we can see the right direction!”
and keep the change and fail forward instead of rolling back.
Generally, to know whether failing forward is an option, you need:
- enough personal and organizational responsibility to accept the risks and handle the consequences
- a clear understanding of the overall IT systems and IT risks
- a clear understanding of the overall business systems and business risks
- staff available to verify the change and fix the small issues that arise
- a good time for the change, picked to minimize stress and risk
- monitoring and application logging tools to evaluate the situation (I’d even suggest rounding out your tools inventory beforehand if failing forward is new to you); a minimal post-deploy check is sketched after this list
- agreement, communicated in advance, that you may fail forward if necessary, based on a calculated rather than reckless risk assessment, and that rollback is still an option
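To make the tooling point concrete, here is a minimal sketch of the kind of post-deploy check I have in mind. The endpoints, names and thresholds are hypothetical placeholders, not from any particular system; the idea is simply to have a fast, scripted answer to “is the new state healthy enough to keep?”

```python
#!/usr/bin/env python3
"""Minimal post-deploy smoke check (illustrative only).

The endpoints and latency budget below are hypothetical; substitute the
health checks that matter for your own application."""

import sys
import time
import urllib.request

# Hypothetical health endpoints for the services touched by the deploy.
CHECKS = [
    ("app health", "https://app.example.com/healthz"),
    ("api health", "https://api.example.com/healthz"),
]

MAX_LATENCY_SECONDS = 2.0  # illustrative threshold; tune to your own SLOs


def check(name: str, url: str) -> bool:
    """Return True if the endpoint answers 200 within the latency budget."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=MAX_LATENCY_SECONDS) as resp:
            elapsed = time.monotonic() - start
            ok = resp.status == 200 and elapsed <= MAX_LATENCY_SECONDS
            print(f"{name}: status={resp.status} latency={elapsed:.2f}s")
            return ok
    except Exception as exc:  # timeouts, DNS failures, connection refused, ...
        print(f"{name}: FAILED ({exc})")
        return False


if __name__ == "__main__":
    results = [check(name, url) for name, url in CHECKS]
    # A non-zero exit gives the deploy pipeline (or the human watching it)
    # a clear signal; deciding to roll back or fail forward is still up to you.
    sys.exit(0 if all(results) else 1)
```

The script itself isn’t the point; what matters is having an agreed-upon, fast way to answer “how bad is it, really?” before committing to either direction.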
Some actual examples of when I have failed forward successfully:
- firewall rule changes that were closer to the final goal, but broke a couple of servers temporarily.
- database schema changes that were correct, but required a day or two of minor internal application updates that were not in the original QA test plan (a sketch of this kind of additive change follows below).
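To make the schema example more concrete, here is a minimal sketch (using SQLite and a hypothetical orders table, not the actual system) of the purely additive kind of change that tends to be safe to fix forward: old code keeps working unchanged, and only the code that wants the new column needs a small follow-up update.

```python
import sqlite3

# Hypothetical, simplified stand-in for a production schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL NOT NULL)")

# The deploy: add a new column with a default, so existing INSERTs
# (which don't mention it) continue to succeed unchanged.
conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT NOT NULL DEFAULT 'USD'")

# Old code path, untouched by the schema change, still works:
conn.execute("INSERT INTO orders (total) VALUES (19.99)")

# New code path: the minor internal application update that can follow
# over the next day or two instead of blocking the deploy.
conn.execute("INSERT INTO orders (total, currency) VALUES (42.00, 'EUR')")

for row in conn.execute("SELECT id, total, currency FROM orders"):
    print(row)
```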
Some actual examples of when fail forward was not acceptable, and rollback was required:
- an upgrade from httpd 2.0 to 2.4 that turned out to require significant re-QA and updates to the deploy process
- database schema changes that were correct, but required a major application rebuild and re-QA, totalling more than 3 hours of downtime
- changes that affected legacy applications with no budget for developers or QA.
Especially with databases, the arrow of time cannot be reversed: a restore on a busy system loses whatever was written after the restore point, which is why fail forward is the default policy at many SaaS companies. Failing forward also helps with development velocity.