You release every 4 to 6 months? Are your responsibilities unclear? Does each release bring anxiety? Is it always late? Does it involve a lot of manual work? Does it break production and require bug fixes? Maybe even rollbacks? Are stakeholders frustrated? Are users angry?

Welcome to the «Release Hell Show»!

The more frightened you are about a release, the bigger it will be and the more chances it has to cause problems. This will lead to even more concern for the next release. This vicious circle can be a nightmare for development teams.

The good news is that you can escape this “Release Hell Show” without any big bang! By following a simple and structured approach, you can transform your releases into reliable and stress-free events. This post will guide you through the process, helping you regain stakeholders’ trust and measure your successes.

We’ll take the example of a project I worked on recently. When I arrived on this mission as a freelancer, three developers had just resigned. The only remaining developer didn’t know much about the release process. My first task was to release version 1.27, which had been in development for the previous four months. I didn’t know anything about the project, and there were many problems and challenges, yet I managed to achieve continuous delivery within a few months.

I divided the method I used into three parts:

  1. document: assess the current process and identify the problems
  2. simplify: make my life easier as a “release manager”
  3. automatize: aim to remove every manual action

Ready to escape the «Release Hell Show»? Let’s dive in!

What’s a release?

Before detailing every process stage, let’s ask ourselves, «What’s a release?» It may seem like a stupid question, but try asking the people you work with. You may be surprised by their answers.

The 1.27 example

Let’s get back to our example. When I deployed version 1.27 to production, everything broke! Everything! I rolled back to the previous version, 1.26.5. And it kept crashing… I had been there for a week, and I had smashed the production environment! I promise I’ve never been so scared in my career. We managed to fix the problem after several hours of debugging, sweat, and fear. We attempted to release 1.27 three more times before succeeding. Each attempt brought the same amount of debugging, sweat, and fear, and each one revealed a new problem. Four rollbacks for one release is still a record for me. I hope I’ll never beat it.

I won’t detail the resolutions here; that’s not important. What matters is understanding the root causes of the problems:

  • somebody had previously pushed some hotfixes to production, but the code was never merged back into the v1.27 branch
  • same for some SQL hotfixes
  • a release A could be pushed to staging, and then a version B would be created. Versions A + B would then be pushed directly to production without knowing whether version A worked in production
  • a release could be pushed to production without any matching tag
  • a release A could be pushed to production, a bug could be detected and then fixed, but the new version was still called “release A”

These problems all came down to the definition of a release. For many, a release was just a “number” containing features or bug fixes. “We need to release the 1.27”. “We fixed a bug; we need to release the 1.27 again”.

A Release Definition

For me, a release is a frozen state of the application at a given time. If I had to imagine a formula for a release, it would be:

release = version number + tag + artifact + deployment

The version number is a unique and consistent number identifying what you release. You can use semantic versioning, a build number, or whatever you want, as long as it is consistent.

The tag is a way to identify the code related to your version number. It can be a git tag or a commit hash. I like to use a git tag that matches the version number perfectly.

The artifact is what you’ll deploy on your environments. It can be a Docker image or even an archive file (ideally, we don’t want an archive, but sometimes we don’t have the choice). The artifact is immutable and must contain everything that’s needed to deploy. Ideally, we want it to include at least the code ready to run, the database migrations, the infrastructure code, the documentation, the assets, and the changelog.

The deployment implies you tried to push your release to every environment you need. You can’t skip staging if you have such an environment. And you can’t consider the release done until it has reached production.

In a perfect world, the tag and the artifact should relate to both production and infrastructure code. But when development and ops teams are separated, it can be difficult to make them match. Do your best on that matter. Care only about what you own for now (we’ll get back to that later).

Using such a definition implies:

  • if you detect and fix a bug, it’s a new release
  • if you add a new feature, it’s a new release
  • if you want to push a tiny technical improvement, it’s a new release
  • if you change a tiny environment configuration, it’s a new release
  • if you update the documentation, it’s a new release
  • if you detect a bug in the staging environment and you need to fix it before going to production, it’s a new release

You get the idea… Any tiny change you make after building your artifact results in a new release.
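To make the definition concrete, here is a minimal shell sketch of “cutting a release”: a version number, a matching git tag, and a tar archive as the artifact. Everything here (the toy repository, file names, version 1.27.0) is illustrative, not this project’s actual setup.

```shell
#!/usr/bin/env sh
# Sketch: version number + tag + artifact, produced together and consistent.
set -eu

VERSION="1.27.0"                  # the version number
WORKDIR=$(mktemp -d)
cd "$WORKDIR"

# A toy repository standing in for the real project.
git init -q .
mkdir -p src migrations docs
echo 'print("hello")' > src/app.py
echo 'CREATE TABLE users (id INT);' > migrations/001_users.sql
echo "## v$VERSION" > docs/CHANGELOG.md
git add .
git -c user.email=ci@example.com -c user.name=ci commit -qm "release $VERSION"

# The tag matches the version number exactly.
git tag "v$VERSION"

# The artifact is immutable and contains everything needed to deploy:
# code, migrations, documentation, changelog.
tar -czf "release-$VERSION.tar.gz" src migrations docs
ls "release-$VERSION.tar.gz"
```

The deployment part is intentionally left out; the point is only that the version number, the tag, and the artifact are produced together and stay consistent.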

A Release Occurrence

As a DevOps advocate, I always aim for continuous deployment. But let’s be honest:

  1. it requires a level of maturity that only a few teams can afford
  2. not every company really needs to push to production on every commit
  3. what is exciting and common sense for us may be frightening for many other people

Talk to the developers, the product team, and the sales team, and agree on a release schedule. What would be the ideal release cadence? What would be acceptable?

In my project, we aimed to release every week. Going from one release every five months to once a week is a significant improvement. Once this milestone is achieved, continuous deployment is no longer an impossible dream. It’s an enhancement within easy reach.

By the way, also be sure to define the differences between continuous integration, continuous delivery, and continuous deployment. You’ll avoid future misunderstandings.

Destination vs path

“Everything you presented is nice, Julien, but the release process is so time-consuming and complex that what you are requesting is impossible :/”

Indeed, with the current process, I agree. Releasing every week was impossible when I arrived on that project.

My definition of a release and its occurrence are the goals to reach. Now, let’s define a clear path to this destination.

Stage 1 - Document

The first step is to document the current release process.

Hmmm, why?

«Why the hell should we document such a mess we want to get rid of?» you may ask. Oddly, the more complex and broken the release process is, the more useful this documentation becomes. Despite being the most annoying task of the whole process, this release guide will be the foundation for smooth, reliable, and automatic releases. It will allow us to identify the people involved and their roles, the tasks to perform, and the hot spots; to gain and share knowledge; and to get a global vision.

More importantly, this document must be used as a communication weapon. It must inform the stakeholders of the extent of the problem. Keeping the stakeholders informed is a way to gain time, resources, and peace.

What to document?

This document must contain every task or dependency performed by anybody in the team, from the moment you want to release until the code is properly working in production. Don’t try to change or improve the process right now. Assess what’s really happening. Talk to everybody involved and collect information about the current process. If you’re not in charge of the releases, ask to sit in on one and note everything that’s going on.

Identify the critical stages and mark every action required for each stage. There is no such thing as “it’s obvious”. Document everything that may seem obvious. Document everything that may seem useless because it’s so tiny. Follow a chronological order; for each action, detail who is responsible for it and its type.

Three types of actions are worth distinguishing. Waiting times or dependencies on third parties are signs of a complex process we’ll have to simplify. Manual tasks are what we’ll want to automatize the most in the near future. Automated tasks may be the parts we’ll be able to rely on soon.

I tag each action with an emoji indicating its type. Here are some examples from the project I worked on lately:

  • For a manual task, I use 👷. For instance, “A ticket must be created for a release 👷 and an additional ticket must be created in case of a SQL migration 👷.”
  • An automated task is represented by a 🤖. For instance, “Once the release branch is merged, an RPM package is created by the CI job 🤖.”
  • Finally, 🕐 represents a waiting time or a dependency. For instance, “To do this, we must create a merge request 👷, request a review, and get approval from another developer 🕐.”

Once you have listed every task, add the following at the top of the document:

  • a workload summary to explain the extent of the damage
  • the general workflow, so that everybody in the teams understands what a release really is
  • optionally, a section recalling all the problems this release hell triggers (or a link to a document listing them)

To give you an idea of the scale: the release guide for the last project I worked on ran to 10 pages, starting exactly like this.

Now share…

It’s time to celebrate! You’ve completed the first step of a significant improvement. Share this document with your team and get their feedback. Maybe they’ll think of something you forgot. Many people will probably discover what releasing really means.

Also share this document with the stakeholders. When I shared it with the CPO, he told me:

‒ I knew it was a complex process, but I didn’t think it was that difficult and time-consuming. Thanks, now I understand why the release process is so problematic.

Congratulations, you’ve earned time, and you’ve started to build trust.

…and release!

Once the release guide is kind of “official,” follow it strictly for a couple of releases. Add any missing steps and clarify whatever seems fuzzy. Could somebody else perform a release by following your guide? The goal is to get a global vision of the process and to identify the hot spots.

Don’t wait for the next big occasion to release. For instance, if a bug fix is merged, take the opportunity to release a new version. And yes, the first few releases will take time. But be patient; soon, we’ll be able to enhance the process.

What about the rollback?

Often, when releasing is hell, rolling back follows the same road. Be sure to write a rollback guide that is as precise as your release guide. Follow the same procedure and detail every task and the people involved.

I didn’t have such a document when I had to perform my first production rollback on this project. I didn’t make that mistake twice. The serenity this guide brings is well worth the hour it takes to write.

An interesting section to add to this document is “When to roll back?” Sometimes it’s not obvious that the system isn’t working as expected, especially if you lack monitoring and data, which happens often in legacy situations. So be sure to ask questions about that. It will also trigger interesting discussions about how to verify that a release went well, which you can add to the release guide.

By the way, monitoring and alerting are outside the scope of this blog post, but of course, they are crucial when talking about releasing. If they are missing, be sure to add them as early as possible.

Stage 2 - Simplify

Now that we have a clear understanding and a global vision of what’s happening, it’s time for the fun part: enhancing the process! We don’t want to automate yet; we just need to improve the release manager’s life.

Each new release must now bring an enhancement: no need to change everything at once, and no need to spend days or weeks on the topic. We’ll improve the process step by step. Each enhancement must reduce the time to release, allowing you to release more often.

Ditch the useless

If your release process has been around for years or has been modified by many different people, there is a high chance that many steps are useless. Maybe they were useful at some point, but what about now? Once again, ask questions. If nobody knows what a step is about, try to remove it during the next release and see what happens. Do that for a couple of releases.

For instance, for each deployment, we had to create an Excel sheet listing the manually tested endpoints. It had been like that for years. Why? Nobody knew anymore, and nobody was using this file, apparently. So, I simply stopped creating this listing. And guess what? Nobody noticed. Nobody complained. Perfect: one less tedious task to do.

Enhance step by step

With a working release process containing only what’s useful, we can finally consider improving it. What task takes the most time? Or what operation is the most dangerous? How can you enhance it? Can you drastically simplify the workflow by doing something differently?

Pick just one task, simplify it, update the documentation, and release by following your updated guide. Once done, repeat those questions and improve another step. Do this until the whole process is now made of a few simple steps.

For instance, SQL migration management was the most tedious part for me. I won’t go into details, but roughly:

  1. The SQL scripts used during deployment were not exactly the same as those in the pull requests. The release manager had to rewrite all the SQL scripts (migrations and rollbacks) for each new version.
  2. Those migration scripts had to be tested manually in the testing environment, and the rollback scripts were never tested.
  3. Those scripts had to be uploaded as an archive in the deployment ticket, and this archive followed a weird and complex format.

Fixing these problems involved a lot of small changes, like:
  • making sure the CI uses the same database schema as the real environments
  • helping the developers to write proper SQL migrations
  • making sure developers include those migrations in their merge requests
  • making sure both the local and CI environments use those migration files
  • making sure a developer can’t merge if a migration or rollback is missing
  • writing a bash script to archive the SQL scripts properly instead of doing it manually
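As an illustration of the “can’t merge if a rollback is missing” gate, here is a hedged sketch of such a CI check, assuming migrations and rollbacks live in two mirrored directories (the layout and file names are assumptions, not this project’s actual ones):

```shell
#!/usr/bin/env sh
# Sketch of a CI gate: fail if any migration lacks a matching rollback.
set -eu

# Toy layout standing in for the real repository.
WORKDIR=$(mktemp -d); cd "$WORKDIR"
mkdir -p migrations rollbacks
echo 'ALTER TABLE users ADD COLUMN email TEXT;' > migrations/002_add_email.sql
echo 'ALTER TABLE users DROP COLUMN email;'     > rollbacks/002_add_email.sql

missing=0
for m in migrations/*.sql; do
    name=$(basename "$m")
    if [ ! -f "rollbacks/$name" ]; then
        echo "Missing rollback for $name" >&2
        missing=1
    fi
done

if [ "$missing" -eq 0 ]; then
    echo "All migrations have rollbacks"
fi
```

Wired into the merge pipeline, a check like this turns “the rollback scripts were never tested” into at least “the rollback scripts always exist”.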

Another example is the testing of a fresh deployment. Previously, a developer had to manually test half a dozen endpoints to ensure the environment worked correctly. We talked with the QA engineers, and they kindly agreed to build a Postman test collection. Now, with one click, we knew whether the deployment was successful. In a perfect world, we wouldn’t launch those tests manually, but for the moment, it’s a small step that simplifies our lives.
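The same one-command verification can be approximated without Postman. Here is a hedged curl-based smoke-test sketch; the endpoint list and base URL are assumptions, and the demo starts a throwaway local server only so it is self-contained:

```shell
#!/usr/bin/env sh
# Minimal smoke test: every listed endpoint must answer HTTP 200.
# In real life BASE_URL would point at the freshly deployed environment;
# here we spin up a throwaway local server for the demo.
set -eu
python3 -m http.server 8099 >/dev/null 2>&1 &
SERVER=$!
sleep 1
BASE_URL="http://127.0.0.1:8099"

check_endpoint() {
    status=$(curl -s -o /dev/null -w '%{http_code}' "$1")
    [ "$status" = "200" ]
}

failures=0
for path in "/"; do
    if check_endpoint "$BASE_URL$path"; then
        echo "OK   $path"
    else
        echo "FAIL $path"
        failures=$((failures + 1))
    fi
done

kill "$SERVER"
[ "$failures" -eq 0 ] && echo "Deployment looks healthy"
```

A script like this, run right after deployment, plays the same role as the Postman collection: one command, one clear pass/fail answer.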

Each of these small enhancements took a maximum of a day to complete. We didn’t try to do everything at once. We accepted our imperfect bricks, knowing that we’d enhance them in the coming weeks.

Choose your battles

There are many tasks you’ll come to consider useless along the road. For instance, uploading an archive containing the migration scripts to the deployment ticket, or even opening a ticket to perform a deployment… Seriously, what for? I had no idea, even after asking questions. In my case, it was related to an external team.

If you can enhance the whole process, do it. If it’s too difficult to change what’s outside your own perimeter, so be it: accept the quirks. Just try to reduce their impact on your work.

As developers, we can often be fully autonomous over the delivery. But sometimes, we can’t change how our software is deployed. Achieving continuous delivery is already a beautiful accomplishment. Be proud of it!

Don’t look for perfection

Many of the steps we improve won’t be perfect. Some may be temporary. Others may even be deleted once we reach the ideal workflow we have in mind. But that’s OK. If you have something better than the current state, or that goes in the right direction, merge it. Don’t wait for perfection. Improving with each release is a sign you’re going in the right direction. That’s all we need. A “little better now” is better than a “maybe perfect later.”

Communicate regularly

You’ve cut the manual operations from 43 to 29. Nice! The release preparation now takes 2 days instead of 3. Great! You’ve performed 3 releases in a row without any problem. Awesome!

When you make significant progress, inform the stakeholders. Show your results and explain the pain points you’ve solved without getting technical. Building trust is essential.

Stage 3 - Automatize

After a series of small, continuous improvements, our release process is much simpler. It still requires a human to follow a procedure and launch some scripts, but most of the boring manual tasks have been eradicated. It’s time to automatize and reach our goal: ditch the procedure, achieve continuous delivery first, and then continuous deployment.

Talk to the ops team first

If the people responsible for deploying are on another team, include them in refactoring the delivery and deployment automation. If that’s the case, it means your team is only responsible for the delivery; thus, you have to define a contract with the ops team. In particular:

  • who is responsible for what
  • what the delivery artifact is
  • where it should be stored
  • how they should ideally be notified when there is something to deploy
  • how the SQL migrations are launched
  • how to inform them when an environment configuration has to change

All those questions can occur long before starting to automate, but they definitely need to be answered now.

Talking to the ops team helped simplify the workflow and helped me understand some quirks. We agreed that:

  • the dev team is responsible for providing the release artifact, while the ops team is responsible for deploying it; both teams ensure the release works as expected once deployed
  • we’d keep the current RPM package as the release artifact for now (an infrastructure change was planned later, which would allow us to deliver a Docker image)
  • we’d store it in a GitLab release, with all the other assets required (such as the migration scripts)
  • we’d define a weekly delivery slot instead of opening deployment requests 2 weeks in advance
  • we’d keep the manual launch of SQL migrations for the moment, though we took the opportunity to present the automatic tool we’d like to use and its benefits

Again, it’s not an ideal workflow, but it moves things forward. That’s what matters most.

Achieving continuous delivery

Now that we know what to deliver and where, we can have fun with the CI workflows! Our goal is to build the complete package needed to deploy in one click (or one command).

The first thing to determine is what triggers the build of the package. Do you want a human to click a button somewhere? Do you want to build the package whenever a merge request is merged into the main branch? On our side, we decided to trigger the CI workflow on the manual creation of a tag on the main branch. I’m in favor of automatizing everything (even the creation of the tag or the changelog update), but again, it’s more than OK to start small and enhance the process iteratively. Show that small automated workflows work, and build trust.
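The manual trigger we settled on boils down to pushing a tag that the CI watches. A sketch of that gesture, demonstrated against a throwaway local repository so it can run anywhere (the version number and remote are illustrative):

```shell
#!/usr/bin/env sh
# Sketch: the release manager pushes a tag; the CI pipeline fires on it.
# A local bare repository stands in for the real "origin".
set -eu
WORKDIR=$(mktemp -d); cd "$WORKDIR"
git init -q --bare origin.git
git clone -q origin.git repo
cd repo

echo "app code" > app.txt
git add app.txt
git -c user.email=dev@example.com -c user.name=dev commit -qm "work for the release"

# Create and push an annotated tag matching the version number.
VERSION="1.30.0"
git -c user.email=dev@example.com -c user.name=dev tag -a "v$VERSION" -m "Release $VERSION"
git push -q origin HEAD "v$VERSION"

# The remote now sees the tag; in GitLab, a tag pipeline would start here.
git ls-remote --tags origin
```

On GitLab, pairing this with a `rules: if: $CI_COMMIT_TAG` condition in the pipeline is one common way to make tag pushes the build trigger.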

Generally, the first step in building the package is launching the tests. If you practice continuous integration, all your changes have already been merged with green tests. But sometimes, concurrent merges can lead to weird behaviors. That’s why running the tests again on the merged result is a safe move.

Next, we can add to the pipeline the actions that are genuinely needed to build our package. Follow your release guide and wire the tasks into the CI workflow. If you still have some manual operations, now is the time to wonder whether they are relevant, or to automatize them. Again, those steps don’t have to be perfect. If you currently need a 15-line ugly-ish bash script to avoid performing a task manually, so be it. Improve the delivery workflow over time; it isn’t carved in stone.

The most crucial point here: never write secret keys or passwords directly in the workflows. Use your CI configuration to store them. Whether you work with GitHub, GitLab, or CircleCI, they all have a secrets or environment variables manager.
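In practice, a deployment step should read its credentials from variables the CI injects, never from the script itself. A sketch (DEPLOY_TOKEN is an illustrative name; the fallback value exists only so the demo runs outside CI):

```shell
#!/usr/bin/env sh
# Read the secret from the environment: in CI it comes from the secrets
# manager; it must never appear in the workflow file or the repository.
set -eu

# Fallback dummy value so this demo is runnable outside CI.
DEPLOY_TOKEN="${DEPLOY_TOKEN:-dummy-token-for-local-demo}"

if [ -z "$DEPLOY_TOKEN" ]; then
    echo "DEPLOY_TOKEN is not set; configure it in your CI secrets" >&2
    exit 1
fi

# Use the token (here we only prove we have it, without ever printing it).
echo "Token loaded (length: ${#DEPLOY_TOKEN})"
```

The same rule applies to SSH keys, registry passwords, and API tokens: the script references a variable name, and only the CI knows the value.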

The last step of your workflow should be storing your artifact where you agreed to. On this project, the RPM package had to be uploaded to a dedicated server. Even if it wasn’t ideal from our point of view, we didn’t want to change, at that point, a process the ops team was used to. On the same server, we uploaded the migration scripts to avoid having to add them manually to the deployment ticket. At least we got rid of another useless manual operation.

Congratulations!

At this point, you’ve reached continuous delivery, which is a huge milestone when you work on a legacy codebase where absolutely nothing is automated or using CI/CD workflows. Celebrate and share your victories :) GG :)

A release history

As a bonus, a few weeks later we added a new job to our workflow: a job responsible for creating a GitLab release. This way, we got a free release history containing all the artifacts and the changelog. It became our single source of truth whenever questions arose about what a given release contained.
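That extra job can be as small as one API call. A hedged sketch using GitLab’s Releases REST endpoint: the CI_* variables are the ones GitLab provides inside jobs, GITLAB_TOKEN is an assumed secret, and the dry-run guard (with its fallback values) exists only so the sketch can be inspected without a real GitLab instance:

```shell
#!/usr/bin/env sh
# Sketch of a "create GitLab release" job via the Releases REST API.
set -eu

# In a real GitLab job these are provided by the runner; the fallbacks
# below are dummies for local inspection.
: "${CI_COMMIT_TAG:=v1.27.0}"
: "${CI_PROJECT_ID:=42}"
: "${CI_API_V4_URL:=https://gitlab.example.com/api/v4}"

payload="{\"name\": \"Release ${CI_COMMIT_TAG}\", \"tag_name\": \"${CI_COMMIT_TAG}\", \"description\": \"See CHANGELOG.md\"}"
url="${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/releases"

if [ "${DRY_RUN:-1}" = "1" ]; then
    # Preview the call instead of performing it.
    echo "POST $url"
    echo "$payload"
else
    curl --fail --request POST \
         --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
         --header "Content-Type: application/json" \
         --data "$payload" \
         "$url"
fi
```

Attaching the artifacts themselves (the RPM, the migration scripts) is done through the release’s assets, which the same API accepts as links.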

Towards continuous deployment?

Continuous deployment can be achieved by following the same documenting, simplifying, and finally automatizing process. Of course, this process is more challenging than continuous delivery, as it implies changing the production environment. However, you can develop the workflow by targeting the staging environments first.

Once again, the most important thing here is to adopt an incremental mindset and clear communication, especially when ops and dev teams are siloed, as they were for me in this project.

A complete infrastructure change was planned for the coming months, which would give us an elastic infrastructure and let us reach continuous deployment. With this in mind, we agreed our artifact would become a Docker image instead of an RPM package. The migrations could finally be part of the Docker image and be launched automatically. With the ops team, we added a step to our continuous delivery process to build this image in parallel with the current artifact. We also created a new workflow to deploy the image on the future staging environments. This enhancement wasn’t finished when I left the project, but it was going in the right direction.

Continuous deployment wasn’t an impossible thing anymore :)

It’s a wrap!

Even in a legacy environment, continuous delivery and continuous deployment are possible. By following a methodical and incremental approach and communicating clearly with your team, the other teams, and the stakeholders, you’ll build trust, solve problems, and enhance the process week after week.

Start by documenting the current process to identify problems. Then, simplify the life of the “release manager”. Finally, automatize the process to eliminate every manual action.

Thanks for reading this very long essay. I hope you liked it. See you :)