June 20, 2020

Welp. You’ve just found out there’s an urgent issue that needs to be addressed. Someone released a bug or maybe there’s an outage with a critical system. Whatever it is, you have a “drop everything” situation. You have yourself a fire.

Step 1: Don’t Panic

Your first responsibility is to keep yourself calm. Your next responsibility is to keep everyone else calm. The first few moments are going to be tense, especially if there is real user impact. Lots of folks are going to rush in and try to help. Engineers will start working on solutions without fully understanding the problem, and effort will be duplicated. Ask everyone to take three deep breaths. Good. Now let’s begin.

Step 2: Establish The Correct Communication Lines

There are going to be multiple lines of communication happening at once. Some of that communication will happen outside of public forums. For example, a few folks might jump on a video call and start talking without letting anyone else know. It’ll be very difficult to understand who needs what information in order to do their job. Because of that, I often find it’s best to keep all communication in a public chat room. If a video call is necessary, post the link publicly so that anyone who needs to can jump on. After the call, summarize the info and post it back in chat so there is something written to refer back to. Let me say this again because it’s so important: keep all information and decision making public and written down.

Stakeholders are going to be very worried, and while they might not be part of the mitigation work, they probably need to be involved in some of the decisions. Establish who needs to be actively informed and who needs to be passively informed. If you’re familiar with RACI, use those principles (though I probably wouldn’t mention the phrase itself as it tends to reek of Process with a capital P). Either you or someone else needs to own frequent updates in the public chat room, with a clear owner for each action item and a time for the next update. Some people prefer creating a dedicated public room to gather all information. I’m mostly indifferent as long as all information is in one place. It’s always surprising to me how fast information flow can spiral. Suddenly a chat turns into emails and phone calls. There are a few Google Docs floating around. Keeping everything in one place won’t be a small effort.
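If it helps to make the update format concrete, here is a minimal sketch of a helper that renders a chat update with an owner per action item and a time for the next update. Everything in it is an assumption for illustration: the `ActionItem` and `format_status_update` names, the 30-minute default, and the message layout are mine, not a standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ActionItem:
    """One piece of in-flight work and the single person who owns it."""
    description: str
    owner: str


def format_status_update(summary, action_items, now, minutes_to_next_update=30):
    """Render a plain-text update suitable for pasting into the incident chat."""
    next_update = now + timedelta(minutes=minutes_to_next_update)
    lines = [f"[{now:%H:%M}] Incident update: {summary}"]
    for item in action_items:
        # Every action item names exactly one owner.
        lines.append(f"- {item.description} (owner: {item.owner})")
    # Always say when people can expect to hear from you again.
    lines.append(f"Next update by {next_update:%H:%M}.")
    return "\n".join(lines)


# Hypothetical incident data, purely for illustration.
update = format_status_update(
    "Checkout errors are ongoing; cause not yet confirmed.",
    [
        ActionItem("Review the most recent deploy", "Priya"),
        ActionItem("Post a banner on the status site", "Sam"),
    ],
    now=datetime(2020, 6, 20, 14, 0),
)
print(update)
```

The point is less the code than the shape of the message: a timestamp, one owner per item, and a promise about when the next update lands.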

Step 3: Investigation and Mitigation

Before starting to work on any solutions, determine the impact and what can be done right now to mitigate that impact. Maybe it’s displaying a banner on the homepage? Maybe it’s updating a status site? It depends on the situation. Before making any code changes, determine with 100% certainty what caused the issue. Often, engineers will rush into a situation without fully understanding the root cause, deploy a fix, and realize only after the issue persists that they’ve wasted precious time. Establish a timeline of what led to the issue. Some important questions to ask:

  • When did this start?
  • What caused the issue?
  • Is it ongoing?
  • What is the impact to customers?
  • Are there legal implications concerning the next steps?
  • Do we need to bring in additional people who are not yet aware of the issue?

Step 4: Take Action

This is the easiest and most frustrating step. If you’re working on communication, you most likely won’t be part of the group that actually solves the issue. It’ll be very hard to just wait and let someone else work. Fight the instinct to bother them as they work. But continue to ask questions when they come back with status updates. Make sure everyone understands the scope of what they’re working on and stress urgency but not panic. The code should still be tested and reviewed. Any system changes should still follow the normal rollout process.

It’s important to take smart shortcuts but not cut dangerous corners. A smart shortcut might be reverting the change to stop the bleeding instead of hunting for a forward fix. A dangerous corner to cut would be pushing a code change without testing it. Your job here is to keep everyone thinking clearly and avoid additional damage caused by a stressful situation.

A common fix is running a script directly against a database. I think there are arguments to be made for and against this kind of action. I believe certain situations make this unavoidable. There are ways to be smart here. Back up your database first. If you’re running a script that touches a large number of records, start with a handful of records and verify the change was successful before touching the rest. Keep a record of every script that was run and when. Make sure that every script is reviewed by at least one other person (I usually prefer two).
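To make the “handful of records first” idea concrete, here is a rough sketch in Python using the standard library’s sqlite3 module and an in-memory database. The `orders` table, the `status` values, and the `run_fix_in_stages` helper are all hypothetical; a real incident script would target your own schema and database driver, and it would run only after a backup and a review.

```python
import sqlite3
from datetime import datetime, timezone


def run_fix_in_stages(conn, ids, canary_size=5, batch_size=500):
    """Apply a (hypothetical) data fix to a few rows, verify, then do the rest."""
    audit_log = []  # every statement that ran, with its timestamp and target ids

    def apply_fix(batch):
        marks = ",".join("?" for _ in batch)
        sql = f"UPDATE orders SET status = 'refunded' WHERE id IN ({marks})"
        conn.execute(sql, batch)
        audit_log.append((datetime.now(timezone.utc).isoformat(), sql, list(batch)))

    def verify(batch):
        marks = ",".join("?" for _ in batch)
        sql = f"SELECT COUNT(*) FROM orders WHERE id IN ({marks}) AND status = 'refunded'"
        return conn.execute(sql, batch).fetchone()[0] == len(batch)

    def run_batch(batch, label):
        apply_fix(batch)
        if not verify(batch):
            conn.rollback()  # undo this batch and stop before doing more damage
            raise RuntimeError(f"{label} failed verification; batch rolled back")
        conn.commit()

    # Stage 1: a handful of records, verified before anything else is touched.
    run_batch(ids[:canary_size], "canary batch")

    # Stage 2: the remaining records, in bounded batches.
    rest = ids[canary_size:]
    for start in range(0, len(rest), batch_size):
        run_batch(rest[start:start + batch_size], f"batch at offset {start}")
    return audit_log


# Demo against a throwaway in-memory database with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (id, status) VALUES (?, 'charged')",
    [(i,) for i in range(1, 21)],
)
conn.commit()

log = run_fix_in_stages(conn, list(range(1, 21)), canary_size=5)
for ran_at, sql, batch_ids in log:
    print(ran_at, len(batch_ids), "rows:", sql)
```

The returned log doubles as the written record of what was run and when, so it can be pasted straight into the incident chat.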

Step 5: The All-Clear

As the situation starts to clear, continue to provide updates to stakeholders. Continue to ask whether there are any additional steps that need to be taken. Be aware that this is a stressful situation for many people and be considerate when some people continue to ask questions that have been answered already. Continue to write down everything.

When all action items are done, it’s important to give an official all-clear. It really sucks when the team is just hanging around unsure of what to do. Make sure you voice appreciation for everyone helping out (whether or not they were part of the original cause). Your team worked hard that day and should be acknowledged for it. Many of my favorite moments at work have come during incidents because it’s very much a bonding experience for a team (though you really should wait to laugh about it until after the issue is fixed and some time has passed).

Step 6: Post-mortem

Run a post-mortem. This should never be started during the incident but is critical afterward. Make sure everyone knows this process is not about blame but about making improvements based on the learnings. I like this template from PagerDuty but really any document with a timeline, steps taken and future action items is fine. Keeping this conversation away from blame is key. Remind everyone that when something like this happens, there are multiple layers of failure. It’s easy to blame an engineer for some bad code. But what about the reviewer? What about the previous engineers who never built in safeguards? What about management who didn’t prioritize training and education for this kind of issue? What about stakeholders who push engineers to move faster and don’t want to hear about edge cases? There is more than enough blame to go around. Let’s be productive and think about solutions instead.

Keep the post-mortem factual and friendly. This is a dangerous meeting where one wrong thing said could really harm team chemistry. Start by thanking everyone again. Make sure to acknowledge folks who might not have been as visible, like customer support or legal. Brainstorm action items that will help prevent or mitigate issues like this in the future. Assign owners and time frames for each item you commit to. Not every item actually needs to be assigned. Generally, there will be a handful of items that have the highest value. Focus on those. A lot of frustrated people who just experienced a really bad day will throw out some ideas you will think are not great. Let them. Write everything down. If an idea isn’t valuable or achievable, you can find time later to talk it over.

The post-mortem is really just the beginning. It’s your job as a leader to keep everyone accountable for follow-ups and make sure work is prioritized to ensure this kind of incident never happens again. In fact, this is where most managers earn their paycheck. Everyone is usually all positive right after an incident but as time passes, other priorities start to bump incident follow-ups out of the To-Do column. You must keep pressing the team to prioritize this work. Keep reminding your stakeholders about the potential impact of an incident like this happening again. Do they want to go through this again? It’s your job to make sure they don’t.