Navigating through failure is an inevitable part of the journey for any engineering leader. Maybe you missed the mark or maybe your team did; to be honest, it really doesn’t matter where the failure occurred. No one individual will ever have a perfect track record.
So if you know you’re going to mess up at some point, what should you do? Of course, part of your energy should go toward preventing mistakes before they occur, but where I see leaders thrive or fall apart is in how they react in the face of failure. It's not the setbacks themselves but how you rebound from them that sets leaders apart. Let’s talk through a few things I recommend doing.
Embrace the fact that failure’s going to happen and treat it as a learning opportunity. This goes for both you and your team. First off, accept that failure is not the opposite of success; it's a part of it. In a fast-paced tech world, not every project or initiative will pan out as expected. We all know how difficult it is to capture every possible item within a scope of work and to estimate it properly. What matters is how these moments are used to foster a culture of innovation and resilience. Acknowledge the failure, dissect it without assigning blame, and focus on the learnings it provides.
Foster a safe environment for taking risks. As a leader, your reaction to failure sets the tone for your team. If you treat setbacks with a sense of curiosity rather than frustration, you encourage a culture where team members feel safe to take calculated risks. I’ve seen leaders who are either far too quick to assign blame for a situation, or, if they’re the ones who messed up, avoid talking about it altogether. When your team knows that failure won't lead to punitive measures, they're more likely to think outside the box and contribute groundbreaking ideas.
Lead by example. Demonstrate resilience. Share your own experiences with failure, what you learned, and how you bounced back. This not only humanizes you but also reinforces the message that it's okay to fail as long as you learn from it and move forward. Your team will respect and respond to this transparency, fostering a closer, more communicative group dynamic.
Focus on solutions, not blame. When a project doesn't go as planned, it's easy to fall into the blame game. Do whatever you can to resist this. Shift the conversation from who is at fault to what can be learned. It’s okay to pick apart a project and break down the actual chain of events that occurred. When I have to go through this exercise with my team, I often preface these conversations by stating that we’re looking for opportunities to improve our processes. Is there something I can do better as their manager to ensure a project stays on track? Encourage your team to conduct a post-mortem analysis that focuses on insights and future strategies. This approach not only aids in personal and team growth but also helps in quickly pivoting to the next possible solution.
Apply your learnings and choose your next moves carefully. Once the lessons have been learned, follow through on them. If you identified process changes, actually enact them; otherwise the lesson you’re working through with your team is purely lip service, and they’re going to notice that nothing has changed. Adjust your goals or rethink your approach, test it out, and see what works for you as a leader and for the team.
Navigating through failure as an engineering leader isn't about avoiding setbacks; it’s about learning how to leverage them for growth. Failure sucks. None of us like messing up. When this occurs, it’s your chance to really step up as a leader and help your team through the next steps.
The CrowdStrike Incident
What happened at CrowdStrike?
On July 19, 2024, a faulty update to CrowdStrike's Falcon sensor, software that runs at the kernel level, was rolled out to 100% of Windows hosts at once, causing the dreaded BSOD. To fix it, IT admins had to boot each affected machine into safe mode one by one and delete the offending file.
Infrastructure around the globe ground to a halt as a result: hospitals, airports, and more had to navigate around the outage.
This seems like a pretty colossal failure, no?
It sure is. This is exactly why organizations put several checks in place and avoid YOLO rollouts straight to 100% of customers. (This may be a gross oversimplification of what happened, and we’ll learn more as CrowdStrike updates the general public.)
Look—nobody has a perfect track record. Things are going to break. Some things just break worse than others. As a leader in charge of shipping critical infrastructure, your goal is two-fold:
- Test thoroughly!
- Have a mitigation plan for when something does break
If failure’s going to happen to some degree, the best you can do is plan for it.
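To make that second point concrete, here’s a minimal sketch of a staged rollout with a built-in halt condition. All of the names, stage sizes, and the health check are hypothetical illustrations of the general technique, not a description of CrowdStrike’s (or anyone’s) actual pipeline:

```python
# Hypothetical stage sizes: expand the blast radius slowly instead of 0 -> 100.
STAGES = [0.01, 0.05, 0.25, 1.00]

def healthy(hosts):
    """Placeholder health check. In practice this would watch crash rates,
    error budgets, and support tickets for the hosts already updated."""
    return all(not h["crashed"] for h in hosts)

def staged_rollout(fleet, apply_update):
    """Roll an update out in stages, halting the moment the
    already-updated cohort looks unhealthy."""
    updated = []
    for fraction in STAGES:
        target = int(len(fleet) * fraction)
        for host in fleet[len(updated):target]:
            apply_update(host)
            updated.append(host)
        if not healthy(updated):
            return updated, "halted"  # this is where the mitigation plan kicks in
    return updated, "complete"
```

The point of the sketch is the shape, not the specifics: each stage limits how many machines a bad update can reach before the health check has a chance to stop it.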
Now, let’s dig into that second point, that we should foster a safe environment for taking risks. There is a key difference between supporting risk-taking and being foolish. Innovation happens in a space where teams feel comfortable making bets. But this isn’t a blanket statement.
I recently had a conversation with one of my engineers about this point in particular. As our company has grown substantially in terms of customer count, our systems are becoming more and more critical to our end users, meaning we need to be even more careful about how we approach releases. Gone are the days when we ship and pray nothing breaks. If customers can’t access their video footage, they may miss a critical event pertaining to a security or safety issue. Risk taking here means giving the team the space to try something new, but in a controlled environment. And, when we feel good about opening the doors to others using a particular new feature or function, we roll out gradually, not 0 to 100.
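One common way to implement that kind of gradual rollout is deterministic percentage bucketing. This is a sketch of the standard technique (the function and feature names are my own invention, not our actual system): hash each user into a stable bucket so that raising the rollout percentage only ever adds users, and nobody flips back and forth between code paths.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Deterministically place a user in a bucket from 0.00 to 99.99,
    so the same user always gets the same answer for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000 / 100.0
    return bucket < percent
```

Because the bucket is fixed per user, expanding `percent` from 1 to 5 to 25 to 100 is monotonic: everyone enabled at an earlier stage stays enabled at the later ones.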
The last point I’ll make based on the original newsletter is around applying your learnings. Sharing our failures helps others avoid making the same mistakes. The post-mortem that comes out of this CrowdStrike incident should be read and reviewed by every team shipping products. While corner-case incidents can happen, this feels like a case where we can immediately point to processes we have, or should implement, to prevent this situation from occurring within our own organizations.
Remember: Even if you aren’t building critical-to-life systems at your work, the work you do is important to businesses or customers in your own way and should be taken seriously.