Photo by Varvara Grabova on Unsplash

The phone alarm rings; the time is 3 a.m. I wake up reluctantly to begin a now-unconscious routine: wash my face, grab a cup of coffee and head for my study desk. I start my day with a Bible meditation and study session. All goes well until an email alert arrives from a priority account: “Airflow alert: <Taskinstance: … [failed]>…”

“Oh no, not again!” I exclaim, and the rest of my morning routine goes awry as I struggle to figure out what could have gone wrong.

So, I took up the role of monitoring and maintaining the Airflow jobs prepared by the engineering team. Over time, I have encountered many scenarios like the one described above, and they have challenged the way I think about handling failures.

In this article, I will communicate two things: a mindset change and a framework for handling failures.

Mindset change

1. Things can go wrong

I used to feel very disappointed when a job I created failed. I thought it was an indictment of my competence. But one of the first things I observed in the more senior members of my team is that they were always more relaxed when errors popped up.

Suffice it to say, I came to understand that Edward Murphy was right: anything that can go wrong eventually will. Developing that mindset helped me achieve two things: I stopped treating every failure as a verdict on my competence, and I started treating failures as opportunities to learn.

Accepting that things can go wrong may seem like a pessimistic stance, but you are more likely to improve faster by learning from your failures than you would from your successes.

2. No blame games!

Nobody likes to have their daily routine disrupted because some pipeline decided to fail. You are likely to be grumpy and look for someone or something to chew out; fair enough. But it does not do much good.

In the event of a failure, what you do not want to do is look for someone to blame. Instead, focus on finding out what the incident is, its impact, and the resolution.

People naturally become defensive and are likely to absolve themselves of any fault when things go wrong. Playing the blame game increases tension within the team and stops anyone from arriving at the root cause early enough.

Changing your mindset is one of the many pieces of internal work you should anticipate doing throughout your career. It provides a faster route to growth.

A framework for handling failures.

One of the many lessons I learned from reading Matthew Syed’s book, Black Box Thinking, is that the way to improve the efficiency of any system is to learn from the times when the system failed, and that this usually requires individuals and teams to be objective enough to prevent emotions and ego from interfering.

Over time, I observed how my engineering team handled incidents and came up with a routine that helps me resolve errors much more quickly. The following steps describe it:

1. Do a preliminary analysis and gauge the impact of the error.

i. What job failed?

The alert email and the generated logs will usually contain that information. (This assumes the logs are being written to the Airflow UI and that each task in the Airflow job has the email_on_failure option set to True.)
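As an aside, here is a minimal sketch of the kind of DAG configuration that makes those alert emails possible, assuming SMTP is already set up for the Airflow deployment. The DAG name, task name, email address and script path are made up for illustration:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical job; the email settings below are what trigger the 3 am alert mails.
default_args = {
    "owner": "data-team",
    "email": ["oncall@example.com"],   # made-up address
    "email_on_failure": True,          # send an email whenever a task fails
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_sales_load",          # made-up DAG name
    default_args=default_args,
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 2 * * *",      # runs shortly before my 3 am alarm
    catchup=False,
) as dag:
    load_sales = BashOperator(
        task_id="load_sales",
        bash_command="python /opt/jobs/load_sales.py",  # hypothetical script
    )
```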

ii. Are there dependent jobs?

Here you want to quickly find out whether other jobs depend on the output generated by the failed job. These could be a Tableau report or a Python script that handles some extra, complex transformations. Knowing the dependent jobs helps you gauge which other systems will be affected.
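If your team wires up cross-DAG dependencies in Airflow, one place they show up is in sensors. A hedged sketch, continuing the made-up names from the earlier snippet: a downstream DAG that waits on the failed task with an ExternalTaskSensor will also stall until the upstream job is fixed.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

# Hypothetical downstream DAG: a failure in daily_sales_load also delays this refresh.
with DAG(
    dag_id="sales_report_refresh",         # made-up DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 3 * * *",
    catchup=False,
) as dag:
    wait_for_sales = ExternalTaskSensor(
        task_id="wait_for_sales_load",
        external_dag_id="daily_sales_load",   # the DAG that failed
        external_task_id="load_sales",
        execution_delta=timedelta(hours=1),   # upstream is scheduled an hour earlier
        timeout=60 * 60,                      # stop waiting after an hour
    )
```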

2. Communicate to stakeholders.

This step is vital, as you want to be the first to highlight the problem before the business teams ring the alarm bells. Armed with the information gathered in the previous step, you can quickly draft a short message or email informing them of the error and the systems that will be affected.

Communicate to your engineering team as well.
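Part of this first notification can even be automated. Airflow lets you attach an on_failure_callback to tasks; the sketch below, with a hypothetical Slack webhook URL and message wording of my own, posts the failed task and a link to its logs the moment the failure happens, so the human follow-up message can focus on impact and next steps.

```python
import requests  # assumes the requests package is available on the Airflow workers

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # hypothetical webhook


def notify_stakeholders(context):
    """Called by Airflow with the task's context whenever the task fails."""
    ti = context["task_instance"]
    message = (
        f":rotating_light: {ti.dag_id}.{ti.task_id} failed "
        f"for run {context['ds']}. Logs: {ti.log_url}"
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)


# Attach it alongside the email settings, in default_args or on individual tasks:
default_args = {
    "email_on_failure": True,
    "on_failure_callback": notify_stakeholders,
}
```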

3. Deep dive into the incident and work out the fix.

In this step, you identify the root cause and work out a fix. The time spent here depends on the error and perhaps on technical skills, but be sure to collaborate with a teammate as soon as you are stuck. Don’t forget that the goal is to arrive at the root cause and work out a fix as soon as possible.

When the fix has been tested and applied, communicate back to the engineering team and the business stakeholders.
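Once the fix is in, the failed task usually has to be re-run. Clearing it from the Airflow UI works fine; for completeness, here is a hedged sketch of doing the same through Airflow 2’s stable REST API, assuming the API and basic authentication are enabled (the URL, credentials, IDs and dates below are all made up):

```python
import requests

AIRFLOW_URL = "http://localhost:8080"    # hypothetical webserver URL
AUTH = ("admin", "admin")                # assumes the basic-auth API backend

# Clearing the failed task instances tells the scheduler to run them again.
response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/daily_sales_load/clearTaskInstances",
    auth=AUTH,
    json={
        "dry_run": False,
        "only_failed": True,                  # leave successful tasks untouched
        "task_ids": ["load_sales"],           # made-up task id
        "start_date": "2022-01-10T00:00:00Z",
        "end_date": "2022-01-11T00:00:00Z",
    },
    timeout=30,
)
response.raise_for_status()
```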

4. Document, document, document!

I must admit this is the hardest and most mundane step. After resolving the error and applying the fix, the last thing you want to do is prepare a write-up.

But this is a vital component because errors reveal faults. They are like feedback mechanisms that show you where the problem was. By documenting, you expose that new knowledge for posterity. I think that’s why StackOverflow will forever remain a gold mine.

Aim for brevity. The documentation should contain a short description of the incident, its impact, the root cause, and the fix that was applied.

In conclusion, accepting that things can sometimes go wrong, and not looking for someone to blame when they do, can be a game-changer.

So when the error alerts pop up, you can run through this process:

  1. Do a preliminary analysis and gauge the impact of the error.
  2. Communicate to stakeholders.
  3. Deep dive into the incident and work out the fix.
  4. Document, document, document!

I admit this may seem like a lot of work. But you are likely to gain a deeper understanding of the underlying tech architecture. And perhaps more importantly, it can also reveal how deeply connected your work is with the rest of the business.

And as if by some miracle, we now receive fewer error alerts.

I found this helpful article by Kiana Alessandra Villaera on how data engineers can mitigate bugs. Do check it out.

I hope you enjoyed this.

Cheers!!