Postmortem

Issue Summary

On 14/01/2024 at 06:30 AM (UTC+3), a Datadog alert came in warning that the Apache server we monitor might be down. I tried to connect to our website and got a 500 Internal Server Error. As the on-call site reliability engineer that day, I found that the outage was caused by typos in the file extensions of some Apache configuration files.

Timeline

  • 06:30 AM – Outage began; the Datadog alert was received.

  • 06:30 AM – The back-end team was notified.

  • 06:33 AM – A connection to the website was attempted and returned a 500 error.

  • 06:35 AM – 06:48 AM – Debugging began by reviewing the most recent commits to work out which change could have caused the outage.

  • 06:50 AM – The fix was pushed to the main code base and the server was restarted.

  • 06:55 AM – The Apache server was back online.

Root Cause and Resolution

The first step was to review the latest commits to the code base to narrow down the root cause. Five commits had been merged by the back-end team. With that in mind, I moved on to the error logs of the Apache server.

The error log did not reveal the underlying problem, so strace was run to get a deeper view of what the server was actually doing. The trace showed typos in the file extensions of some of the Apache server's configuration files, which were causing the server to respond with a 500 error.
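
For context, the sketch below shows roughly how such a trace can be gathered with strace attached to the Apache workers. The process name, the traced syscall, and the log paths are assumptions for illustration, not the exact commands used that morning.

    # Attach strace to the Apache worker processes and log the file-open
    # syscalls they make while the failing request is replayed.
    # The process name ("apache2") and output paths are assumptions.
    import subprocess

    # Find the PIDs of the running Apache workers.
    pids = subprocess.run(
        ["pgrep", "-f", "apache2"], capture_output=True, text=True
    ).stdout.split()

    # Attach a tracer to each worker; "-e trace=openat" limits the output to
    # file opens, which is where a mistyped file name or extension shows up.
    for pid in pids:
        subprocess.Popen(
            ["strace", "-f", "-p", pid, "-e", "trace=openat",
             "-o", f"/tmp/strace-apache-{pid}.log"]
        )

Replaying the failing request while the tracer is attached makes the offending open call, and therefore the mistyped file name, stand out in the logs.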

The typos were corrected and the changes were pushed, after which the Apache server was restarted and tested for functionality. The test passed and the server was back online at 06:55 AM UTC+3.
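
The functionality test was essentially a check that the site responds with a 200 again. A minimal sketch of that kind of check (the URL is a placeholder):

    # Post-fix check: confirm the site responds with HTTP 200 again.
    # The URL is a placeholder for the real site address.
    import urllib.request

    # urlopen raises HTTPError for 5xx responses, so a failure is loud
    # either way.
    with urllib.request.urlopen("https://www.example.com/", timeout=10) as resp:
        assert resp.status == 200, f"unexpected status: {resp.status}"
    print("Site is responding normally")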

Corrective and Preventative Measures

To prevent similar problems in the future, the following measures should be adopted:

  • Create an automated test pipeline that runs on every update push; a sketch of a basic configuration check is shown after this list.

  • Create multiple tests for every new update, and do not merge the changes until those tests pass.

  • Include a summary of the changes in every merge, instead of a reliability engineer rolling back changes to determine what might have caused an outage. This would greatly speed up diagnosing and debugging the problem.

  • Configure the monitoring software (Datadog) to measure multiple metrics so that every angle of potential failure is covered.
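
As a sketch of the first measure, one pipeline step could validate the Apache configuration before anything is merged. The directory layout and the apachectl call below assume a Debian-style Apache install and are illustrative rather than a description of the actual deployment.

    # Pre-merge check sketch: fail the pipeline if an enabled Apache config
    # file has an unexpected extension or the configuration does not parse.
    # The /etc/apache2 layout is an assumption (Debian-style install).
    import pathlib
    import subprocess
    import sys

    CONF_DIRS = [
        pathlib.Path("/etc/apache2/sites-enabled"),
        pathlib.Path("/etc/apache2/conf-enabled"),
    ]

    # 1. Files in these directories are only loaded if they end in ".conf",
    #    so a typo in the extension quietly changes the server's behaviour.
    bad_files = [
        p
        for d in CONF_DIRS if d.is_dir()
        for p in d.iterdir()
        if p.is_file() and p.suffix != ".conf"
    ]
    if bad_files:
        print("Config files with unexpected extensions:")
        for p in bad_files:
            print(f"  {p}")
        sys.exit(1)

    # 2. Ask Apache itself to validate the configuration syntax.
    sys.exit(subprocess.run(["apachectl", "configtest"]).returncode)

Running a check like this as a required step on every push would have turned the mistyped extension into a failed build instead of a production 500.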