Postmortem

Daniel Ortega Chaux
Sep 29, 2021

Every company has its own name for its highest-priority bugs. Not every software bug is dramatic or critical, but some need to be fixed as fast as the developers can manage, like the Apache 500 error we will talk about later in this post. First, let’s clarify what a postmortem is: a team analyzes what went wrong with something they did, fixes it, and works to prevent it from happening again.

Basically, a postmortem should be used to stop making the same mistake over and over again. Beyond that, it is a chance to learn every lesson we can from the issues and bugs we had, to make sure they do not happen again.

In short, nothing is more valuable than experience. Here is one of ours:

Apache 500 error:

Issue Summary:

Two weeks ago, on September 14, 2021, our Apache server returned 500 errors for around 3 hours (11 am to 2 pm). Every user was affected: 100% of users could not log in to their accounts and saw the error message “500 Internal Server Error” instead. The root cause turned out to be simple to fix: a file extension had been typed as .phpp instead of .php.

Timeline:

  • The issue was detected at 11:30 am on 9/14/2021 when running curl against 127.0.0.1.
  • The command returned “500 Internal Server Error”, meaning a server-side error was preventing Apache from processing the request and returning a proper response.

root@e514b399d69d:~# curl -sI 127.0.0.1
HTTP/1.0 500 Internal Server Error
Date: Fri, 19 May 2020 11:15:00 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Powered-By: PHP/5.5.9-1ubuntu4.21
Connection: close
Content-Type: text/html

  • The first debugging step was to review every command that had been run, using the ‘history’ command.
  • Second, we restarted the server to validate that it could restart correctly (which it did).
  • Then we checked the processes that were currently running to see whether any of them had an issue.
  • After checking the processes, we found one that wasn’t responding.
  • Finally, when we went to check the file, we found the misspelled extension. We corrected it, and the server was fixed at 2 pm the same day.
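The detection step in the timeline came down to reading the status line of the curl response. As a minimal sketch (the function names are illustrative, not part of any tool we used), the same check can be scripted: parse the first line of the headers and flag any 5xx code as a server-side failure.

```python
import re
from typing import Tuple

# Matches an HTTP status line such as "HTTP/1.0 500 Internal Server Error".
STATUS_LINE = re.compile(r"^HTTP/\d(?:\.\d)?\s+(\d{3})\s*(.*)$")


def parse_status(head_output: str) -> Tuple[int, str]:
    """Return (status_code, reason) from the first line of header output."""
    lines = head_output.strip().splitlines()
    first_line = lines[0].strip() if lines else ""
    match = STATUS_LINE.match(first_line)
    if match is None:
        raise ValueError(f"not an HTTP status line: {first_line!r}")
    return int(match.group(1)), match.group(2).strip()


def is_server_error(code: int) -> bool:
    """5xx codes mean the server failed while handling the request."""
    return 500 <= code <= 599


# Trimmed copy of the headers captured during the incident.
sample = """HTTP/1.0 500 Internal Server Error
Server: Apache/2.4.7 (Ubuntu)
Connection: close
"""

code, reason = parse_status(sample)
print(code, reason, is_server_error(code))  # 500 Internal Server Error True
```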

Root cause and resolution:

The issue was an invalid name in /var/www/html/wp-settings.php: an extension spelled .phpp instead of .php.

Once we had traced the unresponsive PHP process to this file, we corrected the extension to fix the problem.

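# Fix the .phpp typo in wp-settings.php (replace phpp with php)

exec { 'fix-file':
  command => "sed -i 's/phpp/php/g' /var/www/html/wp-settings.php",
  path    => ['/bin', '/usr/bin'],
  # Only run while the typo is still present, so repeated runs are no-ops
  onlyif  => 'grep -q phpp /var/www/html/wp-settings.php',
}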
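The sed command rewrites every occurrence of phpp in the file, so it can be useful to see which lines would change before applying it. The following is a rough sketch of such a check, not something we ran during the incident. The function name and the sample file contents are hypothetical and do not reproduce the real wp-settings.php.

```python
import re

# ".phpp" followed by a non-word character or the end of the line.
TYPO = re.compile(r"\.phpp\b")


def find_phpp_typos(source: str):
    """Return (line_number, line) pairs for lines that mention '.phpp'."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if TYPO.search(line)
    ]


# Hypothetical file contents, for illustration only.
example = "<?php\nrequire_once 'example-a.php';\nrequire_once 'example-b.phpp';\n"

print(find_phpp_typos(example))  # [(3, "require_once 'example-b.phpp';")]
```

Running a check like this over the web root before a deploy would surface the typo without modifying any file.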

Corrective and preventative measures:

The first thing we can do to improve is attach a monitoring tool such as Datadog to the web servers. The monitoring software could send requests to different endpoints on our system and verify that they all return a 200 success status code.
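To make the idea concrete, here is a minimal sketch of that kind of endpoint check using only the Python standard library. It is not Datadog’s API; the function names and the /health path are assumptions made for illustration. The fetcher is passed in as a parameter so the example can run without touching the network.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urllib raises 4xx/5xx responses as HTTPError
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def check_endpoints(
    urls: Iterable[str], fetch: Callable[[str], int] = fetch_status
) -> Dict[str, int]:
    """Return only the endpoints that did not answer with status 200."""
    results = {url: fetch(url) for url in urls}
    return {url: code for url, code in results.items() if code != 200}


# Demonstration with a stubbed fetcher; the URLs and codes are made up.
fake_codes = {"http://127.0.0.1/": 500, "http://127.0.0.1/health": 200}
print(check_endpoints(fake_codes, fetch=fake_codes.get))
# {'http://127.0.0.1/': 500}
```

Calling check_endpoints without the fetch argument makes real HTTP requests, so a scheduled job could alert whenever the returned dictionary is not empty.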

TODO list to address the issue:

  • Sign up for Datadog
  • Install datadog-agent
  • Monitor some metrics
