Every sysadmin, operator, and even app developer has been there. There’s a spike in your dashboard and your #alerts Slack channel is firing off. Something has gone wrong, and you’re about to go into Perry Mason mode to track down the offender.
That’s hard enough in a typical environment, but in a containerized environment you often can’t even access data from the containers that were on the fritz. So you’re stuck with the well-established and well-loved solution of killing the container to stop the issue.
The problem now is that your troubleshooting data is gone, and you have absolutely no clue as to the root cause of the alert.
And let’s be honest – that dashboard that spiked? It was a great tip-off, but in most cases it won’t hold the answer. Troubleshooting may start with a dashboard notification, but it certainly doesn’t end there.