Runbooks are important because they make knowledge easily actionable for someone without domain expertise. This ensures, for example, that the engineer who created the service doesn’t need to be the first line of defense in the event of an outage. Instead, if they create a runbook, anyone else can pick it up and take the right steps to fix the problem.

Some awesome advice in this article. Same goes for alerts:

  • Make them focused and concise
  • Make them actionable
  • Make them so anyone on your team can use them

As usual, the Hacker News comments mention a few really interesting-looking articles as well:

  • This one on SRE for high-traffic gambling sites. tl;dr: process can be good if you take your time and make sure you’re getting value from it
  • This one about do-nothing scripts to ease into automation. I actually thought this would be about writing stubs/shims that would eventually be completed. Instead, the author uses these as a sort of guide program that tells you the steps you need to run and then waits. I could see this being an interesting approach for some tasks like app setup… but then I’m probably going to try to automate them anyway. Interesting all the same.
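
To make the do-nothing script idea concrete, here is a minimal sketch of what such a guide program might look like in Python. This is my own illustration rather than the author's code: the step list, the `run` function, and its `prompt_fn`/`out` parameters are all hypothetical names, and the app-setup steps are made up.

```python
"""A do-nothing script: it automates nothing, it only walks you through steps."""

# Hypothetical app-setup steps, invented for illustration.
STEPS = [
    ("Clone the repository", "Clone the app's repository to your machine."),
    ("Install dependencies", "Install the packages listed in the README."),
    ("Create a config file", "Copy the example config and fill in your values."),
]


def run(steps, prompt_fn=input, out=print):
    """Print each step, then wait until the operator confirms it is done.

    prompt_fn and out are injectable so the flow can be exercised
    without a terminal. Returns the number of steps walked through.
    """
    completed = 0
    for number, (title, instructions) in enumerate(steps, start=1):
        out(f"Step {number}/{len(steps)}: {title}")
        out(f"  {instructions}")
        # The script does no work itself; it just blocks here.
        prompt_fn("  Press Enter when done: ")
        completed += 1
    out("Done.")
    return completed
```

Calling `run(STEPS)` in a terminal prints one step at a time and blocks on Enter. The appeal, as I read it, is that each step lives in one place, so any single step could later be swapped for real automation without touching the rest.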