Runbooks Guide for SRE & On-call teams

July 24, 2024

What are Runbooks?

Runbooks are sets of instructions to follow when executing a known procedure. In the context of software, SRE and engineering teams create runbooks as guides for predefined tasks during on-call operations. These tasks can range from “How to check Kubernetes cluster health?” to “How to roll back?”.

What is a good runbook?

A well-written runbook gives the reader (typically an engineer) the following information:

  1. When to use the runbook: If the runbook's objective does not already cover this, give the user guidelines with situational context (for example, when a recent deployment has impacted order volume in production for more than 1 hour) or alert information (for example, when you get an alert about a database CPU spike). This helps the user avoid mistakes.
  2. What the runbook helps achieve: This tells the user at a glance the goal that the steps below work toward. You can think of it as the title of the runbook.
  3. What steps need to be taken, where, and how:
    • A [playbook](https://sandbox.drdroid.io/) should include all the relevant links within the instructions. For example, if a step says “check for service latency and/or error rate spike”, it should ideally link to a dashboard with that information.
    • What command to run, where to run it, and how to get there. If a step requires the user to “restart a service”, it should say (a) where and how to log in (does it require SSH, email login, or escalation?) and (b) the exact command to run.
    • How to interpret the output of the step and what to do next: after following the procedure, which next steps are most relevant for each possible outcome. (For example, if an action fixed the problem, the user should monitor for the next 30 minutes; if the problem remains, the user should also explore option B.)
  4. Whom to report to after running the runbook.
  5. Whom to escalate to if the runbook is insufficient.
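
The five pieces of information above can also be captured as a small template, so that an incomplete runbook is easy to spot. The following is a minimal sketch, not a prescribed format: the class names, field names, and the example URL are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One runbook step: what to do, where, and how to read the result."""
    instruction: str
    link: str = ""           # dashboard or sign-in page for this step
    command: str = ""        # exact command to run, if any
    if_resolved: str = ""    # what to do when the step fixes the problem
    if_unresolved: str = ""  # what to try when the problem remains


@dataclass
class Runbook:
    title: str               # what the runbook helps achieve
    when_to_use: str         # situation or alert that triggers it
    steps: list[Step] = field(default_factory=list)
    report_to: str = ""      # whom to report to after running it
    escalate_to: str = ""    # whom to escalate to if it is insufficient

    def missing_fields(self) -> list[str]:
        """Return the names of required sections that are still empty."""
        missing = []
        for name in ("title", "when_to_use", "report_to", "escalate_to"):
            if not getattr(self, name):
                missing.append(name)
        if not self.steps:
            missing.append("steps")
        return missing


# Hypothetical runbook that still lacks reporting and escalation contacts.
example = Runbook(
    title="Restart a service",
    when_to_use="Alert: database CPU spike",
    steps=[
        Step(
            instruction="Check service latency and error rate",
            link="https://grafana.example.com/d/latency",  # placeholder URL
        )
    ],
)
print(example.missing_fields())
```

A check like `missing_fields()` could run during review, so a runbook without an escalation contact is flagged before it is needed on call.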

Why should teams create Runbooks?

Runbooks can help drive autonomy and context sharing between team members. Here are some of the benefits you would see in your team after implementing runbooks:

  • Help mitigate issues faster: your team would figure out
  • Improve developer productivity.
  • Bring standardisation and confidence to the team’s ability to debug.

Best Practices on writing runbooks

  1. Keep runbooks structured and organised so they are easy to find. If you use Confluence or a wiki, add relevant labels by team, service, etc.
  2. Always add all relevant links: dashboards, sign-in pages.
  3. Have a meta level file which includes information on:
    • Tools & Technologies used in the company:
      • Every engineer can spend a lot of time figuring out permissions. A single document that explains all the tools and how to access them (or request access) avoids this.
      • If you don’t have access to something, whom to ask for permission.
      • Where the most important code and product documentation lives, in case you need to hop between product documentation and on-call documentation.
  4. Keep low on subjectivity.
    • Avoid: “Check if everything is fine.”
    • Prefer: “Go to this specific dashboard, check if the metrics look impacted, and then look at the health of your pods here.”
  5. Sequential execution:
    • Avoid: “Check Grafana, kubelens and logs.”
    • Prefer: Evaluate if the CPU-related metrics in this Grafana dashboard are fine.
      • If yes, then check the health of k8s in Lens.
        • If yes, then check application logs for any new errors.
  6. Have an onboarding document: Ask a new user to go through 5-10 postmortems from previous on-call rotations, and define which user accesses are needed by whom.
  7. Have an on-call runbook update mandate: Every issue faced on call should either be covered in an existing runbook or lead to a new runbook.
  8. Get them reviewed by at least 1-2 team members.
  9. Make runbooks accessible across the company.
    • Runbooks help engineers quickly understand how another team’s system operates and what its moving parts are. It’s especially useful in case.
    • Discovery of runbooks should be intuitive, especially given the first point in the set of best practices.
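
The sequential-execution practice (item 5 above) can be sketched in code: run each check in a fixed order and stop at the first layer that looks unhealthy, instead of checking everything at once. This is an illustrative sketch only; the three check functions are hypothetical stand-ins that return fixed values, not real calls to Grafana, Lens, or a log backend.

```python
from typing import Callable

# Each check returns True when that layer looks healthy.
Check = Callable[[], bool]


def run_sequential_checks(checks: list[tuple[str, Check]]) -> str:
    """Run checks in order and stop at the first one that fails.

    Returns the name of the failing check, or "all checks passed".
    """
    for name, check in checks:
        if not check():
            return name
    return "all checks passed"


# Hypothetical stand-ins for real lookups; replace with your own queries.
def cpu_metrics_ok() -> bool:
    return True


def k8s_health_ok() -> bool:
    return False


def app_logs_clean() -> bool:
    return True


result = run_sequential_checks([
    ("CPU metrics in Grafana", cpu_metrics_ok),
    ("k8s health in Lens", k8s_health_ok),
    ("application logs", app_logs_clean),
])
print(result)
```

Because the order is explicit, the reader knows which layer to investigate first and which checks were never reached.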

Popular tools to write runbooks

Doctor Droid (open source): Playbooks is not a documentation tool but intelligent runbook management software where you can leverage AI and automation to speed up incident investigations.

  • Wiki
  • Github
  • Confluence
  • Google Docs

Benefits of using Runbooks

  1. Helps reduce the anxiety faced by an on-call engineer during a production incident by giving them a starting point.
  2. Reduces unplanned hours spent on call by senior engineers.
  3. Improves your MTTR.
  4. Improves your engineers’ work-life balance and developer productivity.
  5. Reduces dependency on “hero engineers” and helps debugging knowledge spread faster.

Intro to Runbook automation

  1. Runbook automation refers to the practice of scripting a frequently used set of steps, or making them automatically executable through code or a UI, so that each step does not have to be run manually. Benefits include:
    • Faster investigation and fixing of production incidents: An on-call engineer may or may not know how the systems operate and might often be at the mercy of documentation. If your team creates a repository of 20 automations for identifying common issues, it could significantly accelerate debugging and investigating production issues for your on-call engineer.
    • Knowledge transfer
    • Reduced human errors
    • Ease of access management
  2. Partial Automation:
    • Even if you cannot automate everything, it is still useful to speed up investigating and running runbooks for your users.
    • For example, think of a situation where your engineer can go to one tool and get all the relevant data instead of going to 14 different tools and managing them individually.
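
As a rough illustration of partial automation, the sketch below gathers data from several sources into one report, so the engineer reads a single summary instead of visiting each tool. The source names and the two fetch functions are hypothetical placeholders; in practice each would query a real monitoring or logging system.

```python
from typing import Callable


def collect_diagnostics(sources: dict[str, Callable[[], str]]) -> dict[str, str]:
    """Call every data source and gather the results in one place.

    A failing source is recorded as an error instead of aborting the run,
    so the engineer still sees the data that could be fetched.
    """
    report = {}
    for name, fetch in sources.items():
        try:
            report[name] = fetch()
        except Exception as exc:
            report[name] = f"ERROR: {exc}"
    return report


# Hypothetical fetchers standing in for real tool integrations.
def fetch_cpu_summary() -> str:
    return "cpu: 42% average over 15m"


def fetch_recent_errors() -> str:
    raise TimeoutError("log backend did not respond")


report = collect_diagnostics({
    "metrics": fetch_cpu_summary,
    "logs": fetch_recent_errors,
})
for source, summary in report.items():
    print(f"{source}: {summary}")
```

Recording failures per source rather than raising keeps the report useful during an incident, when some tools may themselves be degraded.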

Doctor Droid helps you to automate Runbooks and accelerate production incident debugging. Read more on our website.

How to create a practice of writing runbooks

To build the practice:

  1. Make runbook creation part of a team’s documentation requirements for any new launch.
  2. Mandate that on-call engineers update runbooks after incidents or issues.
  3. Drive a blameless investigation culture.
  4. Structure the runbook repository so it is easy to discover during on-call.
    • Runbooks offer little benefit if the user cannot access them during on-call.
    • Provide links to important runbooks within alerts if possible so the on-call engineer already has a starting point.
    • Make the tags and structuring of the runbooks similar to how engineers think.
    • Ensure updation between the service catalog and runbook lists so that on-call engineers can find runbooks for upstream/downstream teams too — this could
    • Leverage runbook automation.
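
One way to give the on-call engineer a starting point is to attach a runbook link to each alert before it is sent. The sketch below assumes a simple in-memory index keyed by alert name; the alert names, the `runbook_url` field, and the wiki URLs are all made up for illustration and do not reflect any particular alerting tool.

```python
# Hypothetical index from alert name to runbook link.
RUNBOOK_INDEX = {
    "database-cpu-spike": "https://wiki.example.com/runbooks/database-cpu-spike",
    "order-volume-drop": "https://wiki.example.com/runbooks/order-volume-drop",
}

# Fallback when no specific runbook exists: the runbook repository itself.
FALLBACK_URL = "https://wiki.example.com/runbooks"


def annotate_alert(alert: dict) -> dict:
    """Return a copy of the alert with a runbook_url field added."""
    url = RUNBOOK_INDEX.get(alert.get("name", ""), FALLBACK_URL)
    return {**alert, "runbook_url": url}


annotated = annotate_alert({"name": "database-cpu-spike", "severity": "high"})
print(annotated["runbook_url"])
```

Alerts that fall back to the repository link are also a useful signal that a runbook may still need to be written.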


Made with ❤️ in Bangalore & San Francisco 🏢

Doctor Droid