RCACoPilot: A breakdown of how Microsoft built their Automated RCA Bot

Sep 2, 2024

An article describing Microsoft's approach that helped them classify their incidents and investigate issues faster

Introduction

Big Tech companies often have enough scale to justify allocating resources to building internal tools. In this blog, we discuss RCACoPilot -- an automated incident classification and investigation engine built by Microsoft to improve the lives of their on-call engineers.

Less than a year ago, Microsoft published a research paper discussing RCACoPilot. It's a longish paper (about 16 pages), so I decided to condense it into a shorter blog.

Context

Picture this: You're an on-call engineer at Microsoft. It's 3 AM, and suddenly, alerts start blaring. Something's wrong with the email service (which delivers over 150 billion messages daily) that millions of people rely on. Your job? Figure out what's causing the issue and fix it ASAP. No pressure, right?

This scenario plays out all too often at Microsoft and at most other companies. Systems grow more complex by the day, and on-call engineers are drowning in a sea of alerts, logs, and metrics. This often leads to escalations and to time being spent on tickets rather than planned work. Microsoft's engineers needed a way to streamline the process, quickly make sense of all this information, and zero in on the root cause of problems. That's what RCACoPilot tries to solve for them.

Why should you even read this article? The Results

Let's cut to the chase: How well does RCACopilot perform, and is it even worth reading about the approach? Its performance is pretty impressive, as it turns out.

Tested on 653 real-world incidents from Microsoft's email service (which handles about 150 billion messages daily), RCACopilot achieved:

76.6% accuracy in predicting root cause categories
A Macro-F1 score of 0.533, showing good performance across various incident types
Significantly reduced MTTR:
- Auto-diagnosis ran with an average time of 1-10 minutes depending on the complexity of the incident handlers (more on it below).
- An average classification time of just 4.2 seconds per incident.

[Quick note on that Macro-F1 score: It's a measure that gives equal importance to each category, regardless of how often it appears. A score of 0.533 tells us that RCACopilot performs well across various incident types, not just the common ones. This is crucial in a system where rare, critical issues are just as important as frequent, minor ones.]
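To make the difference between accuracy and Macro-F1 concrete, here is a small sketch on made-up data. The category names and the counts are invented for illustration and have nothing to do with the paper's dataset. It shows how a lazy predictor that always guesses the common category can look fine on accuracy while scoring poorly on Macro-F1.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score for a single category, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct hits for this category, so F1 is 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-category F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of incidents whose category was predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented toy data: 8 common incidents and 2 rare ones.
y_true = ["disk_full"] * 8 + ["cert_expired"] * 2
# A lazy predictor that always answers with the common category.
y_pred = ["disk_full"] * 10

print("accuracy:", accuracy(y_true, y_pred))  # 0.8
print("macro F1:", macro_f1(y_true, y_pred))  # about 0.44
```

The rare category contributes an F1 of 0, which drags the macro average down even though 8 out of 10 predictions are correct.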

These numbers outperformed all baseline methods, including traditional machine learning approaches and non-fine-tuned GPT models.

But the real proof is in the deployment. Parts of RCACopilot have been in use at Microsoft for over four years, across more than 30 teams. On-call engineers report significant time savings in incident management tasks, from diagnosis to mitigation.

What this basically means is that within a few minutes, most of the likely investigation steps are run and analysed, and the engineers are given a likely root cause category for the incident.

RCACoPilot -- The Architecture

Now, let's peek under the hood of RCACoPilot. Think of it as a super-smart detective for computer problems. Here's a simple breakdown of how it works:

Diagnosis Identifier: In the original paper, this is part of what they call the "Diagnostic Information Collection Stage". It takes in an incident from the alerting tool, parses the incident context, and matches it to an existing playbook (called an Incident Handler) based on the defined mapping logic.
Diagnosis Data Fetch & Summarisation: The playbook identified in the previous step is custom-defined by the on-call engineers (OCEs). The system executes all the steps preconfigured in that playbook, from fetching logs and metrics to running further diagnostic steps.
Incident Predictor: The fetched diagnosis data is converted into an embedding and compared with the embeddings of past incidents. The closest matches are then used to identify the potential root cause of the issue. This corresponds to the "Root Cause Prediction Stage" in the paper.

The actual benefit of RCACoPilot lies in how the entire pipeline works out. When an alert comes in, the Incident Handler kicks into gear, following predefined playbooks to gather relevant data. This could involve querying databases, analyzing log files, or even running diagnostic scripts.

Component 1: Diagnosis Identifier

Incident Parser:

The Incident Parser is the entry point of RCACoPilot. It's connected to the company's existing alerting systems through triggers or webhooks, and it parses each incoming incident to identify the key entities of interest.
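As a minimal sketch of what parsing and matching could look like, assume the alerting tool sends a JSON-like payload and that handlers are looked up by alert type. The payload fields, the handler names, and the registry are all hypothetical. The paper does not publish this code.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    incident_id: str
    alert_type: str
    scope: str   # e.g. the machine or cluster the alert fired on
    title: str


# Hypothetical mapping from alert type to the handler (playbook) to run.
HANDLER_REGISTRY = {
    "delivery_queue_backlog": "queue_backlog_handler",
    "disk_space_low": "disk_space_handler",
}


def parse_incident(payload: dict) -> Incident:
    """Pull the fields of interest out of a raw webhook payload."""
    return Incident(
        incident_id=str(payload["id"]),
        alert_type=payload["alert_type"].strip().lower(),
        scope=payload.get("scope", "unknown"),
        title=payload.get("title", ""),
    )


def match_handler(incident: Incident,
                  registry: dict = HANDLER_REGISTRY,
                  default: str = "generic_triage_handler") -> str:
    """Return the handler registered for this alert type, or a fallback."""
    return registry.get(incident.alert_type, default)


payload = {
    "id": 1042,
    "alert_type": " Delivery_Queue_Backlog ",
    "scope": "mailbox-server-17",
    "title": "Too many messages stuck in the delivery queue",
}
incident = parse_incident(payload)
print(match_handler(incident))
```

Falling back to a generic handler is a design choice made for this sketch: an alert type nobody has written a playbook for still gets some basic triage instead of being dropped.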

Handler:

A Handler is a set of predefined steps to run for a specific type of investigation. It is effectively a programmatic SOP for a certain type of issue. These are NOT AI GENERATED. On-call engineers write them to document their investigation strategies for each type of issue within their service.

Types of Actions

The Incident Handler uses several types of actions to investigate and respond to incidents:

Scope Switching Action: This allows the handler to adjust its focus dynamically. It might start by looking at a single server, then expand to an entire cluster if needed.
Query Action: Think of this as the handler's way of asking questions. It can pull data from various sources like databases, log files, or even run scripts to gather system information. The results come back as key-value pairs, giving the handler structured data to work with.
Mitigation Action: Sometimes, the handler can take steps to address the problem directly. This could involve restarting a service, clearing disk space, or even calling in specialized teams for complex issues.

The paper illustrates this with an incident handler for a "too many messages stuck in the delivery queue" alert.
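As a rough illustration of the structure only (this is not the handler from the paper), a handler can be modelled as a small graph of actions, where each action returns an outcome that decides which action runs next. The action names, the queue-length threshold, and the fake metrics below are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Action:
    kind: str                    # "scope_switch", "query" or "mitigation"
    run: Callable[[dict], str]   # takes the shared context, returns an outcome label
    next_by_outcome: Dict[str, str] = field(default_factory=dict)


def run_handler(actions: Dict[str, Action], start: str, context: dict,
                max_steps: int = 20) -> List[str]:
    """Walk the action graph from `start` until no next action is defined."""
    trail: List[str] = []
    current: Optional[str] = start
    while current is not None and len(trail) < max_steps:
        action = actions[current]
        outcome = action.run(context)
        trail.append(f"{current}:{outcome}")
        current = action.next_by_outcome.get(outcome)
    return trail


def widen_to_cluster(ctx: dict) -> str:
    ctx["scope"] = "cluster"     # scope switching: single server -> whole cluster
    return "done"


def check_queue_length(ctx: dict) -> str:
    length = ctx["fake_metrics"]["queue_length"]   # stand-in for a real metrics query
    ctx["diagnostics"]["queue_length"] = length    # results are kept as key-value pairs
    return "high" if length > 1000 else "normal"   # threshold is made up


def suggest_restart(ctx: dict) -> str:
    ctx["diagnostics"]["suggested_mitigation"] = "restart the delivery service"
    return "done"


QUEUE_HANDLER = {
    "widen_scope": Action("scope_switch", widen_to_cluster, {"done": "queue_length"}),
    "queue_length": Action("query", check_queue_length, {"high": "mitigate"}),
    "mitigate": Action("mitigation", suggest_restart),
}

context = {"scope": "server", "diagnostics": {}, "fake_metrics": {"queue_length": 4200}}
trail = run_handler(QUEUE_HANDLER, "widen_scope", context)
print(trail)
print(context["diagnostics"])
```

Because the "normal" outcome has no next action, a healthy queue simply ends the run without suggesting a mitigation.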

Component 2: Diagnosis Data Fetch & Summarisation

In this part, the system executes a series of commands as defined by the incident handler.

It has access to the internal and external tools that hold the relevant information.
Each step in the handler is structured and mapped to a technical task (an API call, a log fetch, a metric fetch, or any other task).
The system automatically queries each data tool and fetches the results.

These are some of the data points that the system can fetch from the handlers (playbooks):

Logs: Application logs, system logs, security logs.
Metrics: Performance metrics and resource utilization stats.
Traces: Detailed records of how requests flow through the system.
Configuration data: Current system settings that might be relevant.
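A sketch of how such a collection step might be wired up, assuming each handler step names a data source and a target. The fetcher functions below are stubs that return canned values. In a real setup they would call your logging, metrics, and configuration tools. All names here are hypothetical.

```python
from typing import Callable, Dict, List

Fetcher = Callable[[str], Dict[str, str]]


def fetch_logs(target: str) -> Dict[str, str]:
    # Stub: a real fetcher would query a log store for `target`.
    return {"error_count": "17", "last_error": "TimeoutException"}


def fetch_metrics(target: str) -> Dict[str, str]:
    # Stub: a real fetcher would query a metrics backend.
    return {"cpu_percent": "93"}


def fetch_config(target: str) -> Dict[str, str]:
    # Stub that simulates a data source being unavailable.
    raise ConnectionError("config store unreachable")


FETCHERS: Dict[str, Fetcher] = {
    "logs": fetch_logs,
    "metrics": fetch_metrics,
    "config": fetch_config,
}


def collect_diagnostics(steps: List[Dict[str, str]],
                        fetchers: Dict[str, Fetcher]) -> Dict[str, str]:
    """Run each step and merge the results into one flat key-value record."""
    collected: Dict[str, str] = {}
    for step in steps:
        source, target = step["source"], step["target"]
        try:
            result = fetchers[source](target)
        except Exception as exc:
            # Record the failure and carry on so that one broken source
            # does not hide the data from the others.
            collected[f"{source}.error"] = f"{type(exc).__name__}: {exc}"
            continue
        for key, value in result.items():
            collected[f"{source}.{key}"] = value
    return collected


steps = [
    {"source": "logs", "target": "mailbox-server-17"},
    {"source": "metrics", "target": "mailbox-server-17"},
    {"source": "config", "target": "mailbox-server-17"},
]
diagnostics = collect_diagnostics(steps, FETCHERS)
for key, value in diagnostics.items():
    print(key, "=", value)
```

Prefixing each key with its source keeps the merged record unambiguous when two tools report a field with the same name.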

Component 3: Incident Predictor

The ML part of RCACopilot happens in what we call "Incident Predictor". This is where it uses Large Language Models (LLMs) to make sense of all the data collected by the Incident Handler.

Here's how it uses LLMs to analyze incidents:

Summarization: First, the LLM takes all the diagnostic information collected and creates a concise summary. This step is crucial because it condenses vast amounts of data into something manageable for both the AI and human engineers.

Similarity Matching: Next, RCACopilot creates an embedding to represent incidents as points in a high-dimensional space and then uses a Nearest Neighbour Search Algorithm. ELI5: similar incidents will be close to each other in this space.

Using this technique, RCACopilot finds past incidents that are most similar to the current one. This is important because similar past incidents can provide valuable clues about the current problem.
Chain-of-Thought Prompting: This is where things get really interesting. RCACopilot uses a technique called "Chain-of-thought" prompting. Instead of just asking the LLM "What's the root cause?", it prompts the model to think through the problem step-by-step, much like a human engineer would.

It does this by showing the LLM examples of how similar past incidents were solved. This is akin to training a junior engineer by walking them through past case studies before asking them to solve a new problem.
Root Cause Prediction: Based on this careful analysis, the LLM then predicts the most likely root cause of the incident. But it doesn't stop there.
Explanation Generation: Crucially, the LLM also generates an explanation for its prediction. This isn't just a black box spitting out an answer - it's more like a colleague explaining their reasoning. This explanation helps human engineers understand and verify the AI's conclusion.
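The retrieval and prompting steps above can be sketched as follows. This is a simplified illustration, not the paper's implementation: the three-dimensional embeddings are hand-written toy vectors, cosine similarity stands in for whatever distance measure the real system uses, the incident summaries and categories are invented, and the prompt wording is my own. No LLM is called; the sketch only builds the prompt text.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query: List[float], history: List[Dict], k: int = 2) -> List[Dict]:
    """Brute-force nearest-neighbour search over past incident embeddings."""
    ranked = sorted(history,
                    key=lambda inc: cosine_similarity(query, inc["embedding"]),
                    reverse=True)
    return ranked[:k]


def build_prompt(current_summary: str, neighbours: List[Dict]) -> str:
    """Assemble a few-shot prompt that asks for step-by-step reasoning."""
    lines = ["You are helping an on-call engineer find the root cause category",
             "of an incident. Here are similar past incidents:", ""]
    for i, inc in enumerate(neighbours, start=1):
        lines += [f"Example {i}",
                  f"Summary: {inc['summary']}",
                  f"Root cause category: {inc['category']}",
                  ""]
    lines += ["New incident",
              f"Summary: {current_summary}",
              "Think through the problem step by step, then give the most likely",
              "root cause category and a short explanation."]
    return "\n".join(lines)


# Invented history of past incidents with toy embeddings.
history = [
    {"summary": "Outbound queue growing, SMTP connections timing out",
     "category": "network_timeout", "embedding": [0.9, 0.1, 0.0]},
    {"summary": "Disk nearly full on a mailbox server",
     "category": "disk_full", "embedding": [0.0, 0.2, 0.9]},
    {"summary": "Queue backlog after a bad configuration rollout",
     "category": "config_error", "embedding": [0.7, 0.6, 0.1]},
]

neighbours = top_k_similar([0.8, 0.2, 0.0], history, k=2)
prompt = build_prompt("Messages stuck in the delivery queue for 30 minutes", neighbours)
print(prompt)
```

A linear scan like this is fine for a few hundred incidents. With a much larger history you would likely swap it for an approximate nearest-neighbour index.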

Feedback loop: While it doesn't learn in real-time, feedback from engineers can be used to periodically retrain and improve the model. This means RCACopilot can get better over time, learning from each incident it analyzes.

By leveraging the power of LLMs in this way, RCACoPilot can quickly analyze complex incidents, drawing insights from vast amounts of data and past experiences. It's like having an AI assistant that has seen every incident your organization has ever faced, can think through problems step-by-step, and can clearly explain its reasoning. This not only speeds up incident resolution but also helps engineers learn and improve their own diagnostic skills.

Implementing your Own RCACoPilot using Doctor Droid:

If you want to implement a solution like RCACoPilot within your team without investing the time and cost that Microsoft did, you might want to explore Doctor Droid.

Doctor Droid is an AI-assisted intelligence platform to help engineering teams reduce investigation time of production issues by 10x. Here's what you can do with Doctor Droid:

(a) Codify your investigation mental models:

Doctor Droid PlayBooks is an Open-Source On-call automation platform. With one click, you can run your investigation steps and have all the diagnosis data across all tools, directly fed in response to your alerts.

(b) Leverage your past knowledge to get intelligent suggestions:

Doctor Droid's AIOps Platform can provide your on-call engineers with intelligent recommendations by leveraging the historical knowledge that is already accessible in your systems.

With (a) and (b) combined, you effectively end up with an equivalent of RCACoPilot.

Try it out today by signing up here!