
Dr. Patternson: How Meta reduced their MTTR by 50% using AIOps


We discuss the internals of Meta's AIOps platform that reduced the time to resolve critical alerts by half


Introduction

For Meta, reducing downtime is crucial to ensuring millions (or should I say billions?) of users have a seamless experience. Recently, Meta described one of its internal platforms, which helped reduce MTTR by roughly 50% for critical alerts.

This blog explores how Meta accomplished this by leveraging AI, machine learning & runbook automation to transform its incident response processes, making them faster and more efficient.

Let's dive into the key components that enabled this efficiency gain.

Objective

Imagine trying to find a needle in a haystack—blindfolded. That’s what incident management can feel like without the right tools. Meta’s objective was to take off that blindfold and make the process of finding and fixing problems as swift and accurate as possible. Their goal? Cut down the Mean Time to Resolution (MTTR) by half, so that when things go wrong, they can be fixed faster than you can say “downtime.”

By harnessing the power of AI and machine learning, Meta aimed to automate the grunt work of incident management—spotting issues, figuring out what’s broken, and fixing it—all without requiring a superhero on standby. This isn’t just about cool tech; it’s about making sure users experience as little disruption as possible, turning potential disasters into minor hiccups that barely anyone notices.

What did Meta build?

Alright, let's peek under the hood of Meta's incident-busting machine. They didn't just slap on a new coat of paint; they rebuilt the entire engine. Here are the three turbocharged components that turned their incident response from a clunky old jalopy into a sleek, AI-powered sports car:

Component 1: Automated Runbooks

Remember those old-school detective novels where the brilliant sleuth solves the case with a magnifying glass and a pipe? Well, Meta created a digital Sherlock Holmes, minus the pipe smoke. They call it Dr. Patternson (Dr. P for short), and it's like having a tireless detective on call 24/7.

Dr. P is an automated runbook system that encodes expert knowledge into executable investigation workflows. It's like giving every on-call engineer a cheat sheet written by the smartest person in the room. With its own SDK, simplified APIs, and ML algorithms, Dr. P can quickly analyze data, correlate events, and generate findings faster than you can say "Elementary, my dear Watson."

But wait, there's more! Dr. P comes with a fully managed platform that deploys these runbooks, monitors for issues, and even triggers investigations automatically when an alert fires. It's like having a whole team of digital detectives working round the clock, leaving no log unturned.
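To make the runbook idea concrete, here is a minimal sketch in Python. It is not Dr. P's SDK, which is internal to Meta. The names `Runbook`, `Finding`, `on_alert`, the alert name, and the two example checks with their thresholds are all hypothetical, chosen only to show how expert checks could be encoded as steps that run automatically when an alert fires.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Finding:
    """One result produced by an investigation step."""
    step: str
    summary: str


# A step takes the alert context and returns a Finding, or None if nothing stood out.
Step = Callable[[Dict[str, float]], Optional[Finding]]


@dataclass
class Runbook:
    """Hypothetical runbook: an ordered list of expert-written checks."""
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self, alert: Dict[str, float]) -> List[Finding]:
        findings = []
        for step in self.steps:
            result = step(alert)
            if result is not None:
                findings.append(result)
        return findings


def check_error_rate(alert):
    # Assumed threshold of 5%; a real runbook would query a metrics store.
    if alert.get("error_rate", 0.0) > 0.05:
        return Finding("check_error_rate", "error rate above 5%")
    return None


def check_recent_deploy(alert):
    # Flags a deploy that landed less than 30 minutes before the alert.
    if alert.get("minutes_since_deploy", float("inf")) < 30:
        return Finding("check_recent_deploy", "deploy within the last 30 minutes")
    return None


# Registry mapping alert names to runbooks, so an alert can trigger an investigation.
RUNBOOKS = {
    "api_errors_high": Runbook("api_errors", [check_error_rate, check_recent_deploy]),
}


def on_alert(alert_name, alert):
    """Entry point a platform would call when an alert fires."""
    runbook = RUNBOOKS.get(alert_name)
    return runbook.run(alert) if runbook else []


findings = on_alert("api_errors_high", {"error_rate": 0.12, "minutes_since_deploy": 10})
for f in findings:
    print(f"{f.step}: {f.summary}")
```

The design point is that the on-call engineer never has to remember which checks to run: the alert name selects the runbook, and the findings arrive with the page.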

Component 2: Analysis Algorithms Service

If Dr. P is the detective, then the Analysis Algorithms Service is its time machine. This nifty piece of tech allows Meta's engineers to zoom through vast amounts of data at warp speed. Picture this: you've got more data than stars in the sky, and you need to find that one glowing red dot that's causing all the trouble. That's where this service comes in. It's packed with ML algorithms for dimensional analysis, time series analysis, anomaly detection, and more. But the real magic is in its pre-aggregation layer, which shrinks datasets by up to 500 times! It's like compressing the entire Library of Congress into a pocket-sized book while keeping everything an investigation needs.

The result? Insights that used to take hours now pop up in seconds. It's so fast, you might think it's predicting the future. (Spoiler alert: it's not. That's still on the roadmap for 2025.)
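Here is a rough sketch of why pre-aggregation helps, assuming nothing about Meta's actual implementation. Raw events are collapsed into per-minute counts, and a simple leave-one-out z-score check (a stand-in for the service's ML algorithms) looks for the odd bucket. The function names, the 60-second bucket, and the sample data are all made up for illustration.

```python
from collections import Counter
from statistics import mean, stdev


def pre_aggregate(events, bucket_seconds=60):
    """Collapse raw (timestamp, service) events into counts per (service, bucket start)."""
    counts = Counter()
    for timestamp, service in events:
        bucket = timestamp - (timestamp % bucket_seconds)
        counts[(service, bucket)] += 1
    return counts


def anomalous_buckets(series, threshold=3.0):
    """Return indexes whose value sits more than `threshold` standard
    deviations from the mean of the remaining points (leave-one-out)."""
    flagged = []
    for i, value in enumerate(series):
        others = series[:i] + series[i + 1:]
        if len(others) < 2:
            continue  # stdev needs at least two points
        mu, sigma = mean(others), stdev(others)
        if sigma == 0:
            if value != mu:
                flagged.append(i)
        elif abs(value - mu) > threshold * sigma:
            flagged.append(i)
    return flagged


# Made-up traffic: five minutes of events for one service, with a spike in minute 3.
per_minute = [10, 11, 9, 50, 10]
events = [(minute * 60 + i, "api") for minute, n in enumerate(per_minute) for i in range(n)]

counts = pre_aggregate(events)
series = [counts[("api", minute * 60)] for minute in range(len(per_minute))]

print(f"{len(events)} raw events reduced to {len(counts)} rows")
print("anomalous minutes:", anomalous_buckets(series))
```

The anomaly check only ever touches the small aggregated table, which is where the speedup comes from. The trade-off is that per-event detail is gone, so the bucket size and dimensions have to be chosen with the investigation in mind.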

Component 3: Event Isolation Assistance

Last but not least, we have the Event Isolation Assistance. Think of it as a super-smart metal detector for that proverbial needle in a haystack.

This system uses ML models to rank thousands of events and pinpoint the root cause of an incident. It's like having a psychic on your team, except this one actually works. By focusing on config-based and code-based isolation, it can filter out 80% of the uninteresting events during an active investigation. That's right, it separates the wheat from the chaff, leaving engineers with a much smaller, much more suspicious pile of events to investigate.

But it doesn't just point fingers. The system provides annotations explaining its reasoning, making it transparent and trustworthy. It's like having a really smart friend who not only tells you the answer but also shows their work.
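A toy version of this ranking-with-reasons idea might look like the following. Meta uses ML models here; this sketch swaps in a hand-written heuristic with made-up weights purely to show the shape of the output: a short list of suspects, each annotated with why it was ranked. `ChangeEvent`, `RankedEvent`, and the scoring rules are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeEvent:
    event_id: str
    kind: str                      # "config" or "code"
    minutes_before_alert: float
    touches_alerting_service: bool


@dataclass
class RankedEvent:
    event: ChangeEvent
    score: int
    reasons: List[str] = field(default_factory=list)


def score_event(event):
    """Heuristic stand-in for an ML ranker; the weights are illustrative only."""
    score, reasons = 0, []
    if event.touches_alerting_service:
        score += 5
        reasons.append("changes the service that raised the alert")
    if 0 <= event.minutes_before_alert <= 30:
        score += 4
        reasons.append("landed within 30 minutes before the alert")
    if event.kind == "config":
        score += 1
        reasons.append("is a config change")
    return RankedEvent(event, score, reasons)


def rank_events(events, keep_fraction=0.2):
    """Sort by score and keep the top fraction, dropping the other 80% by default."""
    ranked = sorted((score_event(e) for e in events), key=lambda r: r.score, reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:keep]


# Nine old, unrelated code changes plus one recent config change to the alerting service.
events = [ChangeEvent(f"code-{i}", "code", 600.0 + i, False) for i in range(9)]
events.append(ChangeEvent("config-42", "config", 12.0, True))

top = rank_events(events)
for ranked in top:
    print(ranked.event.event_id, ranked.score, "; ".join(ranked.reasons) or "no signal")
```

Returning the reasons alongside the score is what lets an engineer sanity-check the ranking instead of taking it on faith.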

Guided Investigations

Sometimes, even the smartest AI needs a human touch. That's where Guided Investigations come in. Think of it as a choose-your-own-adventure book, but for fixing tech problems.

These decision trees provide step-by-step workflows that help investigators narrow down the root cause of an issue. It's like having a seasoned pro whispering in your ear, guiding you through the digital labyrinth. By combining automated workflows with human expertise, these guided investigations can tackle complex issues that might stump a fully automated system.

And the best part? They're right where you need them, integrated with Meta's detection systems. It's like having a tech support genie, ready to pop out whenever an alert goes off. No need to rub a lamp – just click a button!
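The choose-your-own-adventure structure maps naturally onto a small decision tree. The sketch below is an assumption about the general shape, not Meta's tooling: the `Node` class, the questions, and the recommended actions are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A yes/no question with two branches, or a leaf carrying a conclusion."""
    text: str
    yes: Optional["Node"] = None
    no: Optional["Node"] = None

    @property
    def is_leaf(self):
        return self.yes is None and self.no is None


def investigate(node, answers):
    """Walk the tree using answers keyed by question text.

    Returns the path taken and the conclusion at the leaf."""
    path = []
    while not node.is_leaf:
        answer = answers[node.text]
        path.append((node.text, answer))
        node = node.yes if answer else node.no
    return path, node.text


# Hypothetical tree an expert might write for an error-rate alert.
tree = Node(
    "Did the error rate rise after a deploy?",
    yes=Node("Roll back the deploy and confirm recovery"),
    no=Node(
        "Is a single region affected?",
        yes=Node("Check regional infrastructure health"),
        no=Node("Escalate to the owning team"),
    ),
)

answers = {
    "Did the error rate rise after a deploy?": False,
    "Is a single region affected?": True,
}
path, conclusion = investigate(tree, answers)
for question, answer in path:
    print(f"{question} -> {'yes' if answer else 'no'}")
print("Next step:", conclusion)
```

In a real tool the answers would come from the investigator clicking through, or from an automated check filling in the branches it can resolve on its own.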

Current State of Investigations at Meta

So, where does Meta stand now in their AIOps journey? Let's just say they've gone from digital chaos to zen master status.

Today, Meta's foundational systems are more popular than cat videos (well, almost). Hundreds of teams have adopted them, running over 500,000 analyses per week. That's more check-ups than a hypochondriac gets in a lifetime!

The impact? A cool 50% decrease in MTTR for critical alerts across the company. It's like they've upgraded from a horse-drawn buggy to a supersonic jet when it comes to fixing problems. Take the Ads Manager team, for instance. They've gone from spending days investigating issues to resolving them in minutes. It's like they've traded in their magnifying glass for a high-powered microscope with AI-assisted focusing.
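For anyone who wants to track the same headline number on their own team, MTTR is simply the average of resolution time minus detection time across incidents. The sketch below uses made-up incident timestamps, not Meta's data, to show how a 50% reduction would be computed.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time to resolution over (detected_at, resolved_at) pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


start = datetime(2024, 1, 1, 9, 0)

# Made-up incidents: resolution took 60 and 120 minutes before, 30 and 60 after.
before = [(start, start + timedelta(minutes=60)), (start, start + timedelta(minutes=120))]
after = [(start, start + timedelta(minutes=30)), (start, start + timedelta(minutes=60))]

reduction = 1 - mttr(after) / mttr(before)
print(f"MTTR before: {mttr(before)}, after: {mttr(after)}, reduction: {reduction:.0%}")
```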

Implementing Your Own Dr. Patternson Using Doctor Droid

If you want to implement a solution like Dr. Patternson within your team without investing time and money on the scale Meta did, you might want to explore Doctor Droid.

Doctor Droid is an AI-assisted intelligence platform that helps engineering teams reduce the time spent investigating production issues by 10x. Here's what you can do with Doctor Droid:

(a) Codify your investigation mental models:

Doctor Droid PlayBooks is an open-source on-call automation platform. With one click, you can run your investigation steps and have the diagnostic data from all your tools delivered alongside your alerts.

(b) Leverage your past knowledge to get intelligent suggestions:

Doctor Droid's AIOps Platform can give your on-call engineers intelligent recommendations by drawing on the historical knowledge already available in your systems.

Combine (a) and (b), and you effectively end up with an equivalent of Dr. Patternson.

Try it out today by signing up here!

Conclusion

Meta's AIOps journey has transformed their incident response from a digital firefight into a well-oiled machine. Let's break down the impressive results:

  • 50% reduction in Mean Time to Resolution (MTTR) for critical alerts across the company

  • Over 500,000 automated analyses run per week

  • 80% of uninteresting events filtered out during active investigations

  • Ads Manager team improved investigation time from days to minutes

  • Nearly 50% of previously manual investigations now automated

These statistics paint a picture of a dramatically more efficient system, but what does it mean in the real world?

For Meta, it means:

  • Fewer service disruptions for millions of users

  • Faster resolution when issues do occur

  • Engineers spending less time on repetitive tasks and more on innovation

  • Improved overall system reliability and user experience

The secret sauce? A combination of:

  1. Automated Runbooks (Dr. Patternson)

  2. Analysis Algorithms Service

  3. Event Isolation Assistance

  4. Guided Investigations

This powerful quartet has turned Meta's incident response into a symphony of efficiency, conducting a harmonious blend of AI automation and human expertise.

As we look to the future, Meta's AIOps journey serves as a beacon for the tech industry. It shows us that with the right tools and approach, we can tame the chaos of complex systems and create a more reliable digital world. So the next time you scroll through your feed without a hitch, remember: there's a good chance Meta's AIOps team had a hand in making that seamless experience possible.