Symptom-Based Alerts: Putting User Experience at the Forefront

Setting up SLOs and alerts based on user goals & objectives

In today's digital age, user experience is key. Traditional monitoring & observability tools diligently flag metrics from our infrastructure & APIs, but there is often a disconnect between these metrics and the user's real-world experience. Engineering teams should complement system observability with tracking of customer symptoms & SLOs.

What are Symptom-based alerts?

Symptom-based alerting means monitoring the customer's “goal” or “experience”, especially when setting up alerts & SLOs. Tracking customer experience & goals is a strong and actionable way to track user behavior.

“Operations is ultimately a business problem, not just a technical one.”

— Blog by the Google Cloud team

Risks of skipping symptom-based alerts

Distributed systems are already hard to troubleshoot and investigate. Sending the on-call engineer more alerts than they can triage doesn’t help teams much, and without symptom-based alerts it is hard to tell which of those alerts reflect real user impact.

Benefits of adding symptom-based alerts:

  1. User-Centric Approach: With symptom-based alerts and SLOs, user experience always remains central, translating telemetry data into actionable real-world insights on what’s happening with users.

  2. Reduced Noise: Traditional monitoring can flood teams with alerts, many of which might be insignificant in the context of overall system health. Symptom-based alerts focus on patterns that users actually notice, drastically reducing the number of irrelevant notifications.

  3. Immediate Impact Recognition: By highlighting issues that directly impact user experience, teams can act proactively, mitigating potential challenges and identifying root causes much faster.

Setting up symptom-based alerts:

Adding symptom-based alerts with custom instrumentation means defining SLOs and metrics that capture the customer experience or goal. This definition can happen at multiple points in the development lifecycle:

  • As part of the design process

  • As an iteration after the product/feature launch

  • As a further iteration once the product is stable
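As a minimal sketch of what "defining an SLO around a customer goal" can look like, the snippet below compares the observed success ratio of a goal against a target. The goal name `checkout_completed`, the 99% target, and the helper names are assumptions made for illustration; they do not come from any specific tool.

```python
from dataclasses import dataclass


@dataclass
class GoalSLO:
    """An SLO expressed in terms of a user goal (hypothetical structure)."""
    goal: str      # illustrative goal name, e.g. "checkout_completed"
    target: float  # fraction of attempts that must reach the goal


def success_ratio(successes: int, attempts: int) -> float:
    """Fraction of attempts in which the user reached the goal."""
    if attempts == 0:
        return 1.0  # assumption: no traffic in the window means no failures
    return successes / attempts


def is_breached(slo: GoalSLO, successes: int, attempts: int) -> bool:
    """True when the observed ratio falls below the SLO target."""
    return success_ratio(successes, attempts) < slo.target


checkout_slo = GoalSLO(goal="checkout_completed", target=0.99)
# 980 of 1000 checkouts completed: 0.98 is below the 0.99 target.
print(is_breached(checkout_slo, successes=980, attempts=1000))
```

The same definition can be revisited at each of the points listed above, for example by tightening `target` once the product is stable.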

While working on setting them up, here’s a simple framework to help you keep it actionable:

Mistakes to avoid while setting up alerts:

1. Only tracking individual components and not end goals:

This mistake could lead to missing out on tracking critical workflows that might be split across asynchronous steps.

Potential blind spot: A silently failing scheduled cron job, or a failure in publishing to a queue, could impact customers and go completely unnoticed by the team.
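One way to catch this kind of silent failure is to alert on the end goal going stale rather than on each component. The sketch below assumes you record a timestamp each time the whole workflow completes; the function name and the one-hour threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_workflow_stale(
    last_success: Optional[datetime],
    now: datetime,
    max_age: timedelta,
) -> bool:
    """True when the end goal has not been reached within max_age."""
    if last_success is None:
        return True  # the workflow has never completed
    return now - last_success > max_age


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_report_sent = now - timedelta(hours=3)  # illustrative value
# The job was expected to complete hourly, so three hours of silence is stale.
print(is_workflow_stale(last_report_sent, now, max_age=timedelta(hours=1)))
```

Because the check looks only at the final outcome, it fires whether the cron job, the queue publish, or any step in between failed.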

2. Relying only on auto-instrumented metrics:

Complement the APM golden signals and infrastructure metrics with custom metrics representing your user experience.

Potential blind spot: The error rate of your payment service, or the distribution of its response_status_code, does not by itself tell you the rate of successful payments.
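To make the payment example concrete, here is a minimal sketch comparing an auto-instrumented style signal (HTTP error rate) with a custom business metric (successful payment rate). The event fields `status_code` and `payment_outcome` are assumed for illustration.

```python
from typing import Mapping, Sequence


def http_error_rate(events: Sequence[Mapping]) -> float:
    """Share of requests that returned a 5xx status code."""
    if not events:
        return 0.0
    errors = sum(1 for e in events if e["status_code"] >= 500)
    return errors / len(events)


def payment_success_rate(events: Sequence[Mapping]) -> float:
    """Share of requests in which the payment actually succeeded."""
    if not events:
        return 1.0  # assumption: no attempts means nothing failed
    ok = sum(1 for e in events if e["payment_outcome"] == "succeeded")
    return ok / len(events)


# Hypothetical events: every request returned 200, but half were declined.
events = [
    {"status_code": 200, "payment_outcome": "succeeded"},
    {"status_code": 200, "payment_outcome": "succeeded"},
    {"status_code": 200, "payment_outcome": "declined"},
    {"status_code": 200, "payment_outcome": "declined"},
]
print(http_error_rate(events), payment_success_rate(events))
```

In this made-up data the HTTP error rate is zero while only half of the payments went through, which is the gap a custom metric is meant to expose.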

3. Not adding tags/identifiers:

Add identifiers to your metrics to help you identify impacted users. These tags could range from a “client name” to your user’s “device type” to the “user-id”.

Potential blind spot: Your overall SLOs might be well within limits even though they have been breached significantly for a specific customer. Without the tags, it’ll be hard for your team to identify the radius of impact.
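The sketch below illustrates this blind spot: it groups events by a tag and reports the tag values whose success ratio falls below the target, even when the overall ratio looks healthy. The `client` tag, the client names, and the 95% target are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, Mapping, Sequence


def breached_tags(
    events: Sequence[Mapping], tag: str, target: float
) -> Dict[str, float]:
    """Return {tag value: success ratio} for values below the target."""
    totals = defaultdict(lambda: [0, 0])  # tag value -> [successes, attempts]
    for e in events:
        counts = totals[e[tag]]
        counts[0] += 1 if e["success"] else 0
        counts[1] += 1
    return {
        value: s / n for value, (s, n) in totals.items() if s / n < target
    }


# Hypothetical traffic: one large healthy client, one small failing client.
events = (
    [{"client": "acme", "success": True}] * 95
    + [{"client": "globex", "success": True}]
    + [{"client": "globex", "success": False}] * 4
)
overall = sum(e["success"] for e in events) / len(events)
print(overall)                                 # 96 of 100 succeeded overall
print(breached_tags(events, "client", 0.95))   # only "globex" is below target
```

Note that high-cardinality tags such as a user id can be expensive in some metrics backends, so which identifiers to attach is a trade-off to weigh for your own setup.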

4. Missing out on adding the configurations in logs:

Configurations are an essential lifeline of any application, and changes to them can impact your users. Recording the active configuration in your logs makes that impact traceable.

Potential blind spot: A recent configuration change might have triggered an impact to your users, but might go unnoticed if there’s no way to correlate your metrics to the configurations.
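One possible way to make that correlation easier is to stamp every log record with a short fingerprint of the active configuration. This is a sketch under assumptions: the config is a JSON-serializable dict, and the field name `config_version` and both helper names are hypothetical.

```python
import hashlib
import json
import logging


def config_fingerprint(config: dict) -> str:
    """Stable short hash of a JSON-serializable config dict."""
    # Sorting keys makes the hash independent of key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def event_record(event: str, config: dict) -> str:
    """Build a JSON log line that carries the config fingerprint."""
    return json.dumps(
        {"event": event, "config_version": config_fingerprint(config)},
        sort_keys=True,
    )


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")  # illustrative logger name

config = {"retry_limit": 3, "timeout_seconds": 10}  # hypothetical settings
logger.info(event_record("payment_failed", config))
```

When a symptom-based alert fires, grouping failures by `config_version` shows whether they started with a particular configuration change.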

5. Using alerts as a goal, not a means to improvement:

While it’s critical to improve alerting & monitoring capabilities for operational reasons, alerts are also a powerful way to identify areas of improvement in your application and make it more reliable. 😊

If you want to read more about the topic, I’d recommend this document authored by Rob Ewaschuk, an SRE at Google.

About Doctor Droid:

Doctor Droid is a real-time analytics platform to help teams create and track critical product & operational metrics with smart alerts & dashboards. Here's the link to sign up and try the product!

Written by

Siddarth Jain
