Symptom-Based Alerts: Putting User Experience at the Forefront
Setting up SLOs and alerts based on user goals & objectives
In today's digital age, user experience is key. Traditional monitoring & observability tools are diligent in flagging metrics from our infrastructure & APIs, but there is often a disconnect between these metrics and the user's real-world experience. Engineering teams should complement system observability by tracking customer symptoms & SLOs.
What are Symptom-based alerts?
Symptom-based alerting refers to monitoring the customer's “goal” / “experience”, especially when it comes to setting up alerts & SLOs. Tracking customer experience & goals is a strong and actionable way to understand how the product is behaving for the user.
“Operations is ultimately a business problem, not just a technical one.”
— Blog by the Google Cloud team
Risks of skipping symptom-based alerts
Distributed systems are already hard to troubleshoot and investigate, and flooding an on-call engineer with more alerts than they can act on doesn’t help. Without symptom-based alerts, it is hard to tell which of those alerts reflect real user impact.
Benefits of adding symptom-based alerts:
User-Centric Approach: With symptom-based alerts and SLOs, user experience always remains central, translating telemetry data into actionable real-world insights on what’s happening with users.
Reduced Noise: Traditional monitoring can flood teams with alerts, many of which might be insignificant in the context of overall system health. Symptom-based alerts focus on noticeable patterns, drastically reducing the number of irrelevant notifications.
Immediate Impact Recognition: By highlighting issues that directly impact user experience, teams can act sooner, mitigate the problem, and identify root causes faster.
Setting up symptom-based alerts:
Adding symptom-based alerts with custom instrumentation means defining SLOs and metrics that capture the customer experience/goal. This definition can happen at multiple points in the development lifecycle:
Define as part of the design process
Iterate after the product/feature launch
Re-iterate once the product has stabilized
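To make "defining an SLO around a customer goal" concrete, here is a minimal Python sketch. The `Slo` class, the `checkout_completed` goal, and the 99% target are all hypothetical names and numbers chosen for illustration; your own goals and thresholds will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A user-goal SLO: the share of attempts that must succeed."""

    name: str
    target: float  # e.g. 0.99 means 99% of attempts should succeed

    def success_rate(self, good: int, total: int) -> float:
        # Assumption: with no traffic there is nothing to breach, so report 100%.
        return 1.0 if total == 0 else good / total

    def is_breached(self, good: int, total: int) -> bool:
        return self.success_rate(good, total) < self.target


# Hypothetical goal: a user who starts checkout ends up with a completed order.
checkout_slo = Slo(name="checkout_completed", target=0.99)
print(checkout_slo.is_breached(good=985, total=1000))  # True: 98.5% < 99%
```

Keeping the SLO as a small, explicit object makes it easy to revisit the target at each of the stages listed above.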
While working on setting them up, here’s a simple framework to help you keep it actionable:
Mistakes to avoid while setting up alerts:
1. Only tracking individual components and not end goals:
This mistake could lead to missing out on tracking critical workflows that might be split across asynchronous steps.
Potential blind spot: A silently failing scheduled cron job or a failure in publishing to a queue could impact customers while going completely unnoticed by the team.
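One way to watch the end goal rather than each component is to check whether every workflow that started has also finished within a deadline. The sketch below assumes you can record a start time and a completion marker per workflow; the function name, the `order-*` IDs, and the 10-minute deadline are hypothetical.

```python
from datetime import datetime, timedelta


def stuck_workflows(started, completed, now, deadline):
    """Return IDs of workflows that started more than `deadline` ago and never completed.

    started:   dict of workflow_id -> datetime when the first step ran
    completed: set of workflow_ids whose final step succeeded
    """
    return sorted(
        wf_id
        for wf_id, started_at in started.items()
        if wf_id not in completed and now - started_at > deadline
    )


now = datetime(2024, 1, 1, 12, 0)
started = {
    "order-1": now - timedelta(minutes=30),
    "order-2": now - timedelta(minutes=30),  # never completed: a silent failure
    "order-3": now - timedelta(minutes=2),   # still within the deadline
}
completed = {"order-1"}

print(stuck_workflows(started, completed, now, timedelta(minutes=10)))  # ['order-2']
```

An alert on a non-empty result fires regardless of which intermediate step (cron job, queue publish, consumer) failed.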
2. Relying only on auto-instrumented metrics:
Complement the APM golden signals and infrastructure metrics with custom metrics that represent your user experience.
Potential blind spot: The error rate of your payment service, or the distribution of response_status_code, is not the same as the rate of successful payments.
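The gap between the two kinds of metric can be shown with a small Python sketch. The `payment_outcome` field and its values are hypothetical; the point is only that a request can return a healthy status code while the user's payment still does not go through.

```python
def http_error_rate(requests):
    """Share of requests that returned a 5xx status code (assumes a non-empty list)."""
    errors = sum(1 for r in requests if r["status_code"] >= 500)
    return errors / len(requests)


def payment_success_rate(requests):
    """Share of payment attempts that ended in a successful charge (assumes a non-empty list)."""
    succeeded = sum(1 for r in requests if r["payment_outcome"] == "succeeded")
    return succeeded / len(requests)


# Every request looks healthy to auto-instrumentation, yet half the payments failed.
requests = [
    {"status_code": 200, "payment_outcome": "succeeded"},
    {"status_code": 200, "payment_outcome": "declined"},
    {"status_code": 200, "payment_outcome": "pending"},
    {"status_code": 200, "payment_outcome": "succeeded"},
]

print(http_error_rate(requests))       # 0.0
print(payment_success_rate(requests))  # 0.5
```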
3. Not adding tags/identifiers:
Add identifiers to your metrics to help you identify impacted users. These tags could range from a “client name” to your user’s “device type” to the “user-id”.
Potential blind spot: Your overall SLOs might be well within their limits even though they have been breached significantly for a specific customer. Without tags, it will be hard for your team to identify the radius of impact.
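As a rough illustration of why tags matter, the sketch below computes the same success rate overall and per tag value. The `client` tag, the client names, the event counts, and the 98% target are made up for the example.

```python
from collections import defaultdict


def success_rate_by_tag(events, tag):
    """Group events by a tag (e.g. "client") and return each group's success rate."""
    good = defaultdict(int)
    total = defaultdict(int)
    for event in events:
        key = event[tag]
        total[key] += 1
        if event["ok"]:
            good[key] += 1
    return {key: good[key] / total[key] for key in total}


# Hypothetical traffic: one large healthy client, one small client failing half the time.
events = (
    [{"client": "acme", "ok": True}] * 980
    + [{"client": "globex", "ok": True}] * 10
    + [{"client": "globex", "ok": False}] * 10
)

target = 0.98
overall = sum(1 for e in events if e["ok"]) / len(events)
per_client = success_rate_by_tag(events, "client")
breached = [client for client, rate in per_client.items() if rate < target]

print(overall >= target)  # True: the aggregate SLO looks fine
print(breached)           # ['globex']
```

Note that high-cardinality tags such as a user ID can be expensive in some metrics backends, so you may prefer to keep those in logs or traces instead.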
4. Missing out on adding the configurations in logs:
Configurations are an essential lifeline of any application, and changes to them can impact your users.
Potential blind spot: A recent configuration change might have impacted your users, but it could go unnoticed if there’s no way to correlate your metrics with the configuration.
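One simple way to make that correlation possible is to attach the active configuration to each structured log line. This is only a sketch using Python's standard `logging` and `json` modules; the `log_event` helper and the `config_version` and `feature_flags` fields are hypothetical and should be replaced by whatever identifies your configuration state.

```python
import json
import logging

logger = logging.getLogger("payments")


def log_event(event, outcome, config):
    """Emit one structured log line that carries the active configuration."""
    record = {
        "event": event,
        "outcome": outcome,
        # Hypothetical fields: log whatever identifies your config state.
        "config_version": config["version"],
        "feature_flags": sorted(config["flags"]),
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


active_config = {"version": "v42", "flags": {"new_checkout", "retry_on_timeout"}}
print(log_event("payment_attempt", "failed", active_config))
```

With the version on every line, a spike in failed outcomes can be grouped by `config_version` to see whether it started with a specific change.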
5. Using alerts as a goal, not a means to improvement:
While improving your alerting & monitoring capabilities is critical for operational reasons, alerts are also a powerful way to identify areas of improvement in your application and make it more reliable. 😊
If you want to read more about the topic, I’d recommend this document authored by Rob Ewaschuk, an SRE at Google.
About Doctor Droid:
Doctor Droid is a real-time analytics platform to help teams create and track critical product & operational metrics with smart alerts & dashboards. Here's the link to sign up and try the product!