An Engineering Manager's Guide to Alert Operations — Part 2
Tips & tricks for managing alert routing and fine-tuning in a complex microservices architecture.
Part 2 - Managing and Improving Alert Operations
In the previous article, I discussed how to measure the quality of your alerts and what you can do to optimise them and make them more useful. This section will provide a deeper dive into important aspects of alerts and how to use them to stay ahead of incidents and customer impact. The goal is to ensure that alerts are actionable without overwhelming your team with unnecessary noise.
Assuming you are an Engineering Manager for a team that receives many critical alerts day in, day out, here is my take on improving how they are handled.
Objective
Make alerts actionable while keeping the noise they add for the team to a minimum.
Choose the right alert consumption strategy
Proactive teams choose one of two strategies:
A single team consumes all alerts and, after receiving an alert, follows these steps in order:
Assess the impact of the incident signalled by the alert.
If there is no impact, mark the alert as acknowledged and move on.
If there is impact, investigate further to identify the root cause.
Execute a remediation process for the incident if one exists; otherwise, escalate to the service owner best placed to resolve it.
This escalation mostly happens in the form of a ticket created in a tool like PagerDuty, Opsgenie, etc. (a code sketch of this triage flow follows below). Here is a document on how to link Prometheus alerts to email, Slack, and PagerDuty.
Every team consumes alerts for its own services and for its upstream and downstream components. When an alert fires, they know what to do.
For teams of up to 20 engineers, the second strategy is easy to set up and manage. For bigger teams, a more streamlined on-call process is preferred.
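To make the first strategy concrete, here is a minimal Python sketch of that triage flow. The alert fields, the impact check, and the runbook registry are illustrative assumptions rather than any specific tool's API; in practice the escalation step would open a PagerDuty or Opsgenie incident.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    service: str
    severity: str

# Hypothetical registry of known remediation steps (runbooks) per alert.
RUNBOOKS = {
    "HighErrorRate-checkout": "restart the stuck consumers, then confirm the error rate drops",
}

def has_impact(alert: Alert) -> bool:
    """Placeholder impact check; in reality you would query business KPIs or synthetic probes."""
    return alert.severity in ("critical", "high")

def triage(alert: Alert) -> str:
    # 1. Assess the impact of the incident signalled by the alert.
    if not has_impact(alert):
        # 2. No impact: acknowledge and move on.
        return f"ACK: {alert.name} acknowledged, no customer impact"
    # 3. Impact confirmed: look for a known remediation process.
    runbook = RUNBOOKS.get(alert.name)
    if runbook:
        return f"REMEDIATE: {alert.name} -> {runbook}"
    # 4. No runbook available: escalate to the most suitable service owner,
    #    typically by creating a ticket/incident in PagerDuty or Opsgenie.
    return f"ESCALATE: page the on-call owner of {alert.service}"

print(triage(Alert("HighLatency-search", "search", "critical")))
```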
Diving deep into threshold management
Setting up alerts for every exception, metric deviation, and infrastructure component in a product with 10 or more microservices can result in a significant number of alerts, especially as product usage expands and fluctuates. Therefore, it is crucial to establish appropriate thresholds for each metric-based alert.
You can categorise all of your metrics into three buckets:
Stable but seasonal - The metric remains consistent but settles at a different average value at different hours or days of the week, following a repetitive pattern. This usually occurs with infrastructure components. In such cases, set different thresholds for different time ranges, each about 25% higher than the average value usually observed in that window.
Spiky & unpredictable - The metric fluctuates frequently but has no visible impact on the business. In these cases, use a p99 aggregation as the metric function and set the threshold about 25% above the p99 values observed over the past few weeks. This is common with service latencies.
Very steady with no deviations - These metrics can generally be left without an alert; they typically relate to database connections, disk space, etc. If desired, you can add an alert on the average value with a threshold about 20% higher than the usually observed value.
For error reporting, configure each new, previously unseen error to be reported as soon as it appears.
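As a rough illustration of how these thresholds could be derived from historical data, here is a short Python sketch. The 25% and 20% margins follow the guidance above; the sample data, window granularity, and function names are assumptions for the example.

```python
import statistics

def seasonal_threshold(window_values: list[float]) -> float:
    """Stable but seasonal: threshold ~25% above the average usually observed in that time window."""
    return statistics.mean(window_values) * 1.25

def spiky_threshold(latency_samples: list[float]) -> float:
    """Spiky & unpredictable: threshold ~25% above the approximate p99 of recent samples."""
    ordered = sorted(latency_samples)
    p99 = ordered[int(0.99 * (len(ordered) - 1))]
    return p99 * 1.25

def steady_threshold(values: list[float]) -> float:
    """Very steady metrics: optional threshold ~20% above the usually observed average."""
    return statistics.mean(values) * 1.20

# Example: a separate threshold per hour-of-day bucket for a seasonal metric (requests/sec).
history_by_hour = {9: [120.0, 130.0, 125.0], 21: [40.0, 45.0, 42.0]}
thresholds = {hour: seasonal_threshold(values) for hour, values in history_by_hour.items()}
print(thresholds)  # {9: 156.25, 21: ~52.9}
```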
Most tools today do not offer dynamic thresholds on metrics. Instead, they provide downtimes/mute hours as a feature. Using this, you can enable a threshold for only certain hours of the day. Hence, if you set up five different alerts for different time ranges, you effectively have different thresholds for different time ranges. Read more about it here (Scheduled downtime on Datadog, Custom times on New Relic).
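Vendor configuration formats differ, so here is the same idea expressed as a hand-rolled Python sketch: several thresholds keyed by hour ranges, where each rule behaves like a separate alert that is only active (unmuted) in its own window. The window boundaries and values are illustrative assumptions.

```python
from datetime import datetime

# Each rule acts like a separate alert that is only active in its own time window;
# together they emulate a time-dependent (pseudo-dynamic) threshold.
WINDOWED_THRESHOLDS = [
    {"active_hours": range(0, 8),   "threshold": 50.0},   # night: low traffic
    {"active_hours": range(8, 18),  "threshold": 150.0},  # business hours: peak traffic
    {"active_hours": range(18, 24), "threshold": 90.0},   # evening
]

def should_alert(metric_value: float, now: datetime) -> bool:
    for rule in WINDOWED_THRESHOLDS:
        if now.hour in rule["active_hours"] and metric_value > rule["threshold"]:
            return True
    return False

print(should_alert(120.0, datetime(2024, 5, 1, 3)))   # True: 120 > 50 in the night window
print(should_alert(120.0, datetime(2024, 5, 1, 10)))  # False: 120 < 150 during business hours
```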
Another alternative is anomaly detection (provided by both Datadog and New Relic), which allows alerts to be configured without specific thresholds. However, these anomaly detections are scoped to the individual metric and lack contextual linkage to other metrics.
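To show what alerting without a fixed threshold looks like conceptually, here is a bare-bones rolling z-score detector in Python. Datadog and New Relic use far more sophisticated seasonal models under the hood, so treat this only as an illustration of the idea, not of their implementations.

```python
import statistics
from collections import deque

class RollingAnomalyDetector:
    """Flags values that deviate more than z_limit standard deviations from the recent mean."""

    def __init__(self, window: int = 60, z_limit: float = 3.0):
        self.values = deque(maxlen=window)
        self.z_limit = z_limit

    def is_anomalous(self, value: float) -> bool:
        anomalous = False
        if len(self.values) >= 10:  # require some history before judging new points
            mean = statistics.mean(self.values)
            stdev = statistics.pstdev(self.values) or 1e-9  # avoid division by zero
            anomalous = abs(value - mean) / stdev > self.z_limit
        self.values.append(value)
        return anomalous

detector = RollingAnomalyDetector()
for latency_ms in [100, 101, 99, 102, 100, 98, 101, 100, 99, 100, 350]:
    if detector.is_anomalous(latency_ms):
        print(f"anomaly detected: {latency_ms} ms")  # fires for 350
```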
At Doctor Droid, we have a bot that helps improve alert operations by providing insights into coverage and noise -- it integrates with your existing observability and alerting stack. Check out the sample report and docs here.