New Relic Alerting is a powerful system that enables organizations to proactively monitor their applications, infrastructure, and digital ecosystems for issues that could affect performance or reliability.
By setting up alerts, you can be notified when certain thresholds are met or anomalies are detected, allowing your team to respond quickly and minimize downtime or service disruptions.
This guide will provide a comprehensive overview of New Relic's alerting capabilities, best practices for creating effective alerts, and how to leverage advanced features such as NerdGraph APIs, synthetic monitoring, and NRQL for dynamic baseline alerting.
Whether you are new to New Relic or looking to refine your alerting strategy, this guide will walk you through each step to ensure your alerts are well-configured and actionable.
New Relic’s NerdGraph API allows you to interact programmatically with New Relic’s platform, giving you full control over your alerting configuration. Through NerdGraph, you can create, manage, and monitor alert conditions, policies, and notification channels more efficiently.
Whether you're automating workflows or integrating alert configurations into custom tools, NerdGraph provides the flexibility you need to set up alerts dynamically.
Here’s a step-by-step guide to setting up alert configurations using the NerdGraph API:
To begin, navigate to the NerdGraph Explorer, an interactive tool that lets you run GraphQL queries and mutations. This interface helps you understand how to structure your queries and gives you a real-time preview of the data and actions you can execute on the platform.
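For example, a quick sanity check before writing any mutations is to run a small read-only query in the explorer. The sketch below uses the standard actor fields to confirm which user and accounts your API key can see:

```graphql
# Returns the current user and the accounts this API key can access
{
  actor {
    user {
      name
      email
    }
    accounts {
      id
      name
    }
  }
}
```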
Alert policies group together multiple alert conditions. To create a new policy using NerdGraph, you will use a GraphQL mutation.
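A minimal sketch of that mutation is shown below, using the documented alertsPolicyCreate field. Replace YOUR_ACCOUNT_ID with your account ID; the policy name is illustrative.

```graphql
mutation {
  alertsPolicyCreate(
    accountId: YOUR_ACCOUNT_ID
    policy: {
      name: "Production latency policy"   # illustrative name
      incidentPreference: PER_CONDITION   # how incidents are rolled up
    }
  ) {
    id
    name
    incidentPreference
  }
}
```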
Key elements: the account ID the policy belongs to, the policy name, and the incident preference (PER_POLICY, PER_CONDITION, or PER_CONDITION_AND_TARGET), which controls how incidents are grouped.
Alert conditions specify the rules under which incidents will be triggered.
For instance, you can create conditions based on metrics, events, or NRQL queries.
Example: Setting up an alert condition based on NRQL to monitor CPU usage.
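A sketch of that condition, using the alertsNrqlConditionStaticCreate mutation against the Infrastructure SystemSample event. Placeholders and threshold values are illustrative; verify the exact fields against the schema in the explorer.

```graphql
mutation {
  alertsNrqlConditionStaticCreate(
    accountId: YOUR_ACCOUNT_ID
    policyId: YOUR_POLICY_ID
    condition: {
      name: "High CPU usage"
      enabled: true
      nrql: {
        # Average CPU per host, as reported by the Infrastructure agent
        query: "SELECT average(cpuPercent) FROM SystemSample FACET hostname"
      }
      terms: {
        operator: ABOVE
        threshold: 90              # percent CPU
        thresholdDuration: 300     # sustained for 5 minutes
        thresholdOccurrences: ALL
        priority: CRITICAL
      }
      violationTimeLimitSeconds: 3600
    }
  ) {
    id
    name
  }
}
```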
Key elements: the ID of the policy the condition belongs to, the NRQL query that defines the signal, and the threshold terms (operator, threshold value, duration, and priority) that determine when an incident opens.
For additional details on creating conditions, you can refer to Create Alert Conditions.
Once your policies and conditions are in place, you can set up notification channels (such as email, Slack, or PagerDuty) to receive alerts.
Mutation Example:
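The exact mutation depends on how your account handles notifications. The sketch below assumes the classic alertsNotificationChannelCreate mutation with an email channel; newer accounts that use workflows and destinations go through the aiNotifications mutations instead, so check the schema in the explorer before running it.

```graphql
mutation {
  alertsNotificationChannelCreate(
    accountId: YOUR_ACCOUNT_ID
    notificationChannel: {
      email: {
        name: "On-call email"            # illustrative channel name
        emails: ["oncall@example.com"]   # illustrative recipient
        includeJson: false
      }
    }
  ) {
    notificationChannel {
      ... on AlertsEmailNotificationChannel {
        id
        name
      }
    }
    error {
      description
    }
  }
}
```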
Key elements: the channel type (email, Slack, PagerDuty, and so on), a descriptive channel name, and the type-specific settings such as recipient addresses or integration keys.
After defining policies, conditions, and notification channels, the final step is to link them together.
Mutation Example:
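A sketch using the classic alertsNotificationChannelsAddToPolicy mutation; the channel ID comes from the previous step. As above, accounts that use workflows and destinations associate policies differently.

```graphql
mutation {
  alertsNotificationChannelsAddToPolicy(
    accountId: YOUR_ACCOUNT_ID
    policyId: YOUR_POLICY_ID
    notificationChannelIds: [YOUR_CHANNEL_ID]
  ) {
    notificationChannels {
      id
      name
    }
    errors {
      description
    }
  }
}
```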
This mutation ensures that alerts generated from the specified conditions are sent to the designated notification channels.
By using New Relic’s NerdGraph API, you can seamlessly automate the configuration of alerts, helping your teams stay proactive in monitoring and resolving issues efficiently.
Synthetic monitoring in New Relic enables proactive monitoring of your website and APIs by simulating user behavior to detect performance issues before they affect real users. To ensure you are alerted about potential problems in your synthetic monitoring, you need to configure alert conditions tailored to your needs.
A synthetic monitor can be added to multiple alert policies and conditions.
Image source: example of a synthetic monitoring summary report.
To add an existing monitor to an alert policy, create a synthetic alert condition inside that policy and select the monitor it should watch; you can start either from the policy itself or from the monitor's alert settings.
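If you would rather express this as an NRQL condition (for example, via the NerdGraph mutations shown earlier), one sketch is to alert when a specific monitor records failed checks. The monitor name below is a placeholder, and the query assumes the monitor reports to the standard SyntheticCheck event type:

```
SELECT count(*)
FROM SyntheticCheck
WHERE monitorName = 'Checkout page ping' AND result = 'FAILED'
```

Paired with a static threshold such as "above 0 for 5 minutes", this query opens an incident after repeated check failures.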
By setting up synthetic monitoring alert conditions, you’ll be better equipped to detect and address performance issues before they affect your users, ensuring higher reliability and improved customer satisfaction.
Dynamic baseline alerting in New Relic uses NRQL (New Relic Query Language) to set adaptive thresholds based on historical data trends. This ensures that alerts are only triggered when deviations are abnormal for that specific time and context rather than using a static threshold that might not account for natural fluctuations.
[Image Source](https://newrelic.com/blog/nerdlog/nrql-baseline-alerts-ga): Example screenshot of NRQL baseline alerts
Here's how you can set up NRQL queries for dynamic baseline alerting:
Step 1: Create an Alert Policy. Create the policy that will hold your baseline condition, either in the New Relic UI or with the alertsPolicyCreate mutation shown earlier.
Step 2: Add a New Alert Condition. Add a new condition to that policy and choose NRQL as the condition type.
Step 3: Write an NRQL Query for Your Baseline Condition
NRQL is highly flexible and allows you to query a variety of metrics. When creating dynamic baseline alerting, you’re querying specific metrics that matter most to your system, such as response time, error rates, throughput, etc.
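The sketch below uses the alertsNrqlConditionBaselineCreate mutation to alert when response time rises abnormally above its learned baseline. Names, the app filter, and threshold values are illustrative; verify the exact fields against the schema in the explorer.

```graphql
mutation {
  alertsNrqlConditionBaselineCreate(
    accountId: YOUR_ACCOUNT_ID
    policyId: YOUR_POLICY_ID
    condition: {
      name: "Abnormal response time"
      enabled: true
      baselineDirection: UPPER_ONLY   # only alert on upward deviations
      nrql: {
        query: "SELECT average(duration) FROM Transaction WHERE appName = 'YOUR_APP_NAME'"
      }
      terms: {
        operator: ABOVE
        threshold: 3               # deviation from the baseline required to open an incident
        thresholdDuration: 300     # sustained for 5 minutes
        thresholdOccurrences: ALL
        priority: CRITICAL
      }
      violationTimeLimitSeconds: 3600
    }
  ) {
    id
    name
  }
}
```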
Replace "YOUR_ACCOUNT_ID" with your actual account ID and adjust the NRQL query and other fields to suit your baseline condition needs.
Step 4: Select the Dynamic Baseline Feature
Once you’ve defined your NRQL query, you can set dynamic thresholds by leveraging New Relic’s Dynamic Baseline feature. This option allows you to set adaptive thresholds based on the historical data of the metric you’re querying.
Baseline conditions use historical data to create an expected range of values that fluctuates dynamically over time.
You can define how sensitive the dynamic baseline should be by setting anomalous behavior thresholds.
After setting up the dynamic baseline, choose how you want to be notified when the alert triggers. You can route the policy to notification channels such as email, Slack, or PagerDuty, using the same channel setup described in the NerdGraph section above.
By setting up NRQL for dynamic baseline alerting, you can achieve a more intelligent monitoring system that adapts to your infrastructure’s natural fluctuations, ensuring timely and relevant alerts.
Creating effective alerts is essential for reducing incident noise, improving response times, and helping on-call engineers focus on real issues. Good alerts provide actionable information, are symptom-led, and guide teams toward quick resolutions.
Here's how you can ensure your alerts are effective:
Good alerts are based on observable symptoms, not just raw metrics. Instead of alerting on low-level technical details, focus on what those metrics mean for the system and the user experience. This makes it easier for engineers to quickly understand the severity of the problem and its potential impact.
For example, alert on "checkout response time above 2 seconds for 5 minutes" rather than "CPU at 90% on host-123": the first describes what users are experiencing, while the second only hints at a possible cause.
Symptom-led alerts ensure that the team focuses on issues affecting users rather than chasing down every metric fluctuation.
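As a concrete shape for a symptom-led signal, you might alert on the error rate users actually experience rather than on a host-level metric. The NRQL below is illustrative and assumes a standard APM Transaction event stream; the app name is a placeholder:

```
SELECT percentage(count(*), WHERE error IS true)
FROM Transaction
WHERE appName = 'checkout-service'
```

Paired with a threshold like "above 5% for 5 minutes", this fires on degraded user experience rather than on every CPU fluctuation.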
A good alert should have a clear, descriptive title and detailed explanation. The title should convey the problem at a glance, while the description should provide more context, such as what part of the system is impacted, how critical it is, and any relevant history of similar issues.
Best practices for titles and descriptions: name the affected service and the symptom in the title, quantify the deviation (for example, "Checkout API error rate above 5% for 10 minutes"), and use the description for scope, user impact, links to relevant dashboards, and any history of similar incidents.
Every alert should be paired with a runbook or troubleshooting guide to help on-call engineers quickly address the problem. Contextual runbooks provide step-by-step instructions tailored to the specific incident, ensuring engineers know how to respond.
A contextual runbook should include: the first diagnostic steps to run, the dashboards and logs to check, known causes and their fixes, and who to escalate to if the issue cannot be resolved quickly.
By pairing alerts with contextual runbooks, you equip your team with the information they need to diagnose and resolve incidents faster, minimizing downtime and stress.
This approach to creating good alerts ensures they are clear, actionable, and helpful for the team managing incidents, leading to more efficient response and resolution times.
Poorly configured alerts can lead to alert fatigue, wasted time, and confusion during critical incidents. These types of alerts often lack actionable information, are too noisy, or fail to provide context, making it difficult for site reliability engineers (SREs) and on-call teams to respond effectively.
Here are some examples of bad alerts and why they are problematic:
Bad Alert: "CPU usage high."
Bad Alert: "Memory usage exceeds 90%."
Bad Alert: "Multiple alerts for different metrics on the same service (CPU, Disk I/O, Network Traffic)."
Bad Alert: "Disk space at 80% capacity."
Bad Alert: "Error detected in Service X."
Bad Alert: "CPU usage spiked to 75% (triggering every 2 minutes)."
Bad Alert: "Service XYZ reached 500 requests per second."
Ensuring your alerts are clear, actionable, and symptom-led can dramatically improve the efficiency and effectiveness of incident response teams.
Good alerts are designed to notify SREs or on-call engineers only when there is a significant issue that requires action. Effective alerts in New Relic are actionable, context-rich, and tailored to the system's unique operational needs: they are symptom-led, carry clear titles and descriptions, link to contextual runbooks, and use thresholds (static or dynamic baselines) that reflect how the system actually behaves.
Bad alerts, by contrast, contribute to alert fatigue, overwhelm on-call engineers, and distract teams from real issues. They are generally too frequent, not actionable, or short on context, as the examples in the previous section illustrate.
Effective alerting is essential for ensuring timely incident response and minimizing downtime. New Relic offers a comprehensive suite of alerting tools that allow you to monitor your systems, set dynamic thresholds, and automate responses.
By leveraging key features such as NRQL-based dynamic alerting and synthetic monitoring, you can create more intelligent and actionable alerts that reduce noise and improve incident resolution.
Doctor Droid PlayBooks takes incident management a step further by integrating dynamic alerts, contextual investigations, and seamless automation. With Doctor Droid, teams can configure alerts that lead directly to actionable playbooks, ensuring that responses are swift, informed, and effective.