Guide for New Relic Alerting

Siddarth Jain
Apr 2, 2024
10 min read

Introduction

New Relic Alerting is a powerful system that enables organizations to proactively monitor their applications, infrastructure, and digital ecosystems for issues that could affect performance or reliability.

By setting up alerts, you can be notified when certain thresholds are met or anomalies are detected, allowing your team to respond quickly and minimize downtime or service disruptions.

This guide will provide a comprehensive overview of New Relic's alerting capabilities, best practices for creating effective alerts, and how to leverage advanced features such as NerdGraph APIs, synthetic monitoring, and NRQL for dynamic baseline alerting.

Whether you are new to New Relic or looking to refine your alerting strategy, this guide will walk you through each step to ensure your alerts are well-configured and actionable.

💡 Pro Tip

While choosing the right monitoring tools is crucial, managing alerts across multiple tools can become overwhelming. Modern teams are using AI-powered platforms like Dr. Droid to automate cross-tool investigation and reduce alert fatigue.

How to Create Alert Configurations Using NerdGraph APIs

New Relic’s NerdGraph API allows you to interact programmatically with New Relic’s platform, giving you full control over your alerting configuration. Through NerdGraph, you can create, manage, and monitor alert conditions, policies, and notification channels more efficiently.

Whether you're automating workflows or integrating alert configurations into custom tools, NerdGraph provides the flexibility you need to set up alerts dynamically.

Steps to Create Alert Configurations Using NerdGraph APIs

Here’s a step-by-step guide to setting up alert configurations using the NerdGraph API:

1. Accessing NerdGraph Explorer

To begin, navigate to the NerdGraph Explorer, an interactive tool that lets you run GraphQL queries and mutations. This interface helps you understand how to structure your queries and gives you a real-time preview of the data and actions you can execute on the platform.

  • Visit the NerdGraph Explorer through the New Relic dashboard.
  • Ensure you have the proper access credentials (a user API key) to interact with the API.
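Before creating anything, it helps to confirm that the key works. A minimal sanity-check query you can paste into the NerdGraph Explorer (a sketch; it simply returns the user associated with the key):

```graphql
# Returns the user associated with the API key, confirming the key is
# valid and has access to NerdGraph.
{
  actor {
    user {
      name
      email
    }
  }
}
```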

2. Creating an Alert Policy

Alert policies group together multiple alert conditions. To create a new policy using NerdGraph, you will use a GraphQL mutation.
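Below is a minimal mutation sketch for creating a policy. The mutation and field names follow the public NerdGraph alerts schema, but verify them against the schema browser in the NerdGraph Explorer before running it; YOUR_ACCOUNT_ID is a placeholder.

```graphql
mutation {
  # Creates a new alert policy in the given account.
  alertsPolicyCreate(
    accountId: YOUR_ACCOUNT_ID
    policy: {
      name: "Example API-managed policy"
      incidentPreference: PER_CONDITION   # how incidents are grouped within the policy
    }
  ) {
    id
    name
    incidentPreference
  }
}
```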

Key Elements:

  • Replace YOUR_ACCOUNT_ID with your actual New Relic account ID.
  • incidentPreference can be set to PER_CONDITION, PER_POLICY, or PER_CONDITION_AND_TARGET, depending on how you want to structure incidents within your policy.

3. Creating Alert Conditions

Alert conditions specify the rules under which incidents will be triggered.

For instance, you can create conditions based on metrics, events, or NRQL queries.

Example: Setting up an alert condition based on NRQL to monitor CPU usage.
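A mutation sketch for a static NRQL condition that watches host CPU usage is shown below. It assumes the NerdGraph alertsNrqlConditionStaticCreate mutation and the Infrastructure SystemSample event type; YOUR_ACCOUNT_ID and YOUR_POLICY_ID are placeholders, and the field names should be checked against the current schema in the Explorer.

```graphql
mutation {
  # Opens an incident when average CPU usage stays above 80%
  # for five consecutive minutes.
  alertsNrqlConditionStaticCreate(
    accountId: YOUR_ACCOUNT_ID
    policyId: YOUR_POLICY_ID
    condition: {
      name: "High CPU usage"
      enabled: true
      nrql: {
        query: "SELECT average(cpuPercent) FROM SystemSample FACET hostname"
      }
      signal: {
        aggregationWindow: 60       # seconds of data aggregated per evaluation
      }
      terms: [{
        threshold: 80               # CPU percentage
        operator: ABOVE
        priority: CRITICAL
        thresholdDuration: 300      # condition must be violated for 5 minutes
        thresholdOccurrences: ALL
      }]
    }
  ) {
    id
    name
  }
}
```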

Key Elements:

  • query: The NRQL query that supplies the data for the condition.
  • threshold: The value at which the alert triggers.
  • operator: Whether the alert triggers when the value is ABOVE or BELOW the threshold.
  • thresholdDuration: How long (in seconds) the condition must be violated before an incident opens.
  • aggregationWindow: The time window (in seconds) over which data points are aggregated.

For additional details on creating conditions, you can refer to Create Alert Conditions.

4. Create Notification Channels

Once your policies and conditions are in place, you can set up notification channels (such as email, Slack, or PagerDuty) to receive alerts.

Mutation Example:
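A sketch of the classic notification-channel mutation is below, using an email channel as the example. Classic channels have been superseded by destinations and workflows on newer accounts, so treat this as illustrative and confirm which notification API your account uses; the channel name and email address are placeholders.

```graphql
mutation {
  # Creates a classic email notification channel.
  # Newer accounts should use the destinations/workflows APIs instead.
  alertsNotificationChannelCreate(
    accountId: YOUR_ACCOUNT_ID
    notificationChannel: {
      email: {
        name: "On-call email"
        emails: ["oncall@example.com"]
        includeJson: false
      }
    }
  ) {
    notificationChannel {
      id
      name
    }
  }
}
```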

Key Elements:

  • type: Notification channel type (Slack, email, etc.).
  • config: Configuration for the notification channel, such as the URL for webhooks or email addresses for email alerts.

5. Link Conditions to Notification Channels

After defining policies, conditions, and notification channels, the final step is to link them together.

Mutation Example:
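A sketch of the linking mutation, assuming the classic channel model; YOUR_ACCOUNT_ID, YOUR_POLICY_ID, and YOUR_CHANNEL_ID are placeholders, and the exact payload fields should be confirmed in the Explorer.

```graphql
mutation {
  # Routes incidents from the policy's conditions to the listed channels.
  alertsNotificationChannelsAddToPolicy(
    accountId: YOUR_ACCOUNT_ID
    policyId: YOUR_POLICY_ID
    notificationChannelIds: [YOUR_CHANNEL_ID]
  ) {
    notificationChannels {
      id
    }
    errors {
      description
    }
  }
}
```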

This mutation ensures that alerts generated from the specified conditions are sent to the designated notification channels.

Benefits of Using NerdGraph for Alerting:

  • Automation: Easily create, modify, and delete alerts programmatically across your applications and accounts.
  • Customization: Tailor alert policies and conditions to meet specific needs, ensuring your team receives the most relevant notifications.
  • Efficiency: Streamline the process of setting up alerts, reducing the time and effort required for manual configurations.

By using New Relic’s NerdGraph API, you can seamlessly automate the configuration of alerts, helping your teams stay proactive in monitoring and resolving issues efficiently.

How to Set Up Synthetic Monitoring Alert Conditions in New Relic

Synthetic monitoring in New Relic enables proactive monitoring of your website and APIs by simulating user behavior to detect performance issues before they affect real users. To ensure you are alerted about potential problems in your synthetic monitoring, you need to configure alert conditions tailored to your needs.

Steps to Set Up Synthetic Monitoring Alert Conditions

A synthetic monitor can be added to multiple alert policies and conditions.

Image: Example summary of a synthetic monitoring report

To add an existing monitor to an alert policy:

  1. Navigate to one.newrelic.com > Alerts > Alert policies.
  2. Use the search box or scroll through the list of existing alert policies to find the policy you want to add the monitor to.
  3. Open the chosen policy and click + New alert condition.
  4. Choose Use guided mode.
  5. Select Synthetic monitors and click Next.
  6. Pick your synthetic monitor, choose the metric to monitor, and click Next.
  7. Complete the remaining settings and click Next.
  8. Name the alert condition, adjust any optional settings, and click Save condition.
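If you prefer to manage the same condition programmatically, synthetic results are also queryable with NRQL via the SyntheticCheck event type, so the condition can be expressed as an NRQL alert instead. A sketch, assuming a monitor named 'Homepage ping' (replace it with your monitor's name):

```sql
// Counts failed synthetic checks for a specific monitor.
// Attach this query to an NRQL condition that triggers when the count is above 0.
SELECT count(*)
FROM SyntheticCheck
WHERE monitorName = 'Homepage ping' AND result = 'FAILED'
```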

By setting up synthetic monitoring alert conditions, you’ll be better equipped to detect and address performance issues before they affect your users, ensuring higher reliability and improved customer satisfaction.

How to Set Up NRQL for Dynamic Baseline Alerting

Dynamic baseline alerting in New Relic uses NRQL (New Relic Query Language) to set adaptive thresholds based on historical data trends. This ensures that alerts are only triggered when deviations are abnormal for that specific time and context rather than using a static threshold that might not account for natural fluctuations.

[Image Source](https://newrelic.com/blog/nerdlog/nrql-baseline-alerts-ga): Example screenshot of NRQL baseline alerts

Here's how you can set up NRQL queries for dynamic baseline alerting:

Step 1: Create an Alert Policy

  • In New Relic, go to Alerts, navigate to Alert policies, and click Create a policy.
  • Name your alert policy, and ensure it's linked to the relevant services or entities you wish to monitor.

Step 2: Add a New Alert Condition

  • Within the alert policy, click Add a condition.
  • Choose NRQL alert as the condition type, which allows you to define dynamic baseline alert conditions using custom NRQL queries.

Step 3: Write an NRQL Query for Your Baseline Condition

NRQL is highly flexible and allows you to query a variety of metrics. When creating dynamic baseline alerting, you’re querying specific metrics that matter most to your system, such as response time, error rates, throughput, etc.

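If you are configuring the condition through NerdGraph rather than the UI, the baseline condition can be created with a mutation like the sketch below. The mutation and fields follow the public alerts schema but should be verified in the Explorer; YOUR_ACCOUNT_ID and YOUR_POLICY_ID are placeholders, and the NRQL query is only an example (average web transaction response time per application).

```graphql
mutation {
  # Creates a baseline (dynamic-threshold) NRQL condition. The threshold
  # values are deviations from the predicted baseline, not absolute units.
  alertsNrqlConditionBaselineCreate(
    accountId: YOUR_ACCOUNT_ID
    policyId: YOUR_POLICY_ID
    condition: {
      name: "Response time deviates from baseline"
      enabled: true
      baselineDirection: UPPER_ONLY   # alert only on deviations above the baseline
      nrql: {
        query: "SELECT average(duration) FROM Transaction WHERE transactionType = 'Web' FACET appName"
      }
      signal: { aggregationWindow: 60 }
      terms: [{
        priority: CRITICAL
        operator: ABOVE
        threshold: 3
        thresholdDuration: 300
        thresholdOccurrences: ALL
      }]
    }
  ) {
    id
    name
    baselineDirection
  }
}
```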

Replace YOUR_ACCOUNT_ID (and YOUR_POLICY_ID) with your actual IDs, and adjust the NRQL query and threshold fields to suit your baseline condition.

Step 4: Select the Dynamic Baseline Feature

Once you’ve defined your NRQL query, you can set dynamic thresholds by leveraging New Relic’s Dynamic Baseline feature. This option allows you to set adaptive thresholds based on the historical data of the metric you’re querying.

  • After adding the NRQL query, scroll down to the Threshold type section.
  • Choose Baseline under threshold settings, which enables dynamic alerting based on the expected behavior of the metric.

Baseline conditions use historical data to create an expected range of values that fluctuates dynamically over time.

Step 5: Set Up Alert Thresholds

You can define how sensitive the dynamic baseline should be by setting anomalous behavior thresholds.

  • Critical Threshold: Set the condition that would trigger a critical alert. For example, you might want to trigger an alert if the response time exceeds the baseline by 10%.
  • Warning Threshold: Set a lower threshold for warnings, which could indicate potential issues but are not critical yet.

Step 6: Notification Channels

After setting up the dynamic baseline, choose how you want to be notified when the alert triggers. You can set up notification channels for:

  • Email
  • Slack
  • Webhook
  • PagerDuty
  • Custom integrations

Step 7: Save and Enable Your Alert Condition

  • Review your alert policy and ensure that the condition and dynamic baseline setup meet your requirements.
  • Click Save to activate the alert condition.

Best Practices for NRQL Baseline Alerting

  1. Monitor the Right Metrics: Ensure that your query targets the most critical metrics, such as error rates, throughput, or response times, for a more effective incident response.
  2. Avoid Over-Alerting: Dynamic baseline alerting prevents alert fatigue, as alerts are only triggered when behavior deviates significantly from the historical trend, reducing false positives.
  3. Tweak the Sensitivity: Customize the thresholds to ensure you strike the right balance between proactive monitoring and over-alerting.

By setting up NRQL for dynamic baseline alerting, you can achieve a more intelligent monitoring system that adapts to your infrastructure’s natural fluctuations, ensuring timely and relevant alerts.

How to Create Good Alerts

Creating effective alerts is essential for reducing incident noise, improving response times, and helping on-call engineers focus on real issues. Good alerts provide actionable information, are symptom-led, and guide teams toward quick resolutions.

Here's how you can ensure your alerts are effective:

1. Symptom-Led Alerts

Good alerts are based on observable symptoms, not just raw metrics. Instead of alerting on low-level technical details, focus on what those metrics mean for the system and the user experience. This makes it easier for engineers to quickly understand the severity of the problem and its potential impact.

For example:

  • Symptom-led alert: "Increased page load time detected for 5% of users."
  • Non-symptom-led alert: "CPU usage above 80%."

Symptom-led alerts ensure that the team focuses on issues affecting users rather than chasing down every metric fluctuation.
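To make the contrast concrete, here is a rough sketch of the two styles as NRQL signals (run separately), assuming the standard PageView (browser) and SystemSample (infrastructure) event types:

```sql
// Symptom-led: 95th-percentile page load time that users actually experience.
SELECT percentile(duration, 95) FROM PageView FACET pageUrl

// Non-symptom-led: raw host CPU, which may or may not affect anyone.
SELECT average(cpuPercent) FROM SystemSample FACET hostname
```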

2. Detailed Heading and Description

A good alert should have a clear, descriptive title and detailed explanation. The title should convey the problem at a glance, while the description should provide more context, such as what part of the system is impacted, how critical it is, and any relevant history of similar issues.

Best practices for titles and descriptions:

  • Title: Be concise but informative. For example, "Database response time exceeds 2 seconds for 10% of requests."
  • Description: Provide additional details, such as relevant metrics, when the issue started, and the potential impact on users or business operations. Include the exact service or component that is affected to help engineers triage quickly.

3. Contextual Runbooks

Every alert should be paired with a runbook or troubleshooting guide to help on-call engineers quickly address the problem. Contextual runbooks provide step-by-step instructions tailored to the specific incident, ensuring engineers know how to respond.

A contextual runbook should include:

  • Links to monitoring dashboards
  • Relevant logs or metrics to check
  • A list of common troubleshooting steps
  • Escalation paths and contacts if the issue cannot be resolved within a given timeframe
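In New Relic, the runbook link can travel with the condition itself via the runbookUrl field, so it appears in every notification. A sketch using the NerdGraph update mutation; the condition ID and wiki URL are placeholders, and the exact input fields should be verified in the Explorer.

```graphql
mutation {
  # Attaches a runbook link to an existing NRQL condition so that every
  # incident notification carries the troubleshooting guide with it.
  alertsNrqlConditionStaticUpdate(
    accountId: YOUR_ACCOUNT_ID
    id: YOUR_CONDITION_ID
    condition: {
      runbookUrl: "https://wiki.example.com/runbooks/high-cpu"
    }
  ) {
    id
    runbookUrl
  }
}
```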

By pairing alerts with contextual runbooks, you equip your team with the information they need to diagnose and resolve incidents faster, minimizing downtime and stress.

This approach to creating good alerts ensures they are clear, actionable, and helpful for the team managing incidents, leading to more efficient response and resolution times.

Examples of Bad Alerts for SRE or On-Call Engineers

Poorly configured alerts can lead to alert fatigue, wasted time, and confusion during critical incidents. These types of alerts often lack actionable information, are too noisy, or fail to provide context, making it difficult for Site Reliability Engineers (SREs) and on-call teams to respond effectively.

Here are some examples of bad alerts and why they are problematic:

1. Overly Generic Alerts

Bad Alert: "CPU usage high."

  • Problem: This alert is vague and lacks specificity. It doesn't indicate which server or service is experiencing the issue, what the threshold is, or why this matters. The alert provides no context about potential impact, leaving the engineer to investigate from scratch.

2. Threshold-Based Alerts Without Context

Bad Alert: "Memory usage exceeds 90%."

  • Problem: While memory usage can be a problem, this type of alert without context often leads to unnecessary noise. For example, memory-intensive applications may regularly hover around high memory usage without causing any issues. Without correlating the memory spike to actual performance degradation, this alert doesn't provide actionable insights.

3. Alert Storms from Multiple Sources

Bad Alert: "Multiple alerts for different metrics on the same service (CPU, Disk I/O, Network Traffic)."

  • Problem: Alert storms occur when multiple related alerts fire simultaneously without correlation. This overwhelms the on-call engineer, making it difficult to identify the root cause. The alerts may all stem from the same underlying issue, but the engineer has to sift through each alert to find it, wasting valuable time.

4. Non-Symptom-Based Alerts

Bad Alert: "Disk space at 80% capacity."

  • Problem: This alert doesn't indicate whether the disk space usage is actually causing any service degradation or impacting the end-user experience. Non-symptom-based alerts often cause teams to focus on system metrics without understanding the user impact, leading to unnecessary escalations.

5. Lack of Clear Description or Runbook

Bad Alert: "Error detected in Service X."

  • Problem: This alert lacks detailed information, such as the specific error or possible troubleshooting steps. Without a clear description or links to a runbook, the on-call engineer has to start from scratch, wasting valuable time in a critical situation.

6. Alerts Triggered Too Frequently (Noisy Alerts)

Bad Alert: "CPU usage spiked to 75% (triggering every 2 minutes)."

  • Problem: Frequent alerts for minor, transient issues lead to alert fatigue. If engineers are bombarded with alerts for every small fluctuation, they may start ignoring or missing critical ones. Alerts should only trigger when the issue persists or is likely to cause significant impact.

7. Non-Actionable Alerts

Bad Alert: "Service XYZ reached 500 requests per second."

  • Problem: This alert provides no indication of whether this is within normal operating parameters or a sign of an issue. Non-actionable alerts that lack guidance or next steps cause confusion, leading engineers to ignore them or waste time investigating non-issues.

Why Bad Alerts Are Harmful:

  • Alert Fatigue: Engineers are bombarded with too many meaningless or vague alerts, leading to desensitization or missed critical alerts.
  • Time Waste: Engineers spend time investigating alerts that don’t have enough information or aren’t actually indicative of problems.
  • Delayed Response: Poorly configured alerts slow down the incident response process by requiring more manual investigation and analysis to understand the actual problem.

Ensuring your alerts are clear, actionable, and symptom-led can dramatically improve the efficiency and effectiveness of incident response teams.

Examples of Good Alerts for SRE or On-Call Engineers in New Relic

Good alerts are designed to notify SRE or on-call engineers only when there is a significant issue that requires action. Effective alerts should be actionable, context-rich, and tailored to the system's unique operational needs.

Here are some characteristics of good alerts in New Relic:

  1. Threshold-based CPU Usage Alert: If the CPU usage of a server exceeds 80% for more than 5 minutes, it could indicate an issue that needs attention. This alert provides a meaningful threshold to act on, avoiding noise from short-lived spikes.
  2. Database Latency Alert: Trigger an alert if database query response times exceed 300ms for more than 10 consecutive minutes. This alert helps engineers address slow performance before users are impacted.
  3. Service-Level Objective (SLO) Violation Alert: Set alerts based on your SLOs, such as a 99.9% uptime goal. An alert can fire if the service availability drops below 99.9% for a particular time window.
  4. Transaction Error Rate Alert: If a particular web transaction starts failing (e.g., error rate > 2% for 5 minutes), this type of alert informs the team that a specific service is experiencing issues (see the NRQL sketch below this list).
  5. Memory Leak Detection Alert: A sustained increase in memory usage over time (without recovery) might indicate a memory leak. If memory usage grows consistently for 30 minutes or more, alert the SRE team.
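For example, the transaction error-rate alert (item 4 above) maps naturally to an NRQL signal; a sketch assuming the standard Transaction event type, with the 2% / 5-minute threshold configured on the condition itself:

```sql
// Percentage of failed web transactions per application.
SELECT percentage(count(*), WHERE error IS true)
FROM Transaction
WHERE transactionType = 'Web'
FACET appName
```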

Examples of Bad Alerts for SRE or On-Call Engineers in New Relic

Bad alerts often contribute to alert fatigue, overwhelm on-call engineers, and distract teams from focusing on the real issues. These alerts are generally too frequent, not actionable, or provide insufficient context.

Below are examples of what constitutes bad alerts:

  1. Short-lived CPU Spike Alerts: Alerting every time CPU usage spikes above 50% for a few seconds leads to constant noise, as short-term CPU spikes are often harmless and expected during normal operation.
  2. High-Volume, Non-Actionable Logs: If an alert is set to trigger every time a log message appears, regardless of severity or error type, it floods the on-call engineer’s alert system without providing meaningful insight.
  3. Overly Broad Network Latency Alerts: An alert for “network latency is higher than usual” without specifying which service, endpoint, or region is affected provides no actionable information and often results in confusion.
  4. Non-Critical Disk Usage Alerts: An alert for disk usage reaching 70% capacity may not require immediate attention. If there are frequent alerts for non-critical disk space levels, they can be ignored, creating a habit of disregarding more critical alerts.
  5. Every Minor Error Alert: Setting an alert for every HTTP 404 or 500 error without considering the overall error rate or context is a common mistake. A few minor errors are often part of normal system behavior.

Conclusion

Effective alerting is essential for ensuring timely incident response and minimizing downtime. New Relic offers a comprehensive suite of alerting tools that allow you to monitor your systems, set dynamic thresholds, and automate responses.

By leveraging key features such as NRQL-based dynamic alerting and synthetic monitoring, you can create more intelligent and actionable alerts that reduce noise and improve incident resolution.

Doctor Droid PlayBooks takes incident management a step further by integrating dynamic alerts, contextual investigations, and seamless automation. With Doctor Droid, teams can configure alerts that lead directly to actionable playbooks, ensuring that responses are swift, informed, and effective.

Visit our website now!
