Production-Ready Kubernetes Monitoring with Prometheus: Alert Templates That Work

Monitoring Kubernetes at scale is critical for reliability and uptime. Prometheus offers powerful metric collection and alerting capabilities, but crafting effective alerts is challenging. The open-source template from DrDroidLab provides a curated set of production-grade alert rules for Kubernetes, helping SREs and platform engineers detect issues early and respond quickly. This post breaks down the key alert rules, explains how to use them, and offers tuning guidance for adapting them to your Kubernetes workloads.

Core Alert Rules

KubePodCrashLooping

Pod crash loop detection

sum(increase(kube_pod_container_status_restarts_total[15m])) by (pod, namespace) > 5

Why this matters

Fires when a pod is crashing frequently (more than 5 restarts in 15 minutes), catching applications stuck in a CrashLoopBackOff state.

Tuning tips

Adjust the threshold (e.g., 5 restarts) and the time window (15m) to match the expected stability of your workloads: more tolerant for volatile development pods, stricter for production services.
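
As a sketch of how this expression might sit in a Prometheus rule file, the fragment below wraps it in a rule group. The group name, the for duration, the severity label, and the annotation text are illustrative assumptions, not values taken from the template.

```yaml
groups:
  - name: kubernetes-pods            # hypothetical group name
    rules:
      - alert: KubePodCrashLooping
        # Same expression as above; the threshold (5) and window (15m) are the tuning knobs.
        expr: sum(increase(kube_pod_container_status_restarts_total[15m])) by (pod, namespace) > 5
        for: 5m                      # assumed: condition must hold this long before firing
        labels:
          severity: critical         # assumed label; align with your Alertmanager routes
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```

A development variant could raise the threshold or lengthen the window in a separate rule scoped to non-production namespaces.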

KubePodNotReady

Pod readiness check

kube_pod_status_ready{condition="false"} == 1

Why this matters

Detects pods that are not in a ready state, indicating they're unhealthy or failing readiness checks.

Tuning tips

Consider filtering by namespace or critical workload labels to reduce noise from less critical pods.
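
One way to apply this filtering is a label matcher on the expression itself. In the sketch below, the namespace pattern, the group name, and the for duration are hypothetical placeholders to replace with values that fit your cluster.

```yaml
groups:
  - name: kubernetes-pod-readiness   # hypothetical group name
    rules:
      - alert: KubePodNotReady
        # namespace=~"prod-.*" is an assumed naming convention for critical namespaces.
        expr: kube_pod_status_ready{condition="false", namespace=~"prod-.*"} == 1
        for: 10m                     # assumed: ignore pods that become ready within 10 minutes
        labels:
          severity: warning          # assumed label
```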

KubePersistentVolumeFullInFourDays

Storage capacity warning

(kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) < 0.15

Why this matters

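Warns when a persistent volume is running low on space (less than 15% of its capacity available). Despite the alert's name, this expression is a static threshold rather than a four-day forecast.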

Tuning tips

You may want to forecast available space over time, e.g., by estimating consumption rate, to catch storage issues earlier.
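
A forecast of this kind can be sketched with PromQL's predict_linear function, which extrapolates a gauge linearly from a recent window. The 6h lookback window, the combination with the 15% threshold, and the group name are assumptions for illustration; the template's own rule may differ.

```yaml
groups:
  - name: kubernetes-storage         # hypothetical group name
    rules:
      - alert: KubePersistentVolumeFullInFourDays
        # Fires only if the volume is already below 15% free AND a linear
        # extrapolation of the last 6h predicts zero free bytes within four days.
        expr: |
          (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) < 0.15
          and
          predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0
        for: 1h                      # assumed: smooth out short write bursts
        labels:
          severity: warning          # assumed label
```

Requiring both conditions keeps large, slowly filling volumes from paging anyone while still catching small volumes that are filling quickly.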

KubeCPUOvercommit

CPU overcommitment check

sum(kube_resourcequota{type="hard", resource="limits.cpu"}) > sum(kube_node_status_allocatable_cpu_cores)

Why this matters

Checks if CPU limits set in namespaces exceed the cluster's allocatable CPU, flagging potential overcommitment.

Tuning tips

This is a conservative check; validate it against actual usage and consider relaxing it in high-density environments that operate close to full utilization.
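
The sketch below shows one possible variant for clusters where node allocatable CPU is exposed as kube_node_status_allocatable with a resource label (the form used by kube-state-metrics v2 and later) instead of kube_node_status_allocatable_cpu_cores. Check which metric your installation actually exports before adopting it; the group name and for duration are assumptions.

```yaml
groups:
  - name: kubernetes-resources       # hypothetical group name
    rules:
      - alert: KubeCPUOvercommit
        # Fires when the CPU limits granted by namespace quotas exceed
        # the CPU the nodes can actually allocate.
        expr: |
          sum(kube_resourcequota{type="hard", resource="limits.cpu"})
            >
          sum(kube_node_status_allocatable{resource="cpu"})
        for: 15m                     # assumed
        labels:
          severity: warning          # assumed label
```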

KubeMemoryOvercommit

Memory overcommitment check

sum(kube_resourcequota{type="hard", resource="limits.memory"}) > sum(kube_node_status_allocatable_memory_bytes)

Why this matters

Similar to the CPU overcommit alert but checks for memory resource claims exceeding available node memory.

Tuning tips

For burstable workloads, some overcommitment is acceptable. Consider expressing the check as a ratio of memory limits to allocatable memory and tuning it to fire as that ratio approaches 90–100%, to avoid unnecessary noise.
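
Expressed as a ratio, the check might look like the following sketch. The 0.9 cutoff is one reading of the 90–100% guidance above, and the group name and for duration are assumptions; pick the ratio that matches how much overcommitment your workloads tolerate.

```yaml
groups:
  - name: kubernetes-resources       # hypothetical group name
    rules:
      - alert: KubeMemoryOvercommit
        # Ratio of memory limits granted by quotas to allocatable node memory.
        # 0.9 is an assumed cutoff; raise it above 1 to permit deliberate overcommitment.
        expr: |
          sum(kube_resourcequota{type="hard", resource="limits.memory"})
            /
          sum(kube_node_status_allocatable_memory_bytes)
            > 0.9
        for: 15m                     # assumed
        labels:
          severity: warning          # assumed label
```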

KubeNodeNotReady

Node availability check

kube_node_status_condition{condition="Ready", status="true"} == 0

Why this matters

Triggers when a Kubernetes node is not in a Ready state, potentially due to system failure or network issues.

Tuning tips

Set up a for-duration (e.g., 5m) to prevent transient Ready state drops from triggering alerts. Prioritize nodes in production environments.
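
A minimal sketch of the for-duration suggestion follows. The 5m value comes from the tip above; the group name, severity label, and annotation wording are assumptions.

```yaml
groups:
  - name: kubernetes-nodes           # hypothetical group name
    rules:
      - alert: KubeNodeNotReady
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 5m                      # node must stay NotReady for 5 minutes before the alert fires
        labels:
          severity: critical         # assumed label
        annotations:
          summary: "Node {{ $labels.node }} has not been Ready for 5 minutes"
```

With a for clause, Prometheus keeps the alert in a pending state until the condition has held continuously for the given duration, so a brief kubelet heartbeat gap does not page anyone.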

Quick Setup

Clone the repository from https://github.com/DrDroidLab/prometheus-alert-templates.

Navigate to the 'kubernetes' directory and include the alert rule files in your Prometheus rule configuration under 'rule_files'.

Reload Prometheus to apply the new rules. Ensure Alertmanager is configured to receive and route alerts accordingly.
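
The steps above could translate into a prometheus.yml fragment along these lines. The rules path, the file glob, and the Alertmanager address are hypothetical and depend on where you copied the files and how your deployment is laid out.

```yaml
# prometheus.yml (fragment)
rule_files:
  # Hypothetical location of the rule files copied from the repository's kubernetes directory.
  - /etc/prometheus/rules/kubernetes/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093      # hypothetical Alertmanager address
```

Before reloading, promtool check rules can validate the rule files. A reload can then be triggered by sending SIGHUP to the Prometheus process or, if Prometheus was started with --web.enable-lifecycle, by an HTTP POST to /-/reload.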

Ready to Get Started?

Get started with production-ready Kubernetes monitoring by integrating the alert templates from DrDroidLab: https://github.com/DrDroidLab/prometheus-alert-templates/blob/master/kubernetes
