Production-Ready Kubernetes Monitoring with Prometheus: Alert Templates That Work

Monitoring Kubernetes at scale is critical for reliability and uptime. Prometheus offers powerful metric collection and alerting capabilities, but crafting effective alerts is challenging. The open-source template from DrDroidLab provides a curated set of production-grade alert rules for Kubernetes, helping SREs and platform engineers detect issues early and respond quickly. This post breaks down the key alert rules, explains how to use them, and offers tuning guidance for adapting them to your Kubernetes workloads.

Core Alert Rules

KubePodCrashLooping
sum(increase(kube_pod_container_status_restarts_total[15m])) by (pod, namespace) > 5
Why this matters
Fires when a pod is crashing frequently (more than 5 restarts in 15 minutes), catching applications stuck in a CrashLoopBackOff state.
Tuning tips
Adjust the restart threshold (5) and the time window (15m) to match the expected stability of your workloads: be more tolerant for volatile development pods and more aggressive for production services.
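As a rough sketch of that tuning, the rule file below splits the alert into a stricter production variant and a more tolerant development variant. The group name, the namespace patterns (prod-.* and dev-.*), the thresholds, and the severity labels are illustrative assumptions, not values taken from the DrDroidLab template.

```yaml
groups:
  - name: kubernetes-pod-restarts   # hypothetical group name
    rules:
      # Stricter variant for production namespaces (the pattern is an assumption).
      - alert: KubePodCrashLooping
        expr: |
          sum by (pod, namespace) (
            increase(kube_pod_container_status_restarts_total{namespace=~"prod-.*"}[15m])
          ) > 3
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in 15m"
      # More tolerant variant for development namespaces.
      - alert: KubePodCrashLooping
        expr: |
          sum by (pod, namespace) (
            increase(kube_pod_container_status_restarts_total{namespace=~"dev-.*"}[30m])
          ) > 10
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 10 times in 30m"
```

Keeping the same alert name with different severity labels lets Alertmanager route the two variants separately.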
KubePodNotReady
kube_pod_status_ready{condition="false"} == 1
Why this matters
Detects pods that are not in a ready state, indicating they're unhealthy or failing readiness checks.
Tuning tips
Consider filtering by namespace or critical workload labels to reduce noise from less critical pods.
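One way to apply that filtering is sketched below. The namespace names (payments, checkout), the 10m hold time, and the severity label are hypothetical placeholders; substitute whatever identifies critical workloads in your cluster.

```yaml
groups:
  - name: kubernetes-pod-readiness   # hypothetical group name
    rules:
      - alert: KubePodNotReady
        # Only consider pods in namespaces treated as critical (names are assumptions).
        expr: kube_pod_status_ready{condition="false", namespace=~"payments|checkout"} == 1
        # Require the condition to hold for a while so brief rollouts do not page anyone.
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has not been ready for 10 minutes"
```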
KubePersistentVolumeFullInFourDays
(kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) < 0.15
Why this matters
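Warns when a persistent volume has less than 15% of its capacity available. Despite the alert's name, this expression is a static threshold and does not forecast when the volume will fill.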
Tuning tips
You may want to forecast available space over time, e.g., by estimating consumption rate, to catch storage issues earlier.
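A forecasting variant could look like the sketch below, which combines the 15% threshold with PromQL's predict_linear function to estimate whether available space will reach zero within four days. The 6h look-back window, the 1h hold time, and the severity label are assumptions chosen for illustration, not values from the template.

```yaml
groups:
  - name: kubernetes-storage   # hypothetical group name
    rules:
      - alert: KubePersistentVolumeFullInFourDays
        expr: |
          (
            kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.15
          )
          and
          (
            # Linear extrapolation of the last 6h, projected 4 days (in seconds) ahead.
            predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0
          )
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Volume {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is predicted to fill within four days"
```

Requiring both conditions avoids alerting on volumes that are filling quickly but still have plenty of headroom.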
KubeCPUOvercommit
sum(kube_resourcequota{type="hard", resource="limits.cpu"}) > sum(kube_node_status_allocatable_cpu_cores)
Why this matters
Checks whether the CPU limits granted through namespace resource quotas add up to more than the cluster's allocatable CPU, flagging potential overcommitment.
Tuning tips
This is a conservative check; validate it against actual usage and consider relaxing it in high-density environments that operate close to full utilization.
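To validate the quota-based check against actual usage, one option is a recording rule that tracks cluster-wide CPU utilization as a fraction of allocatable CPU. This is only a sketch: the recorded series name is hypothetical, and it assumes cAdvisor's container_cpu_usage_seconds_total metric is being scraped.

```yaml
groups:
  - name: kubernetes-cpu-usage   # hypothetical group name
    rules:
      # Fraction of allocatable CPU actually in use over the last 5 minutes.
      - record: cluster:cpu_usage:ratio   # hypothetical series name
        expr: |
          sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
            /
          sum(kube_node_status_allocatable_cpu_cores)
        # Note: kube-state-metrics v2 and later expose allocatable CPU as
        # kube_node_status_allocatable{resource="cpu"} instead.
```

If this ratio stays low while the overcommit alert fires, the quota check is probably stricter than the cluster needs.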
KubeMemoryOvercommit
sum(kube_resourcequota{type="hard", resource="limits.memory"}) > sum(kube_node_status_allocatable_memory_bytes)
Why this matters
Similar to the CPU overcommit alert but checks for memory resource claims exceeding available node memory.
Tuning tips
For burstable workloads, some overcommitment is acceptable. Compare quota to allocatable memory as a ratio and tune the threshold, for example to fire at around 90–100% of allocatable memory, to avoid unnecessary noise.
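Expressing the check as a ratio makes the threshold easy to tune. The sketch below fires when memory-limit quotas reach 90% of allocatable memory; the 0.9 value, the 5m hold time, and the severity label are assumptions, and clusters that deliberately overcommit may prefer a value above 1.0.

```yaml
groups:
  - name: kubernetes-overcommit   # hypothetical group name
    rules:
      - alert: KubeMemoryOvercommit
        expr: |
          sum(kube_resourcequota{type="hard", resource="limits.memory"})
            /
          sum(kube_node_status_allocatable_memory_bytes)
            > 0.9
        # Note: kube-state-metrics v2 and later expose allocatable memory as
        # kube_node_status_allocatable{resource="memory"} instead.
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory limit quotas exceed 90% of allocatable cluster memory"
```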
KubeNodeNotReady
kube_node_status_condition{condition="Ready", status="true"} == 0
Why this matters
Triggers when a Kubernetes node is not in a Ready state, potentially due to system failure or network issues.
Tuning tips
Add a for duration (e.g., for: 5m) so that transient drops in the Ready condition do not trigger alerts. Prioritize nodes in production environments.
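A minimal sketch of the rule with a hold time is shown below. The 5m value comes from the tip above; the group name, severity label, and annotation text are assumptions.

```yaml
groups:
  - name: kubernetes-nodes   # hypothetical group name
    rules:
      - alert: KubeNodeNotReady
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        # The alert stays pending until the node has been not ready for 5 minutes.
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} has not been Ready for 5 minutes"
```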

Quick Setup

1. Clone the repository from https://github.com/DrDroidLab/prometheus-alert-templates.
2. Navigate to the 'kubernetes' directory and include the alert rule files in your Prometheus configuration under 'rule_files'.
3. Reload Prometheus to apply the new rules, and make sure Alertmanager is configured to receive and route the resulting alerts.
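The configuration side of these steps might look like the fragment below. The rule file path, the evaluation interval, and the Alertmanager address are placeholders for illustration; adjust them to wherever you copied the rule files and wherever your Alertmanager runs.

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 1m   # how often rules are evaluated (value is an assumption)

rule_files:
  # Hypothetical location of the rule files copied from the 'kubernetes' directory.
  - "rules/kubernetes/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # hypothetical Alertmanager address
```

Running promtool check rules against the rule files before reloading can catch syntax errors early. Prometheus re-reads its configuration on SIGHUP, or on an HTTP POST to /-/reload when it is started with --web.enable-lifecycle.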

Frequently Asked Questions

Are these alert rules compatible with kube-prometheus-stack?
How often do the alerts evaluate?
Is it safe to use these alerts in production?
Can I disable some alerts?

Ready to Get Started?

Get started with production-ready Kubernetes monitoring by integrating the alert templates from DrDroidLab: https://github.com/DrDroidLab/prometheus-alert-templates/blob/master/kubernetes


Doctor Droid