Production-Ready Kubernetes Monitoring with Prometheus: Alert Templates That Work
Monitoring Kubernetes at scale is critical for reliability and uptime. Prometheus offers powerful metric collection and alerting capabilities, but crafting effective alerts is challenging. The open-source template from DrDroidLab provides a curated set of production-grade alert rules for Kubernetes, helping SREs and platform engineers detect issues early and respond quickly. This post breaks down the key alert rules, explains how to use them, and offers tuning guidance for adapting them to your Kubernetes workloads.
Core Alert Rules

KubePodCrashLooping
Pod stability check
sum(increase(kube_pod_container_status_restarts_total[15m])) by (pod, namespace) > 5
Why this matters
Fires when a pod is crashing frequently (more than 5 restarts in 15 minutes), catching applications stuck in a CrashLoopBackOff state.
Tuning tips
Adjust the threshold (e.g., 5 restarts) and the time window (15m) to match the expected stability of your workloads—more tolerant for volatile development pods, more aggressive for production services.
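As a rough sketch of that tuning, the rule can be split into a stricter production variant and a looser development variant. The group name, the `prod-.*` and `dev-.*` namespace patterns, the thresholds, and the `severity` labels are all assumptions for illustration, not values from the template.

```yaml
groups:
  - name: kubernetes-pod-restarts-example   # hypothetical group name
    rules:
      # Stricter variant: fewer restarts tolerated over a shorter window.
      - alert: KubePodCrashLooping
        expr: |
          sum(increase(kube_pod_container_status_restarts_total{namespace=~"prod-.*"}[10m])) by (pod, namespace) > 3
        labels:
          severity: critical                # assumed label; align with your Alertmanager routes
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
      # Looser variant (hypothetical alert name) for volatile development pods.
      - alert: KubePodCrashLoopingDev
        expr: |
          sum(increase(kube_pod_container_status_restarts_total{namespace=~"dev-.*"}[30m])) by (pod, namespace) > 10
        labels:
          severity: warning
        annotations:
          summary: "Dev pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```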

KubePodNotReady
Pod readiness check
kube_pod_status_ready{condition="false"} == 1
Why this matters
Detects pods that are not in a ready state, indicating they're unhealthy or failing readiness checks.
Tuning tips
Consider filtering by namespace or critical workload labels to reduce noise from less critical pods.
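One possible way to scope the alert is a label matcher on `namespace`, combined with a `for` clause so short readiness blips are ignored. The namespace names and the 10-minute hold are assumed values chosen only to show the shape of the rule.

```yaml
groups:
  - name: kubernetes-pod-readiness-example  # hypothetical group name
    rules:
      - alert: KubePodNotReady
        # Only consider pods in namespaces treated as critical (assumed names).
        expr: kube_pod_status_ready{condition="false", namespace=~"prod-.*|payments"} == 1
        for: 10m                            # assumed hold time before the alert fires
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has not been ready for 10 minutes"
```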

KubePersistentVolumeFullInFourDays
Storage capacity warning
(kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) < 0.15
Why this matters
Warns when persistent volumes are running low on space (less than 15% available).
Tuning tips
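Despite its name, the expression is a static threshold and does not forecast anything. To catch storage issues earlier, estimate the consumption rate and project available space forward, for example with PromQL's predict_linear() over kubelet_volume_stats_available_bytes.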
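A forecasting variant might look like the sketch below, which keeps the 15% threshold and additionally requires that a linear projection of available bytes reaches zero within four days. The 6-hour lookback, the one-hour `for` clause, and the group name are assumptions; `predict_linear` is a standard PromQL function.

```yaml
groups:
  - name: kubernetes-storage-example        # hypothetical group name
    rules:
      - alert: KubePersistentVolumeFullInFourDays
        expr: |
          (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) < 0.15
          and
          predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0
        for: 1h                             # assumed; avoids firing on a brief write burst
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is projected to fill within four days"
          description: "Only {{ $value | humanizePercentage }} of the volume is still available."
```

The `and` operator keeps the left-hand ratio as the alert value, so the annotation reports the remaining free fraction.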

KubeCPUOvercommit
Resource overcommit check
sum(kube_resourcequota{type="hard", resource="limits.cpu"}) > sum(kube_node_status_allocatable_cpu_cores)
Why this matters
Checks whether the CPU limits granted by namespace resource quotas add up to more than the cluster's allocatable CPU, flagging potential overcommitment.
Tuning tips
This is a conservative check; validate it against actual usage and consider relaxing it in high-density environments that operate close to full utilization.
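For clusters running kube-state-metrics v2 or later, the per-resource node metrics were folded into `kube_node_status_allocatable` with a `resource` label, so the check may need to be written against that metric. The sketch below assumes such a setup and expresses the check as a ratio; the 1.5 threshold for a high-density environment and the group name are illustrative assumptions.

```yaml
groups:
  - name: kubernetes-cpu-overcommit-example # hypothetical group name
    rules:
      - alert: KubeCPUOvercommit
        # Ratio of quota CPU limits to allocatable CPU; above 1.0 means overcommitted.
        expr: |
          sum(kube_resourcequota{type="hard", resource="limits.cpu"})
            /
          sum(kube_node_status_allocatable{resource="cpu"})
            > 1.5
        for: 5m                             # assumed hold time
        labels:
          severity: warning
        annotations:
          summary: "Namespace CPU limit quotas exceed 150% of allocatable CPU"
```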

KubeMemoryOvercommit
Resource overcommit check
sum(kube_resourcequota{type="hard", resource="limits.memory"}) > sum(kube_node_status_allocatable_memory_bytes)
Why this matters
Similar to the CPU overcommit alert, but checks whether the memory limits granted by namespace resource quotas exceed the cluster's allocatable memory.
Tuning tips
For burstable workloads, some overcommitment is acceptable. Express the check as a ratio of quota memory limits to allocatable memory and tune the threshold to fire at around 90–100% to avoid unnecessary noise.
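A ratio-based version makes the threshold explicit. The sketch below uses two tiers, a warning at 90% and a critical alert at 100%; the tier values, the second alert name, and the group name are assumptions. It uses the metric name from the rule above; on kube-state-metrics v2 or later the denominator would be `kube_node_status_allocatable{resource="memory"}` instead.

```yaml
groups:
  - name: kubernetes-memory-overcommit-example  # hypothetical group name
    rules:
      - alert: KubeMemoryOvercommit
        expr: |
          sum(kube_resourcequota{type="hard", resource="limits.memory"})
            /
          sum(kube_node_status_allocatable_memory_bytes)
            > 0.9
        labels:
          severity: warning
        annotations:
          summary: "Memory limit quotas are at {{ $value | humanizePercentage }} of allocatable memory"
      # Hypothetical second tier that fires once quotas exceed allocatable memory.
      - alert: KubeMemoryOvercommitCritical
        expr: |
          sum(kube_resourcequota{type="hard", resource="limits.memory"})
            /
          sum(kube_node_status_allocatable_memory_bytes)
            > 1
        labels:
          severity: critical
        annotations:
          summary: "Memory limit quotas exceed allocatable memory"
```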

KubeNodeNotReady
Service availability check
kube_node_status_condition{condition="Ready", status="true"} == 0
Why this matters
Triggers when a Kubernetes node is not in a Ready state, potentially due to system failure or network issues.
Tuning tips
Set a 'for' duration (e.g., 5m) so that transient drops in the Ready condition do not trigger alerts. Prioritize nodes in production environments.
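With a hold time added, the rule could look like the following sketch. The 5m value comes from the tip above; the group name, severity label, and annotation text are assumptions.

```yaml
groups:
  - name: kubernetes-nodes-example          # hypothetical group name
    rules:
      - alert: KubeNodeNotReady
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 5m                             # node must stay not-Ready for 5 minutes before firing
        labels:
          severity: critical                # assumed label
        annotations:
          summary: "Node {{ $labels.node }} has not been Ready for 5 minutes"
```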

Quick Setup
1. Clone the repository from https://github.com/DrDroidLab/prometheus-alert-templates.
2. Navigate to the 'kubernetes' directory and include the alert rule files in your Prometheus rule configuration under 'rule_files'.
3. Reload Prometheus to apply the new rules. Ensure Alertmanager is configured to receive and route alerts accordingly.
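The relevant part of a Prometheus configuration might look like the sketch below. The rule file path, the file extension, and the Alertmanager address are assumptions; adjust them to wherever you copied the files and to your own deployment.

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 1m                   # how often rules are evaluated

rule_files:
  # Assumed location of the copied rule files; glob patterns are supported.
  - /etc/prometheus/rules/kubernetes/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093"           # assumed Alertmanager host and port
```

Before reloading, running `promtool check rules` against the rule files catches syntax errors early. Prometheus reloads its configuration on SIGHUP, or on an HTTP POST to `/-/reload` when it was started with `--web.enable-lifecycle`.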
Frequently Asked Questions
Are these alert rules compatible with kube-prometheus-stack?
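Mostly. The rules rely on standard Kubernetes metrics exposed by kube-state-metrics and the kubelet, both of which kube-prometheus-stack ships. Two things are worth checking. Metric names can differ between kube-state-metrics versions: v2 exposes allocatable resources as kube_node_status_allocatable with a resource label instead of kube_node_status_allocatable_cpu_cores and kube_node_status_allocatable_memory_bytes. Also, the Prometheus Operator used by kube-prometheus-stack loads rules from PrometheusRule resources rather than from 'rule_files'.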
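Because kube-prometheus-stack is built on the Prometheus Operator, rules are usually supplied as a PrometheusRule resource. The sketch below wraps one of the alerts that way; the resource name, the namespace, and the `release` label value are assumptions and must match your Helm release and the operator's rule selector.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alert-templates          # hypothetical resource name
  namespace: monitoring                     # assumed namespace
  labels:
    release: kube-prometheus-stack          # assumed; must match the operator's ruleSelector
spec:
  groups:
    - name: kubernetes-nodes-example        # hypothetical group name
      rules:
        - alert: KubeNodeNotReady
          expr: kube_node_status_condition{condition="Ready", status="true"} == 0
          for: 5m
          labels:
            severity: critical
```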
How often do the alerts evaluate?
By default, Prometheus evaluates rules every minute. You can change this globally with the 'evaluation_interval' setting in your Prometheus config, or per rule group with the group's 'interval' field.
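As an illustration, a rule group can carry its own `interval` that overrides the global setting for that group only. The 30s value and the group name below are assumptions.

```yaml
groups:
  - name: kubernetes-nodes-fast-example     # hypothetical group name
    interval: 30s                           # overrides global evaluation_interval for this group
    rules:
      - alert: KubeNodeNotReady
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 5m
        labels:
          severity: critical
```

A shorter interval detects changes sooner at the cost of more rule evaluations, so it is usually reserved for a small set of high-priority groups.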
Is it safe to use these alerts in production?
Yes, but you should tune thresholds and filters based on your workload characteristics and alert fatigue tolerance.
Can I disable some alerts?
Absolutely. You can cherry-pick or comment out rules in the alert file YAML to match your team's monitoring priorities.
Ready to Get Started?
Integrate the alert templates from DrDroidLab into your Prometheus setup: https://github.com/DrDroidLab/prometheus-alert-templates/blob/master/kubernetes


Deep Sea Tech Inc. — Made with ❤️ in Bangalore & San Francisco 🏢