Production-Ready Template

Effective Cassandra Monitoring with Prometheus Alerting Rules

Monitoring Apache Cassandra is critical for keeping a distributed database available and performant. This post walks through a curated set of Prometheus alerting rules for Cassandra, sourced from the open-source template at DrDroidLab/prometheus-alert-templates. For each alert it shows the PromQL expression, explains what the expression detects, and gives tuning guidance so SRE teams and DevOps engineers can adapt the alert to their environment.

Core Alert Rules

CassandraAvailabilityLow
Service availability check
sum(last_over_time(cassandra_stats_up[30s])) < 1
Why this matters
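This rule checks whether Cassandra is reporting as available. It takes the most recent cassandra_stats_up sample from each series within the last 30 seconds and fires when their sum is below 1, that is, when every reporting instance reports 0. Note that if no samples arrive at all within the window, the expression returns an empty result and the alert does not fire, so a complete loss of the metric needs a separate absent() check.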
Tuning tips
Adjust the time window (e.g., 30s) based on how often Prometheus scrapes Cassandra metrics. Keep it at least as long as one scrape interval, otherwise the window may contain no sample; widen it to reduce flapping on unreliable networks.
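As a rough sketch, the expression above could be wrapped in a rule file like the one below. The group name, the `for` durations, the `severity` label, and the companion `CassandraUpMetricAbsent` rule are assumptions added for illustration and are not taken from the template; the companion rule uses the standard PromQL `absent()` function to cover the case where the metric disappears entirely.

```yaml
groups:
  - name: cassandra-availability        # assumed group name
    rules:
      - alert: CassandraAvailabilityLow
        expr: sum(last_over_time(cassandra_stats_up[30s])) < 1
        for: 1m                          # assumed hold time before firing
        labels:
          severity: critical             # assumed label convention
        annotations:
          summary: "Cassandra is reporting as unavailable"

      # Hypothetical companion rule: fires when no cassandra_stats_up
      # series exists at all, which the rule above cannot detect.
      - alert: CassandraUpMetricAbsent
        expr: absent(cassandra_stats_up)
        for: 2m                          # assumed hold time
        labels:
          severity: critical
        annotations:
          summary: "No cassandra_stats_up samples are being received"
```

Whether the second rule is useful depends on how the exporter behaves when Cassandra is down, so treat it as a starting point rather than a drop-in.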
CassandraHeapMemoryUsageHigh
Memory efficiency warning
avg_over_time(jvm_memory_bytes_used{area="heap"}[5m]) / avg_over_time(jvm_memory_bytes_max{area="heap"}[5m]) > 0.9
Why this matters
This rule fires when average heap usage over a 5-minute window exceeds 90% of the maximum heap size, a potential sign of memory leaks or high GC pressure. Averaging over the window smooths out short spikes between garbage collections.
Tuning tips
Tune the threshold (0.9) based on typical workload profiles. Production workloads may have safe ranges closer to 80%.
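For workloads where 80% is the more appropriate ceiling, a lower-threshold variant might look like the sketch below. The 0.8 threshold follows the tuning tip above; the group name, `for: 10m`, the `severity` label, and the annotation text are assumptions, and `instance` is the label Prometheus normally attaches to scraped targets.

```yaml
groups:
  - name: cassandra-heap                 # assumed group name
    rules:
      - alert: CassandraHeapMemoryUsageHigh
        # Same ratio as the template, with the threshold lowered to 80%.
        expr: |
          avg_over_time(jvm_memory_bytes_used{area="heap"}[5m])
            / avg_over_time(jvm_memory_bytes_max{area="heap"}[5m]) > 0.8
        for: 10m                         # assumed: require the condition to persist
        labels:
          severity: warning              # assumed label convention
        annotations:
          summary: "Heap usage above 80% on {{ $labels.instance }}"
```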
CassandraReadLatencyHigh
Critical Performance Bottleneck
rate(cassandra_table_read_latency_total[5m]) > 0.01
Why this matters
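This rule alerts when the per-second rate of increase of cassandra_table_read_latency_total, measured over 5 minutes, exceeds 0.01. Assuming the metric is a cumulative latency counter, as its _total suffix suggests, this value is not an average per-request latency, and what 0.01 means depends on the metric's units. High read latency can point to overloaded nodes or disk I/O problems.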
Tuning tips
Base this threshold on historical latency baselines. Consider alerting per table to isolate problematic areas.
CassandraWriteLatencyHigh
Critical Performance Bottleneck
rate(cassandra_table_write_latency_total[5m]) > 0.01
Why this matters
This alert fires when the per-second rate of increase of cassandra_table_write_latency_total over 5 minutes exceeds 0.01, indicating trouble on the write path such as compaction backlog or network pressure.
Tuning tips
As with read latency, thresholds should be adjusted based on SLAs and normal workload behavior.
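The two latency rules can be sketched together in one rule group. The expressions are the ones shown above; the group name, `for` durations, `severity` label, and annotations are assumptions. The `{{ $labels.table }}` reference assumes the exporter attaches a `table` label to these series, which the source does not confirm; if the label is missing it simply renders as empty.

```yaml
groups:
  - name: cassandra-latency              # assumed group name
    rules:
      - alert: CassandraReadLatencyHigh
        expr: rate(cassandra_table_read_latency_total[5m]) > 0.01
        for: 5m                          # assumed hold time
        labels:
          severity: warning              # assumed label convention
        annotations:
          # "table" is an assumed label name; adjust to your exporter.
          summary: "High read latency on table {{ $labels.table }}"
          description: "Current rate: {{ $value }}"

      - alert: CassandraWriteLatencyHigh
        expr: rate(cassandra_table_write_latency_total[5m]) > 0.01
        for: 5m                          # assumed hold time
        labels:
          severity: warning
        annotations:
          summary: "High write latency on table {{ $labels.table }}"
          description: "Current rate: {{ $value }}"
```

Because `rate()` returns one result per series, each table that crosses the threshold raises its own alert, which helps isolate the problematic area.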
CassandraDroppedMessages
Resource bottleneck warning
increase(cassandra_dropped_messages_total[5m]) > 0
Why this matters
Fires if there are dropped messages in the last 5 minutes. This typically signals resource bottlenecks or misconfigured network layers.
Tuning tips
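If frequent false positives occur, raise the threshold above 0 to match your normal message rate. If you also widen the time window (e.g., [10m]), scale the threshold with it, because a longer window on its own makes a > 0 condition fire more readily, not less.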
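A less sensitive variant of this rule might look like the following sketch. The 10-minute window comes from the tuning tip; the threshold of 10 dropped messages, the group name, and the `severity` label are purely illustrative assumptions and should be replaced with values derived from your own baseline.

```yaml
groups:
  - name: cassandra-messaging            # assumed group name
    rules:
      - alert: CassandraDroppedMessages
        # Tolerate a small number of drops over a longer window.
        # The value 10 is a hypothetical threshold, not a recommendation.
        expr: increase(cassandra_dropped_messages_total[10m]) > 10
        labels:
          severity: warning              # assumed label convention
        annotations:
          summary: "Cassandra dropped messages on {{ $labels.instance }}"
          description: "Messages dropped in the last 10 minutes: {{ $value }}"
```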
CassandraDiskUsageHigh
Disk capacity warning
node_filesystem_avail_bytes{mountpoint="/var/lib/cassandra"} / node_filesystem_size_bytes{mountpoint="/var/lib/cassandra"} < 0.15
Why this matters
Fires when less than 15% of the filesystem holding the Cassandra data directory is free. Catching this early is essential for preventing write failures, and Cassandra also needs free headroom on disk for compaction.
Tuning tips
Customize mountpoint if data is on a different path. Adjust threshold based on storage policies (e.g., set to 0.20 for proactive alerts).
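As an illustration of both tuning tips, the sketch below moves the check to a different mount and raises the free-space threshold to 20%. The path `/data/cassandra` is a hypothetical example, and the group name, `for` duration, and `severity` label are assumptions.

```yaml
groups:
  - name: cassandra-disk                 # assumed group name
    rules:
      - alert: CassandraDiskUsageHigh
        # /data/cassandra is a hypothetical mountpoint; use the mount that
        # actually holds your Cassandra data directory.
        expr: |
          node_filesystem_avail_bytes{mountpoint="/data/cassandra"}
            / node_filesystem_size_bytes{mountpoint="/data/cassandra"} < 0.20
        for: 5m                          # assumed hold time
        labels:
          severity: warning              # assumed label convention
        annotations:
          summary: "Less than 20% disk free on {{ $labels.instance }}"
```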

Quick Setup

1. Clone the repository from https://github.com/DrDroidLab/prometheus-alert-templates
2. Copy the 'cassandra' alert rules file into your Prometheus rules directory
3. Include the rule file in your Prometheus configuration under 'rule_files'
4. Reload or restart Prometheus to apply the new rules
5. Ensure Alertmanager is configured to handle alert notifications triggered by these rules
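The configuration side of these steps can be sketched as a fragment of prometheus.yml. The rule file path and the Alertmanager address are assumptions for illustration; substitute the actual file name from the repository and your own endpoints.

```yaml
# prometheus.yml (fragment)

rule_files:
  # Hypothetical path: point this at wherever you copied the Cassandra
  # rules file. Validate it first with: promtool check rules <file>
  - "rules/cassandra.yml"

alerting:
  alertmanagers:
    - static_configs:
        # Assumed address; 9093 is Alertmanager's default port.
        - targets: ["localhost:9093"]

# To apply changes without a restart, send SIGHUP to the Prometheus
# process, or POST to /-/reload if Prometheus was started with
# --web.enable-lifecycle.
```

Notification channels themselves (email, Slack, and so on) are defined in Alertmanager's own configuration, not in this file.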

Frequently Asked Questions

Do these alerts require a specific Cassandra exporter?
Can I scope alerts to a specific Cassandra keyspace or node?
How do I prevent alert flapping due to short spikes?
Where should I define notification channels for these alerts?

Ready to Get Started?

Use the Cassandra alerting template from DrDroidLab/prometheus-alert-templates to proactively monitor and protect your Cassandra clusters from critical issues. Start by cloning the repo and integrating the alert rules into your observability stack.

Deep Sea Tech Inc. — Made with ❤️ in Bangalore & San Francisco 🏢

Doctor Droid