Effective Cassandra Monitoring with Prometheus Alerting Rules

Monitoring Apache Cassandra is critical for keeping a distributed database highly available and performant. This post walks through a curated set of Prometheus alerting rules for Cassandra, sourced from the open-source template at DrDroidLab/prometheus-alert-templates. For each alert it shows the PromQL expression, explains what the expression detects, and gives tuning guidance so SRE teams and DevOps engineers can adapt the alert to their environment.

Core Alert Rules

CassandraAvailabilityLow

Service availability check

sum(last_over_time(cassandra_stats_up[30s])) < 1

Why this matters

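This rule checks whether Cassandra is reporting as available. It takes the most recent cassandra_stats_up sample of each series within the last 30 seconds and fires when their sum is below 1, meaning no instance currently reports itself as up, which indicates service downtime or a connectivity issue. Note that if no samples arrive at all in that window, the expression returns an empty result and the alert does not fire, so a completely silent exporter needs a separate absent() check.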

Tuning tips

Set the time window (30s here) according to how often Prometheus scrapes Cassandra metrics; at least twice the scrape interval keeps a single missed scrape from leaving the window empty. Widen it further to reduce flapping on unreliable networks.
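As a rough illustration of these tips, the expression could be wrapped in a rule file like the one below. This is a sketch, not the template's actual file: the group name, the 1m window (which assumes a 30s scrape interval), the for duration, the severity label, and the companion CassandraMetricsAbsent rule are all assumptions added for illustration.

```yaml
groups:
  - name: cassandra-availability   # hypothetical group name
    rules:
      - alert: CassandraAvailabilityLow
        # Window widened to 1m, assuming a 30s scrape interval (2x the interval).
        expr: |
          sum(last_over_time(cassandra_stats_up[1m])) < 1
        for: 1m                     # assumed hold time to damp flapping
        labels:
          severity: critical        # assumed label scheme
        annotations:
          summary: "No Cassandra instance is reporting as up"

      # Hypothetical companion rule: fires when the metric is missing entirely,
      # a case the expression above cannot detect because it returns no data.
      - alert: CassandraMetricsAbsent
        expr: |
          absent(cassandra_stats_up)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "cassandra_stats_up has not been scraped recently"
```

The for clause makes Prometheus wait until the condition has held continuously for that duration before firing, which is usually a gentler way to reduce noise than widening the range window alone.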

CassandraHeapMemoryUsageHigh

Memory efficiency warning

avg_over_time(jvm_memory_bytes_used{area="heap"}[5m]) / avg_over_time(jvm_memory_bytes_max{area="heap"}[5m]) > 0.9

Why this matters

This rule detects when average heap usage over a 5-minute window exceeds 90% of the maximum heap size, a potential sign of a memory leak or sustained GC pressure.

Tuning tips

Tune the threshold (0.9) to your typical workload profile. For production workloads the safe range may end closer to 80%, in which case a threshold of 0.8 gives earlier warning.
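One way to act on this tip is to split the alert into two tiers, a warning at 80% and a critical alert at 90%. The sketch below assumes that layout; the warning rule name, the for durations, and the severity labels are illustrative choices, not part of the source template.

```yaml
groups:
  - name: cassandra-heap            # hypothetical group name
    rules:
      # Hypothetical early-warning tier at 80% heap usage.
      - alert: CassandraHeapMemoryUsageWarning
        expr: |
          avg_over_time(jvm_memory_bytes_used{area="heap"}[5m])
            / avg_over_time(jvm_memory_bytes_max{area="heap"}[5m]) > 0.8
        for: 10m                    # assumed; longer hold for the softer tier
        labels:
          severity: warning
        annotations:
          summary: "Heap usage above 80% on {{ $labels.instance }}"

      # Original threshold kept as the critical tier.
      - alert: CassandraHeapMemoryUsageHigh
        expr: |
          avg_over_time(jvm_memory_bytes_used{area="heap"}[5m])
            / avg_over_time(jvm_memory_bytes_max{area="heap"}[5m]) > 0.9
        for: 5m                     # assumed hold time
        labels:
          severity: critical
        annotations:
          summary: "Heap usage above 90% on {{ $labels.instance }}"
```

Because both sides of the division carry the same label set, the ratio is evaluated per series, so each JVM instance is alerted on independently.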

CassandraReadLatencyHigh

Critical Performance Bottleneck

rate(cassandra_table_read_latency_total[5m]) > 0.01

Why this matters

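This rule monitors read latency across tables and fires when the per-second rate of the cassandra_table_read_latency_total counter exceeds 0.01. What that value means in practice depends on the unit your exporter uses for this metric. High read latency can point to overloaded nodes or disk I/O problems.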

Tuning tips

Base this threshold on historical latency baselines. Consider alerting per table to isolate problematic areas.
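A per-table variant might look like the sketch below. It assumes the exporter attaches keyspace and table labels to the latency metric, which the source does not confirm; check the label names your exporter actually emits before using it. The rule name, for duration, and severity are also illustrative.

```yaml
groups:
  - name: cassandra-read-latency    # hypothetical group name
    rules:
      - alert: CassandraReadLatencyHighPerTable   # hypothetical rule name
        # Aggregates by keyspace and table (assumed label names) so that
        # each table produces its own alert instance.
        expr: |
          sum by (keyspace, table) (
            rate(cassandra_table_read_latency_total[5m])
          ) > 0.01
        for: 5m                     # assumed hold time
        labels:
          severity: warning         # assumed label scheme
        annotations:
          summary: "High read latency on {{ $labels.keyspace }}.{{ $labels.table }}"
```

Replace 0.01 with a value taken from your own baseline, for example a multiple of the typical rate seen on a dashboard over the past few weeks. The same pattern can be applied to the write latency metric.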

CassandraWriteLatencyHigh

Critical Performance Bottleneck

rate(cassandra_table_write_latency_total[5m]) > 0.01

Why this matters

This alert fires when the per-second rate of the cassandra_table_write_latency_total counter exceeds 0.01, which points to problems on the write path such as compaction or network pressure.

Tuning tips

As with read latency, thresholds should be adjusted based on SLAs and normal workload behavior.

CassandraDroppedMessages

Dropped message warning

increase(cassandra_dropped_messages_total[5m]) > 0

Why this matters

Fires if any messages were dropped in the last 5 minutes. Dropped messages typically signal resource bottlenecks or a misconfigured network layer.

Tuning tips

If false positives are frequent, raise the threshold or widen the time window (e.g., [10m]), depending on your normal message rate.
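A less sensitive version of this alert could be sketched as follows. The threshold of 5 is a made-up placeholder, and the group name and severity are assumptions; pick a value from the drop counts you see under normal load.

```yaml
groups:
  - name: cassandra-dropped-messages   # hypothetical group name
    rules:
      - alert: CassandraDroppedMessages
        # Wider 10m window and a non-zero threshold (5 is a placeholder)
        # so that an isolated drop does not page anyone.
        expr: |
          increase(cassandra_dropped_messages_total[10m]) > 5
        labels:
          severity: warning            # assumed label scheme
        annotations:
          summary: "Cassandra dropped messages in the last 10 minutes"
```

Keep in mind that increase() extrapolates over the window, so its result can be fractional; comparing against a small integer rather than 0 avoids alerts on those fractional values.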

CassandraDiskUsageHigh

Disk capacity warning

node_filesystem_avail_bytes{mountpoint="/var/lib/cassandra"} / node_filesystem_size_bytes{mountpoint="/var/lib/cassandra"} < 0.15

Why this matters

Triggers when free space on the filesystem holding the Cassandra data directory falls below 15% of its total size. Alerting before the disk fills is essential for preventing write failures.

Tuning tips

Change the mountpoint label if your data lives on a different path. Adjust the threshold to your storage policy (e.g., 0.20 for earlier, more proactive alerts).
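Both tips can be combined in a single rule, sketched below. The /data/cassandra mountpoint is a hypothetical path standing in for wherever your data directory is mounted, and the for duration and severity are likewise assumptions.

```yaml
groups:
  - name: cassandra-disk               # hypothetical group name
    rules:
      - alert: CassandraDiskUsageHigh
        # /data/cassandra is a placeholder mountpoint; 0.20 is the more
        # proactive threshold suggested in the tuning tips.
        expr: |
          node_filesystem_avail_bytes{mountpoint="/data/cassandra"}
            / node_filesystem_size_bytes{mountpoint="/data/cassandra"} < 0.20
        for: 10m                       # assumed hold time
        labels:
          severity: warning            # assumed label scheme
        annotations:
          summary: "Less than 20% disk free on {{ $labels.instance }}"
```

The mountpoint value must match exactly what node_exporter reports for the filesystem, so it is worth confirming it with a quick query on node_filesystem_size_bytes first.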

Quick Setup

Clone the repository from https://github.com/DrDroidLab/prometheus-alert-templates

Copy the 'cassandra' alert rules file into your Prometheus rules directory

Include the rule file in your Prometheus configuration under 'rule_files'

Reload or restart Prometheus to apply the new rules

Ensure Alertmanager is configured to handle alert notifications triggered by these rules
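The configuration side of these steps might look like the prometheus.yml fragment below. The rule file path and the Alertmanager address are assumptions for illustration; substitute the locations used in your own deployment.

```yaml
# Fragment of prometheus.yml (paths and targets are placeholders).
rule_files:
  - "rules/cassandra.yml"        # assumed location of the copied rule file

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"   # assumed Alertmanager address
```

Before reloading, the rule file can be validated with promtool check rules followed by the file path. Prometheus reloads its configuration on SIGHUP, or on an HTTP POST to /-/reload when it was started with --web.enable-lifecycle.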

Ready to Get Started?

Use the Cassandra alerting template from DrDroidLab/prometheus-alert-templates to proactively monitor and protect your Cassandra clusters from critical issues. Start by cloning the repo and integrating the alert rules into your observability stack.
