Prometheus Label Cardinality Explosion

Too many unique label combinations causing high cardinality.

Understanding Prometheus and Its Purpose

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It records real-time metrics in a time series database built around a multi-dimensional data model, in which each time series is identified by a metric name and a set of key/value label pairs. Its query language, PromQL, lets you slice and aggregate metrics across those dimensions.

Recognizing the Symptom: Label Cardinality Explosion

One of the common issues encountered when using Prometheus is a label cardinality explosion. This problem manifests as high memory usage and slow query performance. Users may notice that their Prometheus server is consuming an excessive amount of resources, or queries are taking longer than expected to execute.

What is Observed?

When label cardinality explosion occurs, you might observe the following symptoms:

  • Increased memory consumption by the Prometheus server.
  • Slower query execution times.
  • Potential out-of-memory (OOM) errors.

Explaining the Issue: High Cardinality

Label cardinality explosion happens when there are too many unique label combinations in your metrics. Prometheus stores each unique combination of labels as a separate time series. If you have labels that generate a large number of unique combinations, it can lead to high cardinality, which in turn causes performance issues.

Root Cause Analysis

The root cause of this issue is often the use of labels that have a high number of unique values, such as user IDs, request IDs, or other identifiers that change frequently. These labels can exponentially increase the number of time series stored in Prometheus.
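To see why such labels are dangerous, note that the worst-case number of time series for a metric is the product of the unique value counts of its labels. The following minimal Python sketch (label names and counts are illustrative, not from the source) makes the multiplication concrete:

```python
# Worst-case series count = product of unique values per label.
from math import prod

def series_count(label_values: dict[str, int]) -> int:
    """Worst-case number of time series for one metric,
    given the number of unique values each label can take."""
    return prod(label_values.values())

# A bounded label set stays small:
safe = series_count({"method": 5, "status_class": 5, "handler": 30})
print(safe)       # 750 series

# Adding a hypothetical user_id label with 10,000 users multiplies that:
exploded = series_count({"method": 5, "status_class": 5,
                         "handler": 30, "user_id": 10_000})
print(exploded)   # 7,500,000 series
```

One unbounded label turned 750 series into 7.5 million; the growth is multiplicative, not additive.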

Steps to Fix the Issue

To resolve the label cardinality explosion, follow these steps:

1. Identify High-Cardinality Labels

Use the following query to identify metrics with high cardinality:

count by (__name__)({__name__=~".+"})

This query will help you identify which metrics have a large number of time series.
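To narrow the output to the worst offenders, you can wrap the count in topk, and then inspect how many distinct values a suspect label takes. These are hedged examples; "path" below stands in for whatever label you suspect:

```promql
# Top 10 metric names by number of time series:
topk(10, count by (__name__)({__name__=~".+"}))

# Number of distinct values of a suspect label (here "path", as an example):
count(count by (path)({path=~".+"}))
```

Recent Prometheus versions also expose per-metric and per-label cardinality statistics on the web UI's TSDB status page, which can save you from running these queries by hand.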

2. Reduce Unique Labels

Once you've identified the problematic metrics, consider reducing the number of unique labels. Avoid using labels with high cardinality such as user IDs or request IDs. Instead, use labels that have a limited set of possible values.
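One practical way to keep a label bounded is to collapse raw values into a small fixed set before attaching them to a metric. The sketch below (a hypothetical helper, not part of any Prometheus client API) maps arbitrary HTTP status codes to just five label values:

```python
# Collapse an unbounded-ish value (a raw status code) into a small,
# fixed set of label values suitable for a Prometheus label.
def status_class(code: int) -> str:
    """Map any HTTP status code to one of five label values."""
    if 100 <= code <= 599:
        return f"{code // 100}xx"
    return "unknown"

print(status_class(200))  # 2xx
print(status_class(404))  # 4xx
print(status_class(999))  # unknown
```

The same idea applies to URLs (keep the route template, not the concrete path), error messages (keep the error class), and similar values: label with the category, and leave the raw identifier to your logs or traces.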

3. Use Relabeling Rules

Implement relabeling rules in your prometheus.yml configuration to drop or modify labels that contribute to high cardinality. Note that rules which inspect metric names or metric labels belong under metric_relabel_configs (applied after each scrape); relabel_configs runs before the scrape, on target labels, and never sees __name__. For example, to drop a metric entirely:

metric_relabel_configs:
  - source_labels: [__name__]
    regex: "high_cardinality_metric"
    action: drop

For more information on relabeling, refer to the Prometheus documentation.
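Dropping whole metrics is not the only option: the labeldrop and replace actions can remove or collapse an offending label while keeping the series. A hedged sketch, where "request_id" and "path" are example label names standing in for whatever is exploding in your setup:

```yaml
metric_relabel_configs:
  # Remove a high-cardinality label entirely; the remaining
  # label combinations collapse into far fewer series:
  - regex: "request_id"
    action: labeldrop
  # Collapse a path label to its first segment:
  - source_labels: [path]
    regex: "(/[^/]*).*"
    replacement: "$1"
    target_label: path
    action: replace
```

Be careful with labeldrop: if two series become identical after the label is removed, the scrape will fail with a duplicate-series error, so make sure the dropped label is not the only thing distinguishing them.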

4. Monitor and Adjust

After making changes, monitor your Prometheus server's performance and adjust your configuration as needed. Continuously review your metrics and labels to ensure they remain efficient.

Conclusion

By understanding and addressing label cardinality explosion, you can significantly improve the performance and reliability of your Prometheus monitoring setup. Regularly review your metrics and labels, and make use of Prometheus's powerful configuration options to maintain an efficient monitoring system.
