Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It records real-time metrics in a time series database built around a multi-dimensional data model, in which each time series is identified by a metric name and a set of key-value labels. Its query language, PromQL, lets you filter and aggregate along those label dimensions.
One of the common issues encountered when using Prometheus is a label cardinality explosion. Its symptoms are high memory usage and slow query performance: the Prometheus server consumes an excessive amount of resources, or queries take longer than expected to execute.
Label cardinality explosion happens when there are too many unique label combinations in your metrics. Prometheus stores each unique combination of labels as a separate time series. If you have labels that generate a large number of unique combinations, it can lead to high cardinality, which in turn causes performance issues.
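The root cause of this issue is often the use of labels that have a high number of unique values, such as user IDs, request IDs, or other identifiers that change frequently. The effect is multiplicative: the number of possible time series for a metric is bounded by the product of the number of distinct values of each of its labels, so a single unbounded label can multiply the number of time series stored in Prometheus many times over.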
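The arithmetic behind this can be sketched in a few lines. The label names and value counts below are hypothetical, and the result is an upper bound, since not every combination of label values necessarily occurs in practice.

```python
from math import prod


def max_series(label_value_counts):
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical metric with three bounded labels.
bounded = {"method": 4, "status": 5, "instance": 10}
print(max_series(bounded))  # 200

# Adding one unbounded label (10,000 distinct user IDs) multiplies the bound.
with_user_id = {**bounded, "user_id": 10_000}
print(max_series(with_user_id))  # 2000000
```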
To resolve the label cardinality explosion, follow these steps:
Use the following query to identify metrics with high cardinality:
count by (__name__)({__name__=~".+"})
This query returns the number of time series per metric name. The metrics with the largest counts are the ones to investigate first.
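If you prefer to run this check from a script, the following is a minimal sketch that sends a similar query to the Prometheus HTTP API (`/api/v1/query`) and sorts the result. The base URL `http://localhost:9090`, the `topk(10, ...)` wrapper, and the function names are assumptions made for illustration; adjust them for your setup.

```python
import json
import urllib.parse
import urllib.request

# Assumed query: the ten metric names with the most time series.
QUERY = 'topk(10, count by (__name__)({__name__=~".+"}))'


def parse_series_counts(payload):
    """Turn an instant-query response into (metric_name, series_count) pairs, largest first."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = []
    for sample in payload["data"]["result"]:
        name = sample["metric"].get("__name__", "")
        _timestamp, value = sample["value"]  # the value arrives as a string
        rows.append((name, int(float(value))))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_series_counts(base_url="http://localhost:9090"):
    """Run QUERY against a Prometheus server (base_url is an assumed default)."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_series_counts(json.load(response))
```

Calling `fetch_series_counts()` requires a reachable Prometheus server; `parse_series_counts` can be exercised on a saved response without one.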
Once you've identified the problematic metrics, reduce the number of unique label values they carry. Avoid labels with high cardinality such as user IDs or request IDs. Instead, use labels that have a small, bounded set of possible values.
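One way to bound a label at instrumentation time is to normalize its value before recording it. The sketch below assumes a hypothetical `path` label built from request URLs and replaces numeric IDs and UUIDs with placeholders; the function name and the placeholder strings are illustrative, not part of any Prometheus client library.

```python
import re

# Path segments that look like UUIDs or plain integers are treated as identifiers.
_UUID_SEGMENT = re.compile(
    r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)"
)
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def normalize_path(path):
    """Collapse identifier-like path segments so the label has a bounded set of values."""
    path = _UUID_SEGMENT.sub("/:uuid", path)
    return _NUMERIC_SEGMENT.sub("/:id", path)


print(normalize_path("/users/12345/orders"))  # /users/:id/orders
```

With this in place, every user's order page is recorded under a single label value rather than one value per user.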
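Implement relabeling rules in your prometheus.yml configuration file to drop metrics or labels that contribute to high cardinality. Rules that match on the metric name must go under metric_relabel_configs in the relevant scrape configuration, because they are applied to scraped samples; relabel_configs is applied to targets before the scrape, when the __name__ label does not exist yet. For example, to drop every series of a metric named high_cardinality_metric:
metric_relabel_configs:
  - source_labels: ["__name__"]
    regex: "high_cardinality_metric"
    action: drop
To keep a metric but remove a single high-cardinality label from it, use the labeldrop action, whose regex is matched against label names rather than values.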
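To make the matching behavior of a drop rule concrete, here is a rough Python simulation. It assumes the documented semantics: the values of the source labels are joined with the separator (`;` by default) and the regex must match the whole joined string. Prometheus uses RE2 while this sketch uses Python's `re` module, so treat it as an approximation; `keep_sample` is a hypothetical helper, not a Prometheus API.

```python
import re


def keep_sample(labels, source_labels, regex, separator=";"):
    """Return False if a 'drop' rule with these settings would discard the sample."""
    # Missing labels contribute an empty string to the joined value.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # Relabeling regexes are fully anchored, hence fullmatch rather than search.
    return re.fullmatch(regex, joined) is None


sample = {"__name__": "high_cardinality_metric", "user_id": "42"}
print(keep_sample(sample, ["__name__"], "high_cardinality_metric"))  # False
```

Because the match is anchored, a metric named `high_cardinality_metric_total` would not be dropped by this rule unless the regex is widened, for example to `high_cardinality_metric.*`.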
For more information on relabeling, refer to the Prometheus documentation.
After making changes, monitor your Prometheus server's performance and adjust your configuration as needed. Continuously review your metrics and labels to ensure they remain efficient.
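A simple way to review changes over time is to compare per-metric series counts from two points in time. The sketch below assumes you have already collected two hypothetical snapshots as dictionaries mapping metric name to series count; the function name and the default growth factor of 2 are illustrative choices.

```python
def growing_metrics(before, after, factor=2.0):
    """List metrics that are new or whose series count grew by at least `factor`."""
    flagged = []
    for name, new_count in after.items():
        old_count = before.get(name, 0)
        if old_count == 0 or new_count >= old_count * factor:
            flagged.append((name, old_count, new_count))
    # Largest current series count first.
    return sorted(flagged, key=lambda row: row[2], reverse=True)


before = {"http_requests_total": 100, "queue_depth": 50}
after = {"http_requests_total": 120, "queue_depth": 500, "cache_hits_total": 10}
print(growing_metrics(before, after))
# [('queue_depth', 50, 500), ('cache_hits_total', 0, 10)]
```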
By understanding and addressing label cardinality explosion, you can significantly improve the performance and reliability of your Prometheus monitoring setup. Keeping label values bounded and using Prometheus's relabeling options helps the system stay efficient as your metrics evolve.