Prometheus High Memory Usage

Prometheus is consuming excessive memory due to high-cardinality metrics or long retention settings.

Understanding Prometheus and Its Purpose

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It is now a standalone open-source project, maintained independently of any company. Prometheus collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are observed.

For more information about Prometheus, you can visit the official Prometheus website.

Identifying the Symptom: High Memory Usage

One common issue users encounter when running Prometheus is high memory usage. This manifests as the Prometheus server consuming a large amount of RAM, potentially leading to degraded performance or to the process being terminated by the kernel's out-of-memory (OOM) killer if left unaddressed.
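To confirm the symptom, you can query Prometheus's own self-monitoring metrics. The examples below assume Prometheus scrapes itself under a job named prometheus, which is the default in the example configuration shipped with Prometheus:

# Resident memory of the Prometheus process, in bytes
process_resident_memory_bytes{job="prometheus"}

# Heap memory currently held by the Go runtime
go_memstats_heap_inuse_bytes{job="prometheus"}

A steadily growing resident set size usually points to a growing number of active time series rather than a one-off spike.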

Exploring the Issue: Causes of High Memory Usage

High memory usage in Prometheus is most often caused by high-cardinality metrics or long retention settings. Cardinality refers to the number of unique time series that Prometheus is tracking: each unique combination of metric name and label set creates a new time series, and every active series carries an in-memory cost. For example, a metric with labels for 5 HTTP methods, 5 status codes, and 100 instances can produce up to 5 × 5 × 100 = 2,500 distinct series on its own, so these combinations add up quickly and consume significant memory resources.

Additionally, if the retention period for metrics is set too long, Prometheus will store more data, further increasing memory usage. You can learn more about metric cardinality from the Prometheus naming best practices.
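To find out which metrics contribute the most series, you can ask Prometheus directly. The queries below are a common starting point; the curl example assumes Prometheus is listening on its default address of localhost:9090:

# Top 10 metric names by number of active time series (run as an instant query)
topk(10, count by (__name__)({__name__=~".+"}))

# Cardinality statistics (top metric names, label pairs, etc.) from the TSDB status API
curl http://localhost:9090/api/v1/status/tsdb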

Steps to Fix High Memory Usage

1. Reduce the Retention Period

One of the first steps to mitigate high memory usage is to reduce the retention period for your metrics. Retention is controlled by the --storage.tsdb.retention.time command-line flag passed to Prometheus at startup. For example, to set the retention period to 15 days, modify your Prometheus startup command as follows:

./prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=15d
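If disk usage rather than age is the main concern, Prometheus also supports size-based retention through the --storage.tsdb.retention.size flag; the 10GB value below is only an illustration:

./prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=15d --storage.tsdb.retention.size=10GB

When both flags are set, whichever limit is reached first triggers cleanup of the oldest data.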

For more details on configuring retention, refer to the Prometheus storage documentation.

2. Optimize Queries

Another approach is to optimize your Prometheus queries to reduce the number of time series being queried. This can be achieved by using more selective label matchers or by aggregating data where possible. For example, instead of querying all instances of a metric, you might aggregate them by a specific label:

sum(rate(http_requests_total[5m])) by (job)

This query aggregates the request rate by job, reducing the number of time series returned.
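If an aggregation like this is evaluated frequently (for example, by dashboards), you can precompute it with a recording rule so that consumers query one small, pre-aggregated set of series instead of the raw metric. The group and rule names below are illustrative; the rule file would be referenced under rule_files in prometheus.yml:

groups:
  - name: http_aggregations          # illustrative group name
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)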

3. Limit the Number of Time Series

To further control memory usage, consider limiting the number of time series that Prometheus tracks. This can be done by carefully designing your metric labels to avoid unnecessary cardinality. Avoid using labels with high cardinality, such as user IDs or request IDs, unless absolutely necessary.
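If an existing exporter already emits a high-cardinality label you do not need, you can drop it at scrape time with metric_relabel_configs, and you can cap how many samples a single scrape may ingest with sample_limit. The job name, target, and label below are placeholders for illustration; note that dropping a label is only safe if the remaining labels still uniquely identify each series:

scrape_configs:
  - job_name: my-app                 # placeholder job name
    sample_limit: 5000               # the whole scrape fails if it would ingest more than 5000 samples
    static_configs:
      - targets: ['app-host:8080']   # placeholder target
    metric_relabel_configs:
      - action: labeldrop            # drop the matching label before ingestion
        regex: request_id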

For more guidance on managing time series, check out the Prometheus label best practices.

Conclusion

By understanding the causes of high memory usage in Prometheus and implementing these strategies, you can effectively manage and reduce memory consumption. This will help ensure that your Prometheus server runs efficiently and continues to provide valuable monitoring and alerting capabilities for your systems.
