Envoy Cluster Outlier Detection

Envoy has detected an outlier in the cluster and is ejecting it.

Understanding Envoy and Its Purpose

Envoy is a high-performance open-source edge and service proxy designed for cloud-native applications. It is used to manage network traffic in microservices architectures, providing features like load balancing, service discovery, and observability. Envoy is often deployed as a sidecar proxy in service mesh architectures, such as Istio, to enhance the reliability and security of service-to-service communication.

Identifying the Symptom: Cluster Outlier Detection

One of the common issues encountered in Envoy is the detection of outliers within a cluster. This symptom manifests when Envoy identifies a particular instance in a service cluster that is not performing as expected and subsequently ejects it from the load balancing pool. This behavior is part of Envoy's outlier detection mechanism, which aims to maintain the overall health and performance of the service.

What You Might Observe

When outlier detection occurs, you may notice increased latency, error rates, or a reduction in available instances for a particular service. Logs may show messages indicating that an instance has been ejected due to failing health checks or exceeding error thresholds.

Explaining the Issue: Outlier Detection Mechanism

Envoy's outlier detection is a feature that automatically identifies and ejects unhealthy instances from the load balancing pool. This is done to prevent these instances from affecting the overall performance and reliability of the service. The detection is based on configurable thresholds for errors and latency, and once an instance exceeds these thresholds, it is temporarily removed from the pool.

Root Cause Analysis

The root cause of an instance being ejected can vary. It could be due to network issues, resource exhaustion, application-level errors, or misconfigurations. Understanding the specific reason requires examining logs and metrics for the affected instance.

Steps to Fix the Issue

To resolve the issue of cluster outlier detection, follow these steps:

1. Investigate the Ejected Instance

Start by examining the logs and metrics of the ejected instance. Look for patterns or anomalies that could indicate the cause of the issue. Common areas to check include:

  • Network connectivity issues
  • Resource usage (CPU, memory)
  • Application errors or exceptions

2. Adjust Outlier Detection Settings

If the ejection was due to overly sensitive thresholds, consider adjusting the outlier detection settings in Envoy. This can be done by modifying the outlier_detection configuration in your Envoy cluster settings. For example:

{
"outlier_detection": {
"consecutive_5xx": 5,
"interval": "10s",
"base_ejection_time": "30s",
"max_ejection_percent": 50
}
}

Refer to the Envoy documentation for more details on configuring outlier detection.

3. Monitor and Test

After making adjustments, monitor the service to ensure stability. Use tools like Prometheus and Grafana to visualize metrics and confirm that the changes have resolved the issue without introducing new problems.

Conclusion

Envoy's outlier detection is a powerful feature for maintaining service reliability, but it requires careful configuration and monitoring. By understanding the root cause of ejections and adjusting settings appropriately, you can ensure that your services remain healthy and performant.

Never debug

Envoy

manually again

Let Dr. Droid create custom investigation plans for your infrastructure.

Book Demo
Automate Debugging for
Envoy
See how Dr. Droid creates investigation plans for your infrastructure.

MORE ISSUES

Made with ❤️ in Bangalore & San Francisco 🏢

Doctor Droid