Envoy is a high-performance open-source edge and service proxy designed for cloud-native applications. It is used to manage network traffic in microservices architectures, providing features like load balancing, service discovery, and observability. Envoy is often deployed as a sidecar proxy in service mesh architectures, such as Istio, to enhance the reliability and security of service-to-service communication.
A common issue in Envoy deployments is a host being ejected by outlier detection. This happens when Envoy identifies an instance in a service cluster that is not performing as expected and ejects it from the load balancing pool. This behavior is part of Envoy's outlier detection mechanism, which aims to maintain the overall health and performance of the service.
When a host is ejected, you may notice increased latency, higher error rates, or a reduction in available instances for a particular service. Envoy's logs and statistics may indicate that an instance has been ejected after returning repeated errors or exceeding the configured error thresholds.
Envoy's outlier detection is a form of passive health checking that automatically identifies and ejects unhealthy instances from the load balancing pool. This prevents these instances from affecting the overall performance and reliability of the service. Detection is based on configurable thresholds for observed failures, such as consecutive 5xx responses, consecutive gateway failures, or a success rate that falls well below the rest of the cluster. Once an instance exceeds one of these thresholds, it is temporarily removed from the pool and returned to it after the ejection time elapses.
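The consecutive-error case can be illustrated with a small model. The following is a minimal sketch, not Envoy's implementation: `HostState` and `record_response` are hypothetical names, and the model tracks only a consecutive 5xx streak with an assumed threshold of 5.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Per-host bookkeeping for a simplified consecutive-5xx detector."""
    consecutive_5xx: int = 0
    ejected: bool = False


def record_response(host: HostState, status: int, threshold: int = 5) -> bool:
    """Update the host's streak; return True if this response triggers ejection."""
    if host.ejected:
        return False  # in this model, ejected hosts receive no further traffic
    if 500 <= status <= 599:
        host.consecutive_5xx += 1
    else:
        host.consecutive_5xx = 0  # any non-5xx response resets the streak
    if host.consecutive_5xx >= threshold:
        host.ejected = True
        return True
    return False


# Hypothetical response sequence: the 200 in the middle resets the streak,
# so ejection only happens after five 5xx responses in a row.
host = HostState()
statuses = [503, 503, 200, 503, 503, 503, 503, 503]
events = [record_response(host, s) for s in statuses]
```

In this sketch only the final response triggers an ejection, which is why a host that fails intermittently may never be ejected by a consecutive-error threshold alone.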
The root cause of an instance being ejected can vary. It could be due to network issues, resource exhaustion, application-level errors, or misconfigurations. Understanding the specific reason requires examining logs and metrics for the affected instance.
To resolve the issue of cluster outlier detection, follow these steps:
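Start by examining the logs and metrics of the ejected instance. Look for patterns or anomalies that could indicate the cause of the issue. Common areas to check include network connectivity, resource usage such as CPU and memory, application error logs, and recent configuration changes.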
If the ejection was due to overly sensitive thresholds, consider adjusting the outlier detection settings in Envoy. This can be done by modifying the outlier_detection configuration in your Envoy cluster settings. For example:
{
  "outlier_detection": {
    "consecutive_5xx": 5,
    "interval": "10s",
    "base_ejection_time": "30s",
    "max_ejection_percent": 50
  }
}
Refer to the Envoy documentation for more details on configuring outlier detection.
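To reason about how long a host stays out of rotation with settings like these, it helps to work through the arithmetic. The sketch below assumes the documented behavior that the ejection time is base_ejection_time multiplied by the number of times the host has been ejected in a row, capped by max_ejection_time. The 300-second cap used here is an assumption based on the documented default, and `ejection_seconds` is a hypothetical helper, not an Envoy API.

```python
def ejection_seconds(
    base_ejection_time: float,
    times_ejected: int,
    max_ejection_time: float = 300.0,  # assumed default cap; verify for your Envoy version
) -> float:
    """Return how long a host stays ejected on its Nth consecutive ejection."""
    if times_ejected < 1:
        raise ValueError("times_ejected must be at least 1")
    return min(base_ejection_time * times_ejected, max_ejection_time)


# With base_ejection_time = 30s, repeated ejections back off linearly.
schedule = [ejection_seconds(30.0, n) for n in range(1, 5)]

# A host ejected many times in a row hits the cap instead of growing forever.
capped = ejection_seconds(30.0, 12)
```

Under these assumptions the first four ejections last 30, 60, 90, and 120 seconds, so a host that keeps failing is kept out of the pool for progressively longer periods.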
After making adjustments, monitor the service to ensure stability. Use tools like Prometheus and Grafana to visualize metrics and confirm that the changes have resolved the issue without introducing new problems.
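If a dashboard is not yet available, the same counters can be inspected from the plain-text output of Envoy's admin `/stats` endpoint. The following is a minimal sketch that parses such output and picks out outlier detection counters. The sample text, the cluster name `backend`, and the helper `parse_outlier_stats` are hypothetical, and fetching the text from your admin listener is left out because the address depends on your configuration.

```python
from typing import Dict


def parse_outlier_stats(stats_text: str) -> Dict[str, int]:
    """Extract outlier detection counters from 'name: value' stats lines."""
    result: Dict[str, int] = {}
    for line in stats_text.splitlines():
        name, sep, value = line.partition(": ")
        value = value.strip()
        # Keep only integer-valued outlier detection stats; skip everything else.
        if sep and ".outlier_detection." in name and value.isdigit():
            result[name] = int(value)
    return result


# Hypothetical sample in the admin endpoint's text format.
SAMPLE = """\
cluster.backend.outlier_detection.ejections_active: 1
cluster.backend.outlier_detection.ejections_enforced_total: 3
cluster.backend.upstream_rq_total: 1200
"""

stats = parse_outlier_stats(SAMPLE)

# Clusters that currently have at least one ejected host.
active = {k: v for k, v in stats.items() if k.endswith("ejections_active") and v > 0}
```

A non-zero `ejections_active` value after a configuration change suggests that hosts are still being ejected and the underlying cause may not be resolved.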
Envoy's outlier detection is a powerful feature for maintaining service reliability, but it requires careful configuration and monitoring. By understanding the root cause of ejections and adjusting settings appropriately, you can ensure that your services remain healthy and performant.