Watch AI Investigation by Doctor Droid on 22nd October.

Checklist for Service Health Tracking

·

3 min read

A detailed checklist for engineering teams to track service health

Cover Image for Checklist for Service Health Tracking

Myth: Health of Service = Golden Signals.

Fact: It's not that simple.

Introduction

In an era where technology forms the core of both enterprise operations and consumer experiences, ensuring the robustness of these technological systems is not just beneficial — it's essential.

A robust health check mechanism gives user the full control on service reliability, performance, and user satisfaction. But what constitutes an ideal service health check? This blog post dives deeper into the critical components of an effective health monitoring system.

Golden Signals of Service

The simplest start of service monitoring are the golden signals: latency, traffic, errors, and saturation. These metrics offer a basic view of service performance and health. Latency measures response times (or "slowness"), traffic gauges demand (or "users"), errors track failures (or "Unexpected experiences for users"), and saturation assesses resource usage. Together, they provide a real-time snapshot of service health, enabling timely interventions when needed. You can also read this detailed blog on how to investigate APIs in production.

Upstream and Downstream Service Health

Architecture of microservices

No service is an island. Within an application, services depend on each other in complex ways. Monitoring the health of upstream (dependencies) and downstream (dependents) services is crucial. This interconnected monitoring ensures that issues in related services are identified and addressed before they impact your service or end-users.

Connected 3rd Party Integrations

An average tech startup can have anywhere between 5-50 tech providers deeply integrated within their production depending on use-cases. These third-party integrations enhance services with external functionalities but also introduce potential points of failure. It's the responsibility of an engineering team to ensure that they have enough failsafe / backup mechanisms in place that any degradation in their performance doesn't compromise the overall service quality. Here's an interesting tool to track third party integrations.

Business Metrics

While technical metrics are crucial, aligning them with business outcomes ensures that the service delivers value. Monitoring key business metrics as part of service health checks bridges the gap between technical performance and business objectives, ensuring that the service supports the broader goals of the organization.

Underlying Infrastructure Health

There are too many concepts in K8s to cover in a blog or even a book!

Image reference: Blog

Be it k8s, containers or VMs -- tracking the health status of these infrastructure "orchestrators" is critical. Often, monitoring the health of this infrastructure is as vital as monitoring the service itself. Not only that, just defining how you monitor k8s health itself is a deep enough topic that it could need a book, not a passage!

Microservices architecture: two months in, looking for feedback :  r/SpringBoot

Image Reference: Reddit post.

Look at this sample reference architecture from a reddit post. There are more infrastructure components than services -- mongoDB, PostgreSQL, RabbitMQ, Zipkin, et. al.

The health of these components directly impacts the performance and reliability of the service. Implementing targeted health checks for these components is critical, more so when these are self-hosted instead of being managed by Cloud Providers.

Conclusion

An ideal service health check is a comprehensive, multi-faceted approach that goes beyond surface-level metrics. It encompasses the golden signals, the health of connected services and components, third-party integrations, and underlying infrastructure, all while keeping an eye on the ultimate goal: delivering business value.

Doctor Droid:

Doctor Droid enables codification of monitoring logics that only you understand, so that you can share investigation strategies with team members with ease and make troubleshooting stress-free for both you & the on-call engineer! Check out here -- we also have a template for Service Health Monitoring.