The Problem of Plenty in Observability
Discussing how the problems in monitoring & observability have evolved in the industry
About five years ago, it was common to hear teams say they did not have enough telemetry data to debug a specific issue, which delayed their investigations.
Today, the tables have turned and it's often the other way around. Most teams have over-invested in collecting metrics, logs, traces and other telemetry data, yet debugging and investigation remain relatively hard.
Reasons for accelerated adoption of observability:
Over the last decade, the adoption of monitoring & observability data collection practices has grown significantly. I'd say this has been primarily because of three reasons:
Business requirements: The pace of development has increased as most industries move towards a real-time technology world where downtime is close to unacceptable. Even industries that were traditionally less sensitive to downtime now depend more heavily on the cloud, for both customer-facing and internal usage.
Commercial Vendors: Companies like Datadog & New Relic made it faster for companies to get their telemetry set up. With auto-instrumentation and out-of-the-box integrations & dashboards, these vendors cleared the path to visibility. Additionally, the term "Observability" was popularised, emphasising the need to have data ready for solving issues that are unknown and unanticipated.
Easier Accessibility: With the adoption of observability came the challenge of bloating costs. Logs & metrics grew into a significant chunk of cloud spend, even reaching up to 25%-30% of it in some enterprises. This drove the community towards building an open source ecosystem of tooling that is free of vendor lock-in and helps teams manage costs. Some of the most critical projects in this regard have been OpenTelemetry, Grafana (by Grafana Labs), Prometheus and more.
Is having data enough?
Doing observability right is often not just about generating enough data. There are a few other areas where you need to invest time and energy to give your team powerful tooling and debugging capabilities at a fair cost:
Discoverability: Spending time to design dashboards with the right set of metrics and relevant data is as important as having the data.
Developer Experience: If you want your developers to be productive and fix on-call issues fast, having a good developer experience is essential. Some anti-patterns that I have noticed over the past couple of years:
Making data hard to access: If you need to ssh into a jump server, then into a machine, and then run a bunch of kubectl commands to fetch container logs, developers might put off checking those logs until they have checked everything else.
Making data accessible but not usable: Disabling log search by string and only allowing timeseries or a specific kv pair might be cost efficient, but it becomes a bottleneck if your developers often need to search those logs.
Data silos: Having telemetry data access restricted can significantly slow down the investigation process as multiple people will need to get involved to even move the needle on a task.
Docs silos: Keeping your team's docs closed off from other teams slows everyone down. If the docs are accessible to members of other teams, they might be able to do a first level of debugging without even needing help from a member of your engineering team.
Starting Points & Playbooks: When an engineer is on-call and trying to fix an issue, having infinite data is more of a con than a pro. How does one actually start at the right point so that the incident can be resolved at the earliest? Senior engineers still need to make sure they create a good way to guide on-call developers. Automated playbooks for common issues, along with the actions taken during investigation, can make it easier for on-call engineers to investigate & fix issues without escalations.
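To make the idea of an automated playbook concrete, here is a minimal Python sketch. It is only an illustration, not how any particular product implements playbooks: the Step and Playbook classes, the alert name and the three stubbed checks are all hypothetical, and in a real setup each check would query your own metrics, deploy or database tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One investigation step: a check plus a hint for the on-call engineer."""
    name: str
    check: Callable[[], bool]  # returns True when this area looks healthy
    hint: str                  # where to look next when the check fails


@dataclass
class Playbook:
    """An ordered list of steps to run when a given alert fires (hypothetical)."""
    alert: str
    steps: List[Step] = field(default_factory=list)

    def run(self) -> List[str]:
        findings = []
        for step in self.steps:
            try:
                healthy = step.check()
            except Exception as exc:
                # A check that cannot run is itself worth reporting.
                findings.append(f"{step.name}: check failed to run ({exc})")
                continue
            if not healthy:
                findings.append(f"{step.name}: {step.hint}")
        return findings


# Stubbed checks standing in for real queries (assumptions for illustration).
def error_rate_ok() -> bool:
    return False


def recent_deploy_ok() -> bool:
    return True


def db_connections_ok() -> bool:
    raise RuntimeError("metrics backend unreachable")


playbook = Playbook(
    alert="checkout-5xx-spike",
    steps=[
        Step("error rate", error_rate_ok, "5xx rate above threshold; open the service dashboard"),
        Step("recent deploy", recent_deploy_ok, "a deploy landed recently; consider a rollback"),
        Step("db connections", db_connections_ok, "connection pool near its limit; check slow queries"),
    ],
)

findings = playbook.run()
for line in findings:
    print(line)
```

The point of the structure is that the senior engineer encodes the starting points once, and the on-call engineer gets a short list of findings instead of an open-ended pile of dashboards.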
Doctor Droid PlayBooks:
Automate investigation & remediation during production incidents by leveraging Doctor Droid. This helps significantly reduce the time it takes to investigate and fix issues.