3 Upcoming Trends in Platform Engineering Teams
What are the best SREs, platform teams & observability teams doing differently today?
The practices of engineering at scale have evolved significantly over the past decade. There's been a gradual transition from the times of SysAdmins to DevOps to today's emphasis on platform engineering. There are multiple reasons for this transition that I will cover in another article about the evolution of Platform Engineering.
Due to the nature of our work at Doctor Droid, I collaborate closely with platform teams at enterprises & startups. In this article, I'm covering a few trends that I frequently observe in platform engineering teams:
Trend #1: Moving to Grafana + Loki for Logs from their existing monitoring stack.
Despite the increasing ease of setting up observability & monitoring, cloud observability tools have not become any cheaper. As a result, more teams are moving towards the popular open source LGTM stack (Loki, Grafana, Tempo, Mimir).
Yes, OSS is cheaper and easier to manage, but you need to know this:
Loki indexes only labels and timestamps, not the content of the log lines. You can still search for a string, but that search is not optimised, so switching to Loki is fine only if custom string search is a rare scenario for your team. Otherwise, you might end up significantly degrading Developer Experience (Dx).
Loki works great if your team depends more on metrics for monitoring / debugging than logs -- so make sure to evaluate your metrics practices before deciding to jump on the Loki bandwagon. :D
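To make the label-versus-content trade-off concrete, here is a toy in-memory model in Python. It is only a sketch of the idea and not how Loki is implemented: `ToyLogStore`, `select` and `grep` are hypothetical names invented for illustration. Selecting by label goes through an index, while a string search has to read every line in the selected streams, which is roughly what a LogQL query such as `{app="checkout"} |= "timeout"` asks for.

```python
from collections import defaultdict


class ToyLogStore:
    """Toy model: labels are indexed, log line content is not."""

    def __init__(self):
        # Index: (label, value) pair -> set of streams carrying that pair.
        self.index = defaultdict(set)
        # Each stream (a frozen label set) holds its raw, unindexed lines.
        self.streams = defaultdict(list)

    def push(self, labels, line):
        stream = frozenset(labels.items())
        self.streams[stream].append(line)
        for pair in stream:
            self.index[pair].add(stream)

    def select(self, labels):
        """Cheap step: resolve matching streams through the label index."""
        sets = [self.index.get(pair, set()) for pair in labels.items()]
        return set.intersection(*sets) if sets else set(self.streams)

    def grep(self, labels, needle):
        """Costly step: every line in the selected streams is scanned."""
        scanned, hits = 0, []
        for stream in self.select(labels):
            for line in self.streams[stream]:
                scanned += 1
                if needle in line:
                    hits.append(line)
        return hits, scanned


store = ToyLogStore()
store.push({"app": "checkout", "env": "prod"}, "payment timeout for order 42")
store.push({"app": "checkout", "env": "prod"}, "order 43 completed")
store.push({"app": "search", "env": "prod"}, "query timeout")

# Narrow label selector: only the checkout stream is scanned.
hits, scanned = store.grep({"app": "checkout"}, "timeout")
print(hits, scanned)  # ['payment timeout for order 42'] 2

# No label selector: every stored line has to be scanned.
all_hits, all_scanned = store.grep({}, "timeout")
print(len(all_hits), all_scanned)  # 2 3
```

The narrower the label selector, the fewer lines the string filter has to scan, which is why teams that search free text across many services tend to feel the difference most.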
Trend #2: Observability has gotten easier; alerting has become noisier.
With instrumentation getting easier and the cost of storing telemetry data falling, teams today are storing more data than ever.
Teams often orchestrate alerts on metrics through codified scripts. This causes a bloat of alerts and makes it harder for teams to tell a useful alert from noise.
Some measures that I've seen teams take to manage this growing noise:
Set up analytics on alerts to identify noisy alerts and discard them
Make the on-call engineer responsible for turning off or reporting every noisy alert at the end of their rotation
Hold bi-weekly engineering meetings to discuss the alerts from the last 2 weeks: mark the actionability of each one and create Action Items to either fix the root cause of the alert (if it was a concerning alert) or remove it / change its threshold (if it was not useful). Here's an example of a dashboard that can be used in bi-weekly reports.
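The "analytics on alerts" measure can be sketched in a few lines of Python. This is a minimal illustration under assumptions of my own: the `AlertEvent` shape, the `actioned` flag and the thresholds (`min_fires`, `max_action_rate`) are hypothetical, and in practice the events would come from your alerting or on-call tool's export.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AlertEvent:
    name: str       # alert rule name
    actioned: bool  # did the on-call engineer have to do anything?


def noisy_alerts(events, min_fires=5, max_action_rate=0.2):
    """Return (name, fire_count, action_rate) for alerts that fire often
    but rarely lead to action, most frequent first."""
    fires = Counter(e.name for e in events)
    actions = Counter(e.name for e in events if e.actioned)
    report = []
    for name, count in fires.items():
        rate = actions[name] / count
        if count >= min_fires and rate <= max_action_rate:
            report.append((name, count, rate))
    return sorted(report, key=lambda row: (-row[1], row[0]))


# Hypothetical two-week sample of alert events.
events = (
    [AlertEvent("HighCPU", actioned=(i == 0)) for i in range(10)]
    + [AlertEvent("DiskFull", actioned=(i < 5)) for i in range(6)]
    + [AlertEvent("FlakyProbe", actioned=False) for _ in range(2)]
)

print(noisy_alerts(events))  # [('HighCPU', 10, 0.1)]
```

Here `HighCPU` fired ten times and needed action once, so it is a candidate for removal or a threshold change. `DiskFull` is mostly actionable, and `FlakyProbe` has not fired often enough to judge.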
Trend #3: Focus on Developer Experience.
Until a few years ago, developer experience was rarely treated as a problem worth solving.
Today, things have changed:
What are different teams doing to improve developer experience?
1. Internal Developer Portal Implementations: A single layer to access information about all things technical, from the list of services to documentation repos.
2. Self-service cataloging: Need a VM for some dev testing? Sure, go ahead and do it yourself by filling in this form. Need to spin up a new service? Sure. All the boilerplate is already done. Read this blog discussing the most popular service catalogue tools.
3. Automation instead of manual processes: From GitHub Actions that block users from adding new services to Production if they lack Prometheus metrics, to blocking PRs/MRs with poor logging practices, teams at scale are trying to make these checks part of GitOps-oriented processes. If you want to dive deeper, check out this blog on how Palantir implemented GitOps internally.
4. Data Driven Enforcements: Tool usage patterns, infrastructure & observability cost estimates, etc. Platform teams are gamifying this information for developers by making it democratically available and letting them figure out how to bring it within the org's budget, instead of using a carrot-and-stick approach. (I know a few companies that send team-level notifications if a team's "logging budget" is going to be exhausted before the month ends. It's quite cool how easy this makes it for others to stay informed and take quick decisions.)
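The "logging budget" notification from point 4 could look something like the following Python sketch. It assumes a simple linear projection of the month's usage so far. The function names, team names and GB figures are hypothetical, and a real setup would pull usage from the billing or ingestion metrics of your logging tool.

```python
import calendar
import math
from datetime import date


def projected_month_end_usage(used_gb, today):
    """Linear projection: assume the daily average so far continues."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return used_gb / today.day * days_in_month


def budget_warning(team, used_gb, budget_gb, today):
    """Return a warning message, or None if the team is on track."""
    if used_gb >= budget_gb:
        return f"{team}: logging budget of {budget_gb} GB is already exhausted"
    projected = projected_month_end_usage(used_gb, today)
    if projected <= budget_gb:
        return None
    # Day of the month on which usage is expected to cross the budget.
    daily_gb = used_gb / today.day
    exhaust_day = math.ceil(budget_gb / daily_gb)
    return (
        f"{team}: logging budget of {budget_gb} GB is projected to run out "
        f"around day {exhaust_day} (about {projected:.0f} GB by month end)"
    )


today = date(2024, 6, 10)
print(budget_warning("payments", used_gb=400, budget_gb=900, today=today))
print(budget_warning("search", used_gb=100, budget_gb=900, today=today))
```

In this example the payments team is averaging 40 GB a day, so it gets a warning well before the budget runs out, while the search team gets `None` and no notification is sent.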
Conclusion:
It will be exciting to watch platform teams evolve over the next 5 years, be it in DevOps, observability or DevEx. If you work in a platform team, I'd love to hear about the projects your team is working on.
At Doctor Droid, we are building tools for improving observability, monitoring and on-call tasks. If you're spending time on similar problems, don't forget to check out our website or GitHub repository!