Notes by Doctor Droid

3 Upcoming Trends in Platform Engineering Teams



What are the best SREs, platform teams & observability teams innovating on today?


The practices of engineering at scale have evolved significantly over the past decade. There's been a gradual transition from the times of SysAdmins to DevOps to today's emphasis on platform engineering. There are multiple reasons for this transition that I will cover in another article about the evolution of Platform Engineering.

Due to the nature of our work at Doctor Droid, I collaborate closely with platform teams at enterprises & startups. In this article, I'm covering a few trends that I frequently observe in platform engineering teams:

Trend #1: Moving logs from the existing monitoring stack to Grafana + Loki.

Despite the increasing ease of setting up observability & monitoring, cloud observability tools have not become any cheaper. As a result, more teams are moving towards the popular open-source LGTM stack (Loki, Grafana, Tempo, Mimir).

Read my LinkedIn post on how Coinbase spent $65M on Datadog

Yes, OSS is cheaper and easier to manage, but you need to know this:

  1. Loki only indexes labels and timestamps, not the contents of log lines. You can still search for an arbitrary string, but that search is not optimised, so switching to Loki is fine if custom string search is a rare scenario for your team. Otherwise, you might end up significantly degrading Developer Experience (Dx).

  2. Loki works great if your team depends more on metrics for monitoring / debugging than logs -- so make sure to evaluate your metrics practices before deciding to jump on the Loki bandwagon. :D
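To make point 1 concrete, here is a toy model of the trade-off. It is not Loki's actual implementation: `ToyLogStore` and its methods are hypothetical names, and the only assumption carried over from above is that labels are indexed while line contents are not. A label selector skips whole streams without reading them, while a string filter has to read every line in every selected stream (in LogQL terms, roughly the difference between the `{app="checkout"}` part and the `|= "timeout"` part of a query).

```python
from collections import defaultdict


class ToyLogStore:
    """Toy model: labels are indexed, log line contents are not."""

    def __init__(self):
        # Maps a sorted tuple of (label, value) pairs to that stream's lines.
        self.streams = defaultdict(list)
        self.lines_scanned = 0

    def push(self, labels, line):
        self.streams[tuple(sorted(labels.items()))].append(line)

    def query(self, selector, contains=None):
        """Select streams by label, then optionally filter lines by substring."""
        results = []
        for key, lines in self.streams.items():
            stream_labels = dict(key)
            if any(stream_labels.get(k) != v for k, v in selector.items()):
                continue  # stream skipped by label match; its lines are never read
            for line in lines:
                self.lines_scanned += 1  # every line in a selected stream is read
                if contains is None or contains in line:
                    results.append(line)
        return results


store = ToyLogStore()
for i in range(1000):
    store.push({"app": "checkout", "env": "prod"}, f"order {i} processed")
    store.push({"app": "search", "env": "prod"}, f"query {i} served")
store.push({"app": "checkout", "env": "prod"}, "payment timeout for order 42")

# Narrow label selector: only the checkout stream is scanned.
narrow = store.query({"app": "checkout"}, contains="timeout")
narrow_cost = store.lines_scanned

# Broad label selector: the same string search now scans both streams.
store.lines_scanned = 0
broad = store.query({"env": "prod"}, contains="timeout")
broad_cost = store.lines_scanned

print(narrow_cost, broad_cost)
```

Both queries return the same single line, but the broad one reads roughly twice as many lines to find it. The broader the label selector and the longer the time range, the more a string search costs, which is why teams that lean heavily on free-text search tend to feel the switch the most.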

Trend #2: Observability has gotten easier ➜ alerting has become noisier.

With instrumentation getting easier and the cost of telemetry data storage falling, teams today are storing more data than ever.

Teams often orchestrate alerts on metrics through codified scripts -- this is causing a bloat of alerts and making it harder for teams to differentiate noise from a useful alert.

Some measures that I've seen teams take to manage this growing noise:

  1. Set up analytics on alerts to identify noisy alerts and discard them

  2. Make the on-call engineer responsible for turning off / reporting every noisy alert at the end of their rotation

  3. Bi-weekly engineering meetings to discuss the alerts from the last 2 weeks -- mark actionability against each of them and create Action Items to either address the root cause of the alert (if it was a concerning alert) or remove it / change its threshold (if it was not useful). Here's an example of a dashboard that can be used in bi-weekly reports.
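As a rough illustration of the first measure, the sketch below flags alert rules that fire often but rarely need action. Everything in it is an assumption made for the example: the `AlertEvent` record, the `noisy_rules` helper, and the thresholds (at least 5 fires, at most 20% actionable) are hypothetical and would need tuning against your own alert history.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AlertEvent:
    rule: str         # name of the alert rule that fired
    actionable: bool  # did the on-call engineer actually need to act?


def noisy_rules(events, min_fires=5, max_actionable_ratio=0.2):
    """Return (rule, fire_count, actionable_ratio) for rules that look noisy."""
    fires = Counter(e.rule for e in events)
    acted = Counter(e.rule for e in events if e.actionable)
    report = []
    for rule, count in fires.items():
        ratio = acted[rule] / count
        # Noisy = fires often AND is rarely worth acting on.
        if count >= min_fires and ratio <= max_actionable_ratio:
            report.append((rule, count, ratio))
    # Most frequent offenders first.
    return sorted(report, key=lambda r: r[1], reverse=True)


# Hypothetical two weeks of alert history.
events = (
    [AlertEvent("HighCPU", False)] * 18
    + [AlertEvent("HighCPU", True)] * 2
    + [AlertEvent("DiskAlmostFull", True)] * 4
    + [AlertEvent("PodRestart", False)] * 7
)

report = noisy_rules(events)
for rule, count, ratio in report:
    print(f"{rule}: fired {count} times, {ratio:.0%} actionable")
```

The same per-rule counts can feed the bi-weekly review: rules in the report are candidates for removal or a threshold change, while rules with a high actionable ratio point at root causes worth fixing.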

Trend #3: Focus on Developer Experience.

Developer experience was hardly solved for, until a few years ago.

Omar Qazi on LinkedIn: #memes

Today, things have changed:

What are different teams doing to improve developer experience?

1. Internal Developer Portal Implementations: A single layer to access information about all things technical, from the list of services to documentation repos.

2. Self-service cataloging: Need a VM for some dev testing? Sure, go ahead and do it yourself by filling in this form. Need to spin up a new service? Sure. All the boilerplate is done. Read this blog discussing the most popular service catalogue tools.

3. Automation instead of manual processes: From GitHub Actions that block users from adding new services to Production if they lack Prometheus metrics, to blocking PRs/MRs with poor logging practices, teams at scale are trying to make these checks part of GitOps-oriented processes. If you want to dive deeper, check out this blog on how Palantir implemented GitOps internally.

4. Data Driven Enforcements: Tool usage patterns, infrastructure & observability cost estimates, etc. -- Platform teams are gamifying this information for developers by making it openly available and letting them figure out how to stay within the org's budget, instead of using a carrot-and-stick approach. (I know a few companies that have team-level notifications if their "logging budget" is going to be exhausted before the end of the month -- it's quite cool how easy it makes it for others to stay informed and take quick decisions.)
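The "logging budget" notification in point 4 comes down to simple arithmetic. The sketch below assumes a linear projection: usage so far this month continues at the same daily rate. The function names, the GB unit, and the message format are all hypothetical; a real setup would pull usage from the billing or ingestion data of whichever tool the team uses.

```python
import calendar
import math
from datetime import date


def projected_month_end_usage(used_gb, today):
    """Linear projection: assume the month-to-date daily rate continues."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_rate = used_gb / today.day
    return daily_rate * days_in_month


def budget_alert(team, used_gb, budget_gb, today):
    """Return a warning message if the budget will run out early, else None."""
    projected = projected_month_end_usage(used_gb, today)
    if projected <= budget_gb:
        return None
    daily_rate = used_gb / today.day
    # Day of the month on which cumulative usage reaches the budget.
    exhaust_day = math.ceil(budget_gb / daily_rate)
    return (
        f"{team}: logging budget projected to run out around day {exhaust_day} "
        f"of the month (projected {projected:.0f} GB vs budget {budget_gb} GB)"
    )


# Hypothetical teams and numbers, checked on the 10th of a 30-day month.
today = date(2024, 6, 10)
on_track = budget_alert("payments", used_gb=300, budget_gb=1200, today=today)
over = budget_alert("search", used_gb=600, budget_gb=1200, today=today)
print(on_track)
print(over)
```

A team on track gets no notification at all, which keeps the signal useful; a team that is over pace learns early enough to trim log volume before the budget is actually gone.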

Conclusion:

It will be very exciting to watch platform teams evolve over the next 5 years -- be it DevOps, observability or DevEx. If you are working in a platform team, I'd love to hear about the projects your team is working on.

At Doctor Droid, we are building tools for improving observability, monitoring and on-call tasks. If you're spending time on similar problems, don't forget to check out our website or GitHub repository!