How WorkIndia moved closer to their vision of Zero-Touch Production with DrDroid
WorkIndia is one of India's largest job marketplaces, with 28M+ active users.
Problem Context
WorkIndia runs a large footprint of infrastructure and applications. At their scale of traffic, incidents can adversely affect customers. They had set up on-call processes and alerting to handle issues, but they faced several challenges:
- Manual investigations took up to 15-20 minutes for frequent alerts, with the engineer jumping across k8s, ElastAPM, Grafana dashboards, Loki logs and code
- Given their tool sprawl and the breadth of context required, escalations during on-call were frequent and often became a bottleneck in identifying and fixing issues
- Engineers who were not on-call were frequently pulled into production issues
- On-call engineers would get stuck because they lacked deep technical know-how of a specific component (e.g. k8s) or did not know how it correlated with other components of the stack
The Vision
WorkIndia's CTO and tech team were working towards the philosophy of "Zero-Touch Production". They were hands-on with AI, actively using and building agents in their product, and wanted an agentic solution for on-call that would work well with their stack and reduce the burden on their engineers of investigating and debugging production issues.
Trying DrDroid
One of their engineers came across DrDroid, and after reviewing the demo, the team decided to try it. Their evaluation criteria were the following:
- (a) Relevant integrations: ElastAPM, Grafana, k8s, PagerDuty, Loki, GitHub
- (b) Slack-first workflows
- (c) Support for integrations behind a VPC
- (d) Well-defined access management and security
After a short evaluation, they found that DrDroid fit their requirements best.
What did the WorkIndia team achieve?
Using DrDroid, the WorkIndia team is now able to do the following:
- (a) Let new and junior engineers investigate any production alert in minutes, without escalations
- (b) Automatically take action on and auto-resolve domain-specific alerts with prompt-based runbooks
- (c) Manage their daily on-call retrospective via DrDroid to improve alert actionability
They are also looking to do the following:
- Further improve their autonomous detection stack to catch failures and issues in their deployment pipelines before alerts even fire
- Further enhance their operational efficiency by automating actions on more issues
Quotes
One time I woke up at 3am by a pager. I instantly asked DrDroid to investigate it and in a few minutes, I was able to close the issue directly from Slack.
DrDroid works amazing for initial investigation - it gives exact alerting traces that help me understand what's happening quickly. With the time I save on debugging, I can actually focus on implementing long-term fixes instead of just firefighting all the time.