An error budget is a way to define acceptable reliability limits and manage variance within them. It represents the acceptable amount of errors or downtime that a service can experience within a defined period of time. Think of it as a limit on how many errors your software can have before it is considered unreliable and becomes a problem the engineering team has to manage.
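To make the definition concrete, here is a minimal sketch of the arithmetic for a time-based budget. The function names, the 99.9% availability target, and the 30-day window are illustrative assumptions, not values from any particular team's policy.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime allowed within `window` for an availability target.

    `slo_target` is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    # The budget is the share of the window the service may be unavailable.
    return window * (1.0 - slo_target)


def budget_remaining(
    slo_target: float, window: timedelta, downtime_so_far: timedelta
) -> timedelta:
    """Budget left after subtracting observed downtime (negative if overspent)."""
    return error_budget(slo_target, window) - downtime_so_far


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement period
    print("Budget:", error_budget(0.999, window))
    print("Remaining:", budget_remaining(0.999, window, timedelta(minutes=10)))
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of allowed downtime. A negative remaining value means the budget for the period has already been spent.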
An error budget helps teams prioritize work so that the most critical problems are dealt with first. It also helps prevent the accumulation of technical debt that can slow down development over time.
Engineering teams create error budget policies to protect customers from repeated SLO misses.
Here’s an example of an error budget policy by Google SRE.