Error budget

An error budget is the maximum amount of unreliability a service can tolerate while still meeting its reliability target. If a service has a 99.9% availability target, the error budget is the remaining 0.1%, or roughly 8.76 hours of downtime per year. The error budget is a deliberate acknowledgment that perfect reliability is unattainable and that approaching it is expensive, so it creates explicit space for the failures that inevitably accompany change and growth.

Why it matters

Traditional approaches to reliability create tension between engineering and operations. Engineers want to ship features quickly; operations wants to minimize change because change causes outages. Without a framework for resolving this tension, organizations either move too slowly (over-prioritizing stability) or break too often (over-prioritizing velocity).

Error budgets resolve this tension by making reliability a resource to be spent, not just a constraint to be satisfied. When the error budget has remaining capacity, teams have room to take risks: deploy more frequently, experiment with new technologies, accept some instability for the sake of progress. When the error budget is exhausted, the priority shifts to stability work until reliability recovers.

This framing transforms reliability from an abstract goal into a concrete, measurable resource that can be managed like any other.

How error budgets work

Calculating the budget

Error budgets are derived from Service Level Objectives (SLOs), the reliability targets a team commits to achieving.

If the SLO is 99.9% availability:

Error budget = 100% - 99.9% = 0.1%

Over a year: 0.1% × 8,760 hours = 8.76 hours of allowed downtime

Over a month (730 hours on average): 0.1% × 730 hours = 0.73 hours, or 43.8 minutes of allowed downtime
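The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration rather than a standard API: the helper name `allowed_downtime` is hypothetical, and the 8,760-hour year and 730-hour month are the same window lengths used in the figures above.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Return the downtime a time-based SLO permits over a window.

    Hypothetical helper for illustration: the budget is the share of the
    window that the SLO does not cover.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    budget_fraction = (100 - slo_percent) / 100
    return window * budget_fraction


# Budgets for a 99.9% availability SLO over a year and over a month.
per_year = allowed_downtime(99.9, timedelta(hours=8760))
per_month = allowed_downtime(99.9, timedelta(hours=730))
print(f"{per_year.total_seconds() / 3600:.2f} hours per year")
print(f"{per_month.total_seconds() / 60:.1f} minutes per month")
```

Returning a `timedelta` keeps the unit explicit, so the same result can be reported in hours or minutes without a separate conversion step.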

The same logic applies to other reliability metrics:

If the SLO is 99.9% of requests under 200ms latency, the error budget allows 0.1% of requests to exceed that threshold

If the SLO is 99.95% successful requests, the error budget allows 0.05% failures
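For request-based SLOs like the two above, the budget is a count of requests rather than a span of time. The sketch below assumes a known request volume for the window; the function name `allowed_bad_requests` and the one-million-request figure are illustrative assumptions, not values from a real service.

```python
from decimal import Decimal


def allowed_bad_requests(slo_percent: str, total_requests: int) -> int:
    """Return how many requests may miss the SLO within the window.

    The SLO is passed as a string (for example "99.95") so Decimal can
    represent it exactly; binary floats would introduce rounding error.
    The result is rounded down, which errs on the side of a smaller budget.
    """
    budget_fraction = (Decimal(100) - Decimal(slo_percent)) / Decimal(100)
    return int(total_requests * budget_fraction)


# Assumed volume: one million requests in the measurement window.
print(allowed_bad_requests("99.9", 1_000_000))   # slow requests tolerated
print(allowed_bad_requests("99.95", 1_000_000))  # failed requests tolerated
```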

Spending the budget

Error budget is consumed whenever the service falls short of the behavior its SLO measures. An outage that causes 30 minutes of downtime consumes 30 minutes of the monthly error budget, which is about 68% of the 43.8 minutes a 99.9% target allows. Latency spikes that affect 0.03% of requests consume 0.03 percentage points of a latency error budget, or 30% of a 0.1% budget.
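Consumption is easiest to compare across services when it is expressed as a share of the budget. The following sketch assumes the spent amount and the budget are in the same unit (minutes of downtime, or percentage points of requests); `consumed_fraction` is a hypothetical helper name.

```python
def consumed_fraction(spent: float, budget: float) -> float:
    """Return the share of the error budget already used (1.0 means exhausted).

    `spent` and `budget` must share a unit, such as minutes of downtime
    or percentage points of requests.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    return spent / budget


# A 30-minute outage against a 43.8-minute monthly budget.
print(f"{consumed_fraction(30, 43.8):.0%} of the availability budget used")
# Latency spikes hitting 0.03% of requests against a 0.1% budget.
print(f"{consumed_fraction(0.03, 0.1):.0%} of the latency budget used")
```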

Tracking error budget consumption shows whether the team is operating within acceptable bounds. A team that consistently exhausts its budget is either too aggressive with changes or has systemic reliability problems. A team that never touches its budget might be too conservative, missing opportunities to ship faster.

Budget policy

The real power of error budgets comes from policies that govern what happens as budget depletes. Common policies include:

Feature freeze when the budget is exhausted. No new features deploy until reliability recovers and budget accumulates. This creates strong incentive to prioritize reliability work.

Change velocity reduction as budget depletes. When 50% of the budget is consumed, deployment frequency might drop. When 75% is consumed, only critical fixes deploy.

Reliability investment triggers when budget depletion exceeds thresholds. Hitting 80% budget consumption might trigger automatic allocation of engineering time to reliability work.

The specific policies vary by organization, but the principle remains: error budget consumption should influence behavior, not just be observed.
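One way to make such a policy enforceable is to encode it as a function of budget consumption. The sketch below uses the example thresholds from this section (50%, 75%, 80%, and 100%); the names `DeployPolicy`, `deploy_policy`, and `needs_reliability_investment` are hypothetical, and a real policy would use whatever thresholds the organization has agreed on.

```python
from enum import Enum


class DeployPolicy(Enum):
    NORMAL = "normal release cadence"
    REDUCED = "reduced deployment frequency"
    CRITICAL_ONLY = "critical fixes only"
    FREEZE = "feature freeze"


def deploy_policy(consumed: float) -> DeployPolicy:
    """Map the consumed share of the budget (0.0 to 1.0+) to a deployment rule.

    Thresholds are the illustrative values from the text, not a standard.
    """
    if consumed >= 1.0:
        return DeployPolicy.FREEZE
    if consumed >= 0.75:
        return DeployPolicy.CRITICAL_ONLY
    if consumed >= 0.50:
        return DeployPolicy.REDUCED
    return DeployPolicy.NORMAL


def needs_reliability_investment(consumed: float, threshold: float = 0.80) -> bool:
    """Return True once consumption reaches the assumed investment trigger."""
    return consumed >= threshold


for share in (0.30, 0.60, 0.85, 1.00):
    print(f"{share:.0%}: {deploy_policy(share).value}, "
          f"invest in reliability: {needs_reliability_investment(share)}")
```

Keeping the deployment rule and the investment trigger as separate functions mirrors the text: they are independent policies that happen to read the same input.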

Error budgets in practice

Setting appropriate targets

Error budget size depends on the SLO, which should reflect actual user needs rather than aspirational perfection.

A 99.99% target (about 52.6 minutes/year of error budget) makes sense for payment processing systems where any failure has immediate financial impact. A 99.9% target (8.76 hours/year) might be appropriate for most web applications. A 99% target (87.6 hours/year) could be acceptable for internal tools or batch processing systems.

Setting targets too high creates error budgets too small to permit meaningful development velocity. Teams can't ship anything without exhausting their budget, so they either ignore the budget or grind to a halt. Setting targets too low creates error budgets so large they provide no useful constraint.

The right target balances user expectations, business impact of failures, and the velocity the team needs to maintain.

Monitoring and reporting

Error budget tracking requires reliable measurement of the underlying SLIs (Service Level Indicators). Without accurate data on availability, latency, or error rates, error budget calculations are meaningless.

Most teams display error budget status prominently: on dashboards, on status pages, and in regular reporting. Visibility keeps reliability top of mind and helps stakeholders understand when velocity constraints are necessary.

Common visualizations include:

Remaining budget as a percentage or absolute time

Budget burn rate (how quickly budget is being consumed)

Projected budget exhaustion date at current burn rate

Historical budget consumption trends
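Burn rate and the projected exhaustion date can both be derived from the consumed share of the budget and the time elapsed in the window. The sketch below uses one common definition of burn rate (share of budget consumed divided by share of window elapsed, so 1.0 means the budget runs out exactly at the end of the window) and a simple linear projection. The function names and the sample dates are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def burn_rate(consumed: float, elapsed: timedelta, window: timedelta) -> float:
    """Share of budget consumed divided by share of the window elapsed."""
    if elapsed <= timedelta(0):
        raise ValueError("elapsed must be positive")
    return consumed / (elapsed / window)


def projected_exhaustion(
    window_start: datetime, now: datetime, consumed: float
) -> Optional[datetime]:
    """Linearly project when consumption reaches 100% at the average pace so far.

    Returns None if nothing has been consumed. If the budget is already
    exhausted, the returned date lies in the past.
    """
    if consumed <= 0:
        return None
    elapsed = now - window_start
    return window_start + elapsed / consumed


# Assumed scenario: half the budget used ten days into a 30-day window.
start = datetime(2024, 1, 1)
now = datetime(2024, 1, 11)
print(burn_rate(0.5, now - start, timedelta(days=30)))
print(projected_exhaustion(start, now, 0.5))
```

A linear projection is deliberately crude. It treats past consumption as a steady pace, so a single large outage early in the window will make the forecast look worse than the underlying trend.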

Organizational alignment

Error budgets work best when the entire organization understands and respects them. Product managers who demand features despite exhausted error budgets undermine the system. Engineers who ignore budget constraints defeat the purpose. Leadership that overrides budget policies teaches teams that the system doesn't matter.

Successful implementation requires buy-in across functions. Product understands that reliability is a feature. Engineering commits to respecting budget constraints. Leadership supports decisions that prioritize reliability when budgets are depleted.

Error budgets and product development

For product teams, error budgets provide clarity about an often-murky trade-off.

Roadmap impact becomes predictable. If the team has ample error budget, aggressive feature delivery is possible. If the budget is depleted, reliability work takes priority. This isn't arbitrary; it's a principled response to measurable conditions.

Risk assessment improves. Launching a major feature might consume error budget due to initial instability. Is that consumption acceptable given current budget status? The framework provides language for this conversation.

Quality investment gains concrete justification. When reliability work competes with features, error budget status provides evidence. "We need to invest in reliability because we've consumed 90% of our error budget" is more compelling than "we should really improve reliability."

Customer impact becomes explicit. Error budget ultimately represents acceptable impact on users. Managing the budget well means managing customer experience within defined bounds.

Common challenges

Gaming the system

Teams might artificially satisfy SLOs through accounting tricks: excluding certain failures, measuring at favorable times, or setting SLOs that are easy to meet. This defeats the purpose. SLOs should reflect genuine user experience, and error budgets should represent real reliability.

Ignoring the budget

When budgets are exceeded and no consequences follow, the system loses credibility. Teams learn that error budgets are aspirational rather than binding, and behavior doesn't change. Enforcement, even if uncomfortable, is essential.

Inappropriate scope

Error budgets work best for services with clear ownership and measurable reliability. Applying them to shared infrastructure, third-party dependencies, or complex multi-team systems requires careful consideration of how budget consumption gets attributed and who controls the response.

Over-indexing on budget

Error budgets are tools, not goals. A team that optimizes purely for budget preservation might avoid valuable experimentation or innovation. The budget exists to enable velocity, not to become the sole focus of engineering effort.

The bigger picture

Error budgets represent a mature approach to reliability, one that acknowledges perfection is neither possible nor desirable. By making unreliability explicit and manageable, error budgets enable organizations to make informed trade-offs between stability and progress.

For product teams, this framework provides a principled way to balance feature velocity against customer experience. When reliability is quantified and tracked, it can be managed alongside other product concerns rather than existing in perpetual tension with them.

Klero complements error budget practices by tracking customer feedback related to reliability issues. When outages or performance problems occur, Klero captures the customer impact, providing qualitative context that enriches quantitative error budget data.
