SLI (Service Level Indicator)

A Service Level Indicator is a carefully chosen metric that quantifies an aspect of service quality. SLIs are the building blocks of service reliability - they provide the measurements that SLOs (Service Level Objectives) target and SLAs (Service Level Agreements) guarantee. Good SLIs capture what actually matters to users: whether the service is available, fast, and correct.

Why it matters

You can't improve what you don't measure, and you can't commit to what you can't track. SLIs replace subjective opinions with concrete data, giving an objective assessment of service quality. They are the foundation for goals, since SLOs and SLAs are defined in terms of SLI measurements, and the basis for alerting, since monitoring systems detect problems by watching them. They also support data-driven reliability decisions, give teams a shared language for discussing service quality, and reveal improvement or degradation over time.

SLI characteristics

User-centric: SLIs measure what users experience. Internal metrics like CPU utilization or queue depth are valuable for debugging but don't directly reflect user experience. User-facing SLIs measure whether the request succeeded, how long it took, whether the response was correct, and whether the user could complete their task.

Measurable: SLIs must be quantifiable, with a specific definition of what is being measured, reliable data collection, a consistent calculation methodology, and reasonable precision.

Actionable: a degrading SLI should prompt a response, with clear investigation paths, available remediation options, and a way to verify that a change helped.

Comparable: SLIs should be consistent over time, using the same measurement methodology and a stable calculation so that historical values remain comparable.

Common SLI types

Availability measures whether the service responds at all. Request success rate is the percentage of requests that return successful responses, calculated as (Successful Requests / Total Requests) × 100%. Uptime is the percentage of time the service is operational, calculated as ((Total Time - Downtime) / Total Time) × 100%.
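The two availability formulas above can be sketched in a few lines of Python. The function names and the choice to report 100% when there is no traffic are assumptions made for illustration, not part of any standard definition.

```python
def success_rate(successful: int, total: int) -> float:
    """Request success rate as a percentage of all requests."""
    if total == 0:
        # Assumption: with no traffic there is nothing to fail, so report 100%.
        return 100.0
    return successful / total * 100.0


def uptime(total_seconds: float, downtime_seconds: float) -> float:
    """Uptime as a percentage of the measurement period."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return (total_seconds - downtime_seconds) / total_seconds * 100.0


# Made-up numbers: 9,990 good requests out of 10,000,
# and 43 minutes of downtime in a 30-day period.
print(success_rate(9_990, 10_000))
print(uptime(30 * 24 * 3600, 43 * 60))
```

The two numbers can disagree: a service can be "up" for the whole period and still fail a meaningful share of requests, which is why request-based availability is often the more user-centric of the two.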

Latency measures how quickly the service responds. Response time percentiles (p50, p90, p95, p99) are more useful than averages: p50 (the median) is the value that half of requests are faster than, while p99 is the value that 99% of requests are faster than, so it exposes the slowest 1%. Average latency is often misleading because it blends fast and slow requests into one number and hides how bad the outliers are.
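A small sketch of why percentiles tell more than the mean. The `percentile` helper below uses the nearest-rank method, which is one of several common percentile definitions; the helper name and the sample latencies are made up for illustration.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 95 fast requests and 5 very slow ones (milliseconds).
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = statistics.mean(latencies_ms)  # 119 ms: describes no real request
p50_ms = percentile(latencies_ms, 50)    # 20 ms: the typical experience
p99_ms = percentile(latencies_ms, 99)    # 2000 ms: the slow tail

print(mean_ms, p50_ms, p99_ms)
```

In this sample the mean suggests a moderately slow service, while the percentiles show what actually happens: most requests are fast and a small group is a hundred times slower.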

Error rate measures how often things go wrong. Error percentage is the proportion of requests returning errors, calculated as (Error Responses / Total Requests) × 100%. Errors might be HTTP 5xx, application errors, or domain-specific failures.
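What counts as an error is a definitional choice. The sketch below assumes HTTP status codes as input and, by default, counts only 5xx responses as errors; the `is_error` predicate is a hypothetical hook for swapping in application or domain-specific failure rules.

```python
from typing import Callable, Iterable


def is_server_error(status: int) -> bool:
    """Default rule (an assumption): only HTTP 5xx responses count as errors."""
    return 500 <= status <= 599


def error_rate(statuses: Iterable[int],
               is_error: Callable[[int], bool] = is_server_error) -> float:
    """Percentage of responses classified as errors."""
    statuses = list(statuses)
    if not statuses:
        raise ValueError("no responses to measure")
    errors = sum(1 for status in statuses if is_error(status))
    return errors / len(statuses) * 100.0


responses = [200, 200, 404, 500]  # made-up sample
print(error_rate(responses))                           # 5xx only
print(error_rate(responses, lambda s: s >= 400))       # 4xx counted as errors too
```

Whichever rule is chosen, it belongs in the written SLI definition so the number is calculated the same way everywhere.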

Throughput measures capacity and processing rate - requests per second (volume of traffic successfully handled) or items processed (for batch systems, units of work completed).

Correctness measures whether responses are accurate, through data quality (percentage of responses with correct data) or consistency (how quickly systems converge on a consistent state).

Freshness measures how current information is for data systems through staleness (age of data when served) or replication lag (delay in synchronizing across systems).
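Staleness can be turned into a percentage-style SLI by counting how many responses served data newer than some limit. A minimal sketch, assuming each response records when its data was last updated; the function names, the 60-second limit, and the sample ages are illustrative.

```python
from datetime import datetime, timedelta, timezone


def staleness(last_updated: datetime, served_at: datetime) -> timedelta:
    """Age of the data at the moment it was served."""
    return served_at - last_updated


def fresh_percentage(ages: list[timedelta], max_age: timedelta) -> float:
    """Percentage of served responses whose data was no older than max_age."""
    if not ages:
        raise ValueError("ages must not be empty")
    fresh = sum(1 for age in ages if age <= max_age)
    return fresh / len(ages) * 100.0


served_at = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
ages = [
    staleness(served_at - timedelta(seconds=s), served_at)
    for s in (5, 20, 45, 300)  # made-up data ages in seconds
]
print(fresh_percentage(ages, max_age=timedelta(seconds=60)))  # 75.0
```

Replication lag can be measured the same way, with the primary's write time in place of `last_updated` and the replica's apply time in place of `served_at`.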

Choosing SLIs

Start with user journeys by mapping critical user flows and identifying what matters at each step. When a user loads a dashboard, latency and availability matter. When a user submits a form, error rate and latency matter. When a user receives a notification, freshness and correctness matter.

Fewer is better because too many SLIs dilute focus. Aim for 3-5 SLIs per service that capture the essential dimensions of quality.

Cover different dimensions by balancing SLIs across availability (does it work?), performance (is it fast?), and quality (is it correct?).

Consider edge cases: SLIs should capture the experience at the edges, not just the typical case. Track tail latency (p99) as well as average latency, specific error types as well as overall success rate, and partial degradation as well as uptime.

Measuring SLIs

Data sources for SLIs include server-side data (request logs and application metrics), client-side instrumentation (real user monitoring), synthetic probes (automated checks run from external points), and infrastructure metrics from load balancers, CDNs, and databases.

Aggregation makes raw measurements meaningful by grouping them by time window (per minute, hourly, daily), geographic region, customer segment, or request type. Different aggregations reveal different patterns.

Sampling may be necessary for high-volume services. It requires a consistent sampling methodology, statistical validity, and representative coverage.
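One way to keep sampling consistent is to decide from a hash of a stable request identifier instead of a random number, so every component that sees the same request makes the same decision. This is a sketch of that idea, not a prescribed method; `keep_sample` and the `req-N` identifiers are hypothetical.

```python
import hashlib


def keep_sample(request_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` (0.0 to 1.0) of requests."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


kept = sum(keep_sample(f"req-{i}", 0.1) for i in range(10_000))
print(f"kept {kept} of 10000 requests")  # close to 1000, not exactly 1000
```

Because the decision depends only on the identifier, raising the rate later keeps every previously sampled request in the sample, which helps historical comparability. Whether the sample is representative still has to be checked separately, for example by comparing sampled and unsampled totals per region or request type.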

Storage and retention need to provide enough history for trend analysis, appropriate granularity, and accessible query interfaces.

SLIs and SLOs

SLIs are what you measure; SLOs are what you target. For example, if the SLI is request latency, the SLO might be "99% of requests complete in under 200 ms" (equivalently, p99 latency < 200 ms). Good SLOs set targets just below what you can consistently achieve, providing early warning before user experience degrades significantly.
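The split between measuring and targeting can be sketched as two small functions: one computes the SLI, the other compares it to the SLO. The 200 ms threshold comes from the example above; the 0.99 target, the function names, and the sample data are assumptions for illustration.

```python
def latency_sli(latencies_ms, threshold_ms=200.0):
    """SLI: fraction of requests that completed faster than threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests to measure")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(sli_value, target=0.99):
    """SLO: the measured SLI must be at or above the target fraction."""
    return sli_value >= target


# Made-up window of 1,000 requests: 995 fast, 5 slow.
window = [120.0] * 995 + [450.0] * 5
sli = latency_sli(window)
print(sli, meets_slo(sli))
```

Keeping the two apart means the target can be tightened or relaxed without touching how the measurement is taken.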

SLIs in practice

Dashboard design should show current values with context (good/warning/bad), trend over time, error budget consumption, and comparison to targets.

Alerting strategy should alert on SLI degradation through burn rate alerts when error budget depletes too quickly, threshold alerts for acute problems, and trend alerts for gradual degradation.
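Burn rate is commonly defined as the observed error rate divided by the error rate the SLO allows: a burn rate of 1 spends the error budget exactly over the SLO period, and higher values spend it proportionally faster. A minimal sketch, in which the function names and the paging threshold of 10 are illustrative assumptions, not recommended values:

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_error_rate = 1.0 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    if total == 0:
        return 0.0  # Assumption: no traffic burns no budget.
    return (bad / total) / allowed_error_rate


def should_page(bad: int, total: int, slo_target: float,
                threshold: float = 10.0) -> bool:
    """Alert when the budget is burning at least `threshold` times too fast."""
    return burn_rate(bad, total, slo_target) >= threshold


# Made-up window: 20 failed requests out of 1,000 against a 99.9% SLO.
print(burn_rate(20, 1_000, 0.999))    # roughly 20x the sustainable rate
print(should_page(20, 1_000, 0.999))
```

In practice the same calculation is usually run over more than one window length, so that short spikes and slow leaks are both caught.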

Incident response uses SLIs to determine which SLIs are affected, what the user impact is, and whether the fix restored SLIs to normal.

Post-mortems use SLI analysis to establish what actually happened from the user perspective, how quickly it was detected and resolved, and what would improve detection next time.

SLI pitfalls

Vanity metrics measure what's easy rather than what matters. Over-aggregation averages away problems that affect subsets of users. Internal focus measures server health rather than user experience. Excess complexity means too many SLIs, which makes focus impossible. Stale definitions fail to evolve with the service.

SLIs and product management

Product teams benefit from SLI understanding for prioritization (SLI data reveals where reliability work is needed), trade-offs (understanding costs of feature work versus reliability work), communication (discussing service quality with precision), and customer impact (connecting technical metrics to user experience).

Tools like Klero help connect SLI data to customer feedback. When customers report problems that SLIs don't capture, that is a valuable signal that SLI definitions may need updating. When SLIs degrade without complaints, the metrics may be measuring the wrong things.
