Chapter 8. Service-Level Agreements

Service-level agreements (SLAs) are all about expectations management. As discussed in Chapter 7, each service has different expectations around it. Many of these expectations are tied to the service tier of the service, but when we look deeper, the expectations are more specific than that.

SLAs as discussed in this book are not about legal or contractual agreements between a company and its customers; they’re agreements between teams and service owners. They provide a mechanism for communicating expectations between services.

SLA VERSUS SLO

In recent years, the term SLO, or service-level objective, has come into common usage. The distinction between SLA and SLO is that an SLA is used to describe a legal commitment to an external customer, while an SLO is used to describe the target for a service metric between teams. Using these definitions, agreements from one service to another such as those discussed in this chapter are more consistent with the term SLO. Technically, this is a valid distinction using this latest terminology.

However, I do not agree with this distinction. From my standpoint, it waters down the importance of service-to-service commitments by using what seems like a less committed term: SLO appears to describe a weaker commitment than SLA does. This is the heart of the problem. In my mind, the performance commitment that one team makes for its service to another team and its service deserves the same level of importance as a legal commitment to a customer. As such, I use the term SLA both for customer agreements and for internal service-to-service commitments.

For these reasons, in this book and especially in this chapter, you can safely assume that the terms SLO and SLA are mostly interchangeable.

In this chapter, we’re going to talk about SLAs and their use within the context of both external customers and internal customers. We will talk about SLAs as a method of gaining trust between service teams and how to use SLAs for interteam problem solving.

What Are SLAs?

SLAs are a commitment to provide a given level of reliability and performance. They are used to create a strong contractual relationship between service owners and consumers.

An overnight delivery service, for example, might have an SLA that states it will deliver a package before 9 a.m. the next morning. An airline might have an SLA expressing its ability to deliver checked baggage within 30 minutes after a flight arrives. A power company might have an SLA that states how fast it will fix power outages after a storm.

CUSTOMER EXPECTATIONS

Think back to the previous chapter and consider the online store application illustrated in Figure 7-4. Your customers expect the store to be operating when they want to use it—they expect it to be highly available. They also expect that the site will load fast so that they can use it without delay. Further, they expect the products they want to be available in your store. They expect you to have them in stock and available for shipment. Finally, they expect that when they place an order, the order will show up on their doorstep in a reasonable period of time.

Using the “Customer Expectations” example, each of these expectations can be expressed as an SLA:

Availability

Customers expect the store to be operational when they need it. You can express this as a minimum percentage of time that your store is operational. An example availability SLA might be, “Our store will be available at least 99.4% of the time.”

Load time

Customers expect the web page to load fast—they want the website to appear responsive. There are many ways you can express this, but in the simplest way, it can be expressed as the maximum amount of time a page will take to load—for instance, “Pages will load within 4 seconds 99% of the time” (see “Top Percentile SLAs”).

Products

Customers expect the products they want to be available in your store. They also expect those products to be in stock and ready for shipment. You might express this as a percentage, such as “A minimum of 80% of the products in our catalog are in stock.”

Shipment

Customers expect the products they order to arrive quickly. You might express this as the time from order until the product is shipped, or as the amount of time until a product appears on the customer’s doorstep. As an example, “We ship all products within 24 hours.”

All of these are examples of SLAs. Although they are all quite different in nature and meaning, they all fundamentally have the same purpose. They express an expectation of your application by your customers.
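To make the availability example concrete, here is a minimal sketch of how an availability percentage translates into a downtime budget. The function name and the 30-day period are assumptions chosen for illustration; they are not part of the store example itself.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Return the downtime budget, in minutes, implied by an availability SLA.

    The 30-day default period is an assumption for illustration.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


# Downtime budget for the "at least 99.4% of the time" example.
budget_minutes = allowed_downtime_minutes(99.4)
```

Under these assumptions, a 99.4% availability SLA leaves roughly 259 minutes (a little over four hours) of permitted downtime in a 30-day month.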

You can measure the actual performance of each of these things as your application runs and interacts with customers. You might generate charts and graphs that show your measurements over time. But the SLA is the agreed limit at which your service can be considered performing as expected. The chart in Figure 8-1 shows your store’s performance on product in stock, which is a measure of the percentage of the products that are in stock at any given point in time.

Figure 8-1. Performance compared to SLA

You can see from the chart that your in-stock percentage varies over time. You can also see your SLA line, representing your expected performance of 80%.

Most of the time, your in-stock percentage is above the SLA (we say you are meeting your SLA). However, one time in late summer it dropped below your 80% SLA for a short period of time (we say you have failed your SLA).

EXTERNAL SLAS AND CUSTOMER COMMITMENTS

Sometimes a business has contractual agreements with customers that require it to meet established SLAs, perhaps with financial or other consequences for failing to meet them.

Amazon Web Services, for example, has SLAs with its customers and in some cases provides financial compensation if it fails to meet those SLAs.

For example, with Amazon EC2 instances, if AWS’s monthly uptime percentage falls below 99.95%, it gives a service credit of 10% to affected customers. If it falls below 99.0%, AWS gives a service credit of 30%.1
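The tiered credit described above can be sketched as a simple lookup. The thresholds below are the ones quoted in this chapter; the function name is hypothetical, and the actual AWS terms may differ from these figures, so check the published SLA before relying on them.

```python
def service_credit_pct(monthly_uptime_pct: float) -> int:
    """Return the service credit percentage for a given monthly uptime.

    Thresholds are the figures quoted in the text, not an authoritative
    statement of current AWS terms.
    """
    if monthly_uptime_pct < 99.0:
        return 30
    if monthly_uptime_pct < 99.95:
        return 10
    return 0


# Three sample months: healthy, slightly degraded, badly degraded.
credits = [service_credit_pct(u) for u in (99.99, 99.5, 98.0)]
```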

Having SLAs for monitoring the ability of your application to perform for your customers can be useful for your internal business uses (making sure you perform as expected for your customers). Or, as AWS does, SLAs may be used for making financial commitments to customers. In either case, the SLA and the way you measure performance against the SLA are identical.

External Versus Internal SLAs

The “Customer Expectations” and “External SLAs and Customer Commitments” examples demonstrate the use of external SLAs. These are SLAs we might specify and monitor to describe how our application performs for our customers.

But SLAs can and should be used between individual services within your application. In this way, you can use them as mechanisms for communicating expectations and requirements between the owners and operators of individual services.

Why Are Internal SLAs Important?

Internal SLAs are critically important to the health and maintainability of complex multiservice systems.2 Why? Well, to put it simply, how can a service meet its commitments to its customers if the services it depends on are not meeting their commitments?3

How can you provide a 50 ms response to your customer when a service you require gives you a 90 ms response?

How can you provide 99% availability when a service you require provides only 90% availability?
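The availability question can be checked with a little arithmetic. The sketch below assumes that every listed dependency is required on every request and that failures are independent, so the availabilities multiply; both assumptions, and the function name, are ours rather than something the chapter specifies.

```python
from math import prod
from typing import Iterable


def availability_ceiling(dependency_availabilities: Iterable[float]) -> float:
    """Best availability a service can offer if every listed dependency is required.

    Assumes each dependency is needed on every request and that failures are
    independent, so the individual availabilities multiply.
    """
    return prod(dependency_availabilities)


# One required dependency at 90% caps the caller at 90%, below a 99% target.
ceiling = availability_ceiling([0.90])
can_promise_99 = ceiling >= 0.99
```

Under these assumptions, no amount of work on your own service lets you promise 99% while a required dependency delivers only 90%.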

SLAs as trust

SLAs are about building trust in a highly distributed and scalable way. When you trust a dependency can meet its expectations, you can set your own service’s expectations with confidence.

BUILDING TRUST

Consider the online store application illustrated in Figure 7-4. Imagine you and your team own the price and shipping cost calculator service. Your internal customers are the website frontend service and the checkout service. One of the primary operations they depend on you for is to look up the price of a product given the product number. Because these services use this to generate web pages for display to end customers, they need the price lookup to be fast. Your team makes an agreement to provide the price lookup uniformly within 20 ms of the request.

Now, for you to meet this commitment, you realize you need to have fast access to the catalog storage service, which contains the data you need to calculate the price. However, given your 20 ms commitment, you are concerned that this service might not be able to provide the data you need fast enough. The catalog storage service is owned by another team. How can you be sure that team will be able to meet your performance requirements? You have two choices.

The first choice is to contact the owning team and look deeply into how its service works, looking for performance issues and problems, and then analyze the team to make sure you trust it will be able to perform as you need. This, of course, is highly intrusive, very expensive, and not practical for a large organization.

The second choice is to negotiate with the owning team and agree on a performance SLA for its service. Suppose that you work with the team, and it agrees to a 10 ms response. You know that if it can respond that fast, you can meet your own 20 ms guarantee to your customers.

As long as the other team can perform to its SLA, you can perform to your SLA.

You can monitor the team’s performance against its SLA over time to see how well it does. If the team consistently meets its SLA, you have trust in your dependency, and you can now focus your energies on your service and what you need to do to ensure that you can continue to meet your 20 ms guarantee to your customers.
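The reasoning in this sidebar amounts to a latency budget. As a minimal sketch, assuming the dependency is called sequentially and its SLA figure is treated as the time it may consume, the time left for your own processing is the commitment minus the dependency SLAs. The function name is hypothetical.

```python
from typing import Sequence


def remaining_budget_ms(commitment_ms: float, dependency_slas_ms: Sequence[float]) -> float:
    """Time left for your own processing after sequential dependency calls.

    Assumes each dependency is called once, one after another, and may use
    its full SLA allowance. A negative result means the commitment cannot
    be met even with zero processing time of your own.
    """
    return commitment_ms - sum(dependency_slas_ms)


# The price lookup: a 20 ms commitment with a 10 ms catalog storage SLA.
price_lookup_budget = remaining_budget_ms(20, [10])

# The earlier question: a 50 ms commitment with a 90 ms dependency.
impossible_budget = remaining_budget_ms(50, [90])
```

A positive budget (10 ms here) is what makes the 20 ms guarantee plausible; the negative one shows why a 50 ms promise cannot sit on top of a 90 ms dependency.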

SLAs for Problem Diagnosis

SLAs also provide a way of determining where problems exist in a complex system. If a service is experiencing problems, one of the first things to check is whether its dependencies are meeting their SLA expectations. If a dependency is not meeting its expectations, that is a great place to begin diagnosing the problem with your service.

FINDING A PROBLEM

Consider the online store application illustrated in Figure 7-4. Imagine that you and your team own the price and shipping cost calculator service, as described in “Building Trust”.

Now suppose that you receive a call in the middle of the night. Your service has become sluggish in generating price lookups, and it’s affecting your company’s customers. You check your performance compared to your 20 ms performance guarantee. You find that you are now taking, on average, 500 ms for each lookup. This has substantially slowed your company’s storefront, and your customers are dissatisfied.

But what caused the problem? Is there something wrong in your service? Or is it one of your dependencies that is having the problem?

It could be your service is having some problem—perhaps with its hardware, perhaps somewhere else. But before you spend a lot of time trying to figure out what is wrong with your service, you check the performance of your dependencies.

Knowing that your service depends on the catalog storage service and that you have a 10 ms SLA guarantee with the owning team, you check its performance against this SLA. You see that it, too, is having a performance problem. Rather than taking less than 10 ms per call, calls to the catalog storage service are taking over 400 ms. Obviously, that team is experiencing a performance problem. You check and find that its on-call team is already engaged and working on this problem.

Realizing this is likely the cause of your performance problem, you begin tracking the other team’s progress toward resolving its problem. This makes more sense than spending valuable time fruitlessly trying to figure out what’s wrong with your service.

By having well-defined SLAs with all your service dependencies, you can much more easily track when your service is having a problem or when a dependent service is having a problem.
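The first diagnostic step in this sidebar, checking each dependency against its SLA, can be sketched as follows. The class, function, and service names are hypothetical, the 400 ms reading stands in for the "over 400 ms" observed in the story, and the second dependency is invented purely to show a healthy case.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class LatencySLA:
    service: str      # name of the dependency (hypothetical identifiers)
    limit_ms: float   # latency must stay below this value


def find_violations(slas: Sequence[LatencySLA], observed_ms: Dict[str, float]) -> List[str]:
    """Return the names of dependencies whose observed latency breaks their SLA."""
    violations = []
    for sla in slas:
        observed = observed_ms.get(sla.service)
        # Skip dependencies we have no measurement for.
        if observed is not None and observed >= sla.limit_ms:
            violations.append(sla.service)
    return violations


slas = [LatencySLA("catalog-storage", 10.0), LatencySLA("tax-rates", 15.0)]
observed = {"catalog-storage": 400.0, "tax-rates": 6.0}
suspects = find_violations(slas, observed)
```

Here the only suspect is the catalog storage service, which matches the conclusion reached in the sidebar.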

Performance Measurements for SLAs

There are many measures of performance that services can use, and the specific measures used can and should vary based on the service consumer’s and owner’s needs and requirements. Here are some example types of performance measures:

Call latency

This is a measure of how long a service call takes to process a request and return a response. Typically measured in milliseconds or microseconds, it is important for the consumer of a service to know how long it takes for a request to be processed, because that time will be part of the total time the consumer takes to process its request. This is the type of SLA used in the previous section’s sidebars.

Traffic volume

This is a measure of how many requests a service can handle over a period of time. Typically measured in requests/second, a service owner must know how much traffic to expect from a consumer in order to meet its expectations.

Uptime

This is a measure of how much time a service is expected to be up, healthy, and free of major problems. Typically calculated as a percentage, it is a measure of how available the service has been over a specified period of time (typically a day, month, or year).

Error rates

This is a measure of how many failures a service generates over a period of time. Typically measured as a percentage, it is normally the number of failed requests divided by the total number of requests processed over a given time period.
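Three of these measures are simple ratios. The sketch below shows one way to compute them, assuming you already have request counts and downtime totals for the period; the function names are hypothetical, and call latency is left out because it is better summarized with percentiles.

```python
def error_rate_pct(failed_requests: int, total_requests: int) -> float:
    """Failed requests as a percentage of all requests in the period."""
    if total_requests == 0:
        return 0.0  # no traffic, so no observed failures
    return 100.0 * failed_requests / total_requests


def uptime_pct(downtime_seconds: float, period_seconds: float) -> float:
    """Percentage of the period during which the service was up."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return 100.0 * (period_seconds - downtime_seconds) / period_seconds


def requests_per_second(total_requests: int, period_seconds: float) -> float:
    """Average traffic volume over the period."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return total_requests / period_seconds
```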

Limit SLAs

A limit SLA typically specifies an operational limit that is expected to be met. If actual performance is better than this limit, we have met our SLA. If actual performance is worse than this limit, we have failed our SLA. The limit itself is the value of the SLA.

For example, “call rate must be <1,000 reqs/sec” specifies a limit SLA on the expected traffic volume of a service. If the actual traffic volume is less than the specified limit, then the service has met its SLA.

As another example, “service will be operational for at least 99.5% of the time” specifies an availability requirement of a service. If the availability of the service is at or above the specified percentage, then the service has met its SLA.

You can apply a limit SLA to most types of performance measures.
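A limit SLA is just a threshold plus a direction of comparison. The following is a minimal sketch of that idea using the two examples from this section; the `LimitSLA` class and its field names are assumptions made for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LimitSLA:
    name: str
    limit: float
    # Comparison applied as is_met_by(observed, limit).
    is_met_by: Callable[[float, float], bool]

    def check(self, observed: float) -> bool:
        """Return True if the observed value satisfies the SLA."""
        return self.is_met_by(observed, self.limit)


# "call rate must be <1,000 reqs/sec"
call_rate = LimitSLA("call rate (reqs/sec)", 1000.0, operator.lt)

# "service will be operational for at least 99.5% of the time"
availability = LimitSLA("availability (%)", 99.5, operator.ge)
```

Keeping the comparison explicit avoids confusion between metrics where lower is better (call rate, latency, error rate) and metrics where higher is better (availability).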

Top Percentile SLAs

Limit SLAs are great when you can measure a value and have a guarantee that the value stays better than that limit at all times. These types of SLAs are great for expressing availability, uptime, and error rates.

Another type of SLA measurement is a top percentile SLA. You use it to measure performance of an event when the actual performance of that event typically varies considerably.

Top percentile SLAs are great for measuring metrics such as call latency. The amount of time a request to a service takes to generate a response can vary wildly, and most of the time we don’t care whether every request can be handled in less than a specific period of time as long as most requests are handled in less than a specific period of time.

A top percentile SLA is expressed as a percentage of the total data points that are above/below a specific value. The SLA is usually written like this:

TP[percentile] is less than [value]

Here’s an example:

TP90 is less than 20 msec

This can be read as “90% of all requests will take less than 20 ms.”

Often, we will calculate the performance of an event, such as the call latency to a service, and express it as an actual top percentile for the service.

As an example, suppose that we have a service that responds to service calls. Over a period of time, we have observed the latencies for these service calls shown in Figure 8-2.

We can chart these values, as shown in Figure 8-3.

Figure 8-2. Table of service call latencies

Figure 8-3. Chart of service call latencies

Using this data, we can calculate several top latency values for this service:

TP90

This is the value that 90% of the latency values are below. In this example, 90% of the data is 18 data points. Removing the top 2 data points (45 ms and 32 ms) will leave us with 18 data points, the highest value of which is 30 ms. So we can say our TP90 is 30 ms.

TP80

This is the value that 80% of the latency values are below. In this example, that means removing the top four data points (45, 32, 30, and 28 ms). Among the remaining 16 data points, the highest one is 22 ms. So we can say our TP80 is 22 ms.

Continuing on, here are several TP values representing that data:

TP95 = 32 msec

TP90 = 30 msec

TP80 = 22 msec

TP50 = 14 msec

There are some other occasionally useful values to use:

TPmax = 45 msec (maximum value)

TPmin = 4 msec (minimum value)

TPavg = 18 msec (average value)

The top percentiles can of course change over time. After you have calculated them, you can use a limit SLA to define expectations. For instance, in this example, your service might have the following SLA:

TP90 < 35 msec

If it did, the service would have met its SLA. However, if it had committed to the following SLA:

TP80 < 20 msec

the service would not be meeting its SLA (the current TP80 is 22 ms). So the service would have failed its SLA.
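The calculation described in this section (drop the top slice of data points, then take the highest remaining value) can be sketched in a few lines. The latency values below are hypothetical: they are not the contents of Figure 8-2, and were chosen only so that the results agree with the TP values quoted above. Other percentile definitions exist (for example, ones that interpolate between points), so treat this as one reasonable reading of the method rather than the only one.

```python
from typing import Sequence


def top_percentile(values: Sequence[float], pct: int) -> float:
    """Highest value that remains after discarding the top (100 - pct)% of points."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(values)
    keep = max(1, (len(ordered) * pct) // 100)  # number of points retained
    return ordered[keep - 1]


# Hypothetical sample of 20 call latencies in ms (NOT the data from Figure 8-2).
LATENCIES_MS = [4, 7, 10, 11, 12, 12, 13, 14, 14, 14,
                16, 17, 18, 20, 21, 22, 28, 30, 32, 45]

tp90 = top_percentile(LATENCIES_MS, 90)
tp80 = top_percentile(LATENCIES_MS, 80)

meets_tp90_sla = tp90 < 35   # "TP90 < 35 msec"
meets_tp80_sla = tp80 < 20   # "TP80 < 20 msec"
```

With this sample, the TP90 SLA is met and the TP80 SLA is not, mirroring the two outcomes described above.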

SLA Conditionals

SLAs sometimes are expressed in a way that makes them conditional on another metric. For example, a service might be able to guarantee a specific latency, but only if the call volume stays within a reasonable amount. So an SLA may be expressed as follows:

Call Latency TP90 < 25 msec when Traffic Volume < 250k req/sec

Here, in order to meet our SLA, the TP90 for call latency must be less than 25 ms when traffic volume is below 250k req/sec. If the traffic volume is above that rate, then call latency can be any value.
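A conditional SLA adds a guard in front of the usual limit check. Here is a minimal sketch using the numbers from the example; the function and parameter names are hypothetical.

```python
def conditional_latency_sla_met(
    tp90_ms: float,
    traffic_req_per_sec: float,
    latency_limit_ms: float = 25.0,
    traffic_limit_req_per_sec: float = 250_000.0,
) -> bool:
    """Check 'TP90 < 25 msec when traffic volume < 250k req/sec'."""
    if traffic_req_per_sec >= traffic_limit_req_per_sec:
        # The condition is not in force, so any latency is acceptable.
        return True
    return tp90_ms < latency_limit_ms


within_limits = conditional_latency_sla_met(tp90_ms=20, traffic_req_per_sec=100_000)
slow_under_normal_load = conditional_latency_sla_met(tp90_ms=30, traffic_req_per_sec=100_000)
slow_under_overload = conditional_latency_sla_met(tp90_ms=30, traffic_req_per_sec=300_000)
```

Only the second case is a violation: the service was slow while traffic was within the agreed volume. In the third case the consumer exceeded the traffic condition, so the latency commitment does not apply.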

How Many and Which Internal SLAs?

As you build your service, a question you might ask is, how many internal SLAs should I define for my service?

First, keep the number as low as possible. Understanding the meaning of SLAs and their effect becomes very complicated as the number of SLAs increases.

Ensure that you have covered all critical areas within your service. You should have appropriate SLAs for all major pieces of functionality and especially for the areas that are critical to your business.

You should negotiate your SLAs with the consumers of your services, as an SLA that does not meet a consumer’s needs is an irrelevant SLA. However, as much as possible, use the same SLA for all consumers. Your service should have, as much as possible, a single set of SLAs that should meet the needs of all your consumers. Having a set of SLAs created per-consumer adds significantly to your complexity and doesn’t provide any real benefit.

You should only specify SLAs that you can actually monitor and alert on. There is no value in specifying an SLA if you cannot validate whether you are hitting it. You also need to know when your service violates an SLA, because a violation should be a leading indicator of a problem, so make sure you receive an alert when one occurs.

You might want to monitor and alert on values over and above those that you report as internal SLAs. This data can be useful in finding and managing problems in your service without actually being a committed value to your consumers.

You should build a dashboard that contains all of your SLAs and monitors so that you can see at a glance if you are experiencing any problems, and you should make this dashboard available to all the consumers of your service so that they can see how well your service is performing.

Additionally, ensure that you have access to the dashboards for all of your dependencies so you can monitor whether they are having problems, which might or might not be affecting your service.

Why Internal SLAs Are Important

Monitoring and using SLAs can quickly become overwhelming, and you can easily become caught up in the minute details of SLA monitoring.

Perfect, all-inclusive SLA monitoring is not our goal. Having a number you can use to compare is the goal. Any number is better than no number. The purpose of internal SLAs is not to add up numbers but to provide guidance for you and your dependencies, and to help set expectations between teams appropriately.

Internal SLAs are a critical component in your ability to scale your application size so that more development teams can be utilized in developing and managing your applications. This improves complexity scaling and overall application availability.

SLAs can and should become part of the language you use when talking to other teams.

1 You can find more details on how AWS calculates this SLA and the credit at https://aws.amazon.com/ec2/sla.

2 Or SLOs. This is where the modern distinction between SLAs and SLOs described earlier in this chapter may apply.

3 See Chapter 6 for more information on team-level ownership of services.
