Chapter 11. Building Systems with Reduced Risk

In Chapter 9, we learned how to mitigate risks that exist within your system and applications. However, there are things you can do to proactively build your applications with a reduced risk profile. This chapter discusses the following techniques:

Technique #1: Introduce Redundancy

Building in redundancy allows you to survive issues that would otherwise cause outages but potentially at the cost of system complexity.

Technique #2: Understand Independence

It’s important and useful to know what it means for components to be independent and to understand the (sometimes hidden) dependencies among services, resources, and system components.

Technique #3: Manage Security

Bad actors are an increasingly common cause of availability issues and introduce significant risk to modern applications.1

Technique #4: Encourage Simplicity

Complexity is the enemy of stability. The more complex your application, the easier it is for a problem to occur.

Technique #5: Build in Self-Repair

Even when problems do occur, the more automated your repair processes, the less impact a given problem will have on your customers.

Technique #6: Standardize on Operational Processes

Variation in the way you do business can introduce risk and ultimately can cause availability issues. Standardized, documented, and repeatable processes decrease the likelihood of manual mistakes causing outages.

This is far from an exhaustive list, but it should at least get you thinking about risk reduction as you build and grow your applications.

Technique #1: Introduce Redundancy

Building in redundancy is an obvious step toward improving the availability and reliability of your application, and it inherently reduces your risk profile as well. However, redundancy can add complexity to an application, and that complexity can itself increase risk. It is therefore important to control the complexity of the added redundancy so that it yields a measurable improvement to your risk profile.

Here are some examples of “safe” redundancy improvements:

· Design your application so that it can safely run on multiple independent hardware components simultaneously (such as parallel servers or redundant data centers).

· Design your application so that you can run tasks independently. This aids recovery from failed resources without necessarily adding significant complexity to the application.

· Design your application so that you can run tasks asynchronously. This makes it possible for tasks to be queued and executed later without impacting the main application processing.

· Localize state into specific areas. Reducing the need for state management in other parts of your application improves your ability to utilize redundant components.

· Utilize idempotent interfaces wherever possible. Idempotent interfaces are interfaces that can be called repeatedly to ensure an action has taken place, without worrying about the implications of the action being executed more than once.

Idempotent interfaces facilitate error recovery by using simple retry mechanisms.

Idempotent Interfaces

An idempotent interface is an interface that can be called multiple times with the same outcome: duplicate calls have no additional effect beyond the first. A non-idempotent interface, by contrast, has an effect each and every time it is called.

The best way to understand this is by example.

The following sidebar describes an idempotent interface. You can call the command “Set the current speed of the car to 35 mph” any number of times. Each time you call it, the car speed is set to 35 mph. No matter how many times you call the interface, the car remains running at 35 mph.


Let’s assume you have a smart car. The car supports an API that allows you to change the speed of the car. The API provides an interface that allows you to issue the following command:

> Set the current speed of the car to 35 mph

Issuing this command causes the car to set its speed to 35 mph.

This next sidebar describes a non-idempotent interface. Every time you call the interface, you change the speed of the car by the specified amount. If you call the interface the correct number of times with the correct values, you can set your car to travel at 35 mph.


Let’s assume you have another smart car. This car also supports an API that allows you to change the speed of the car. This car, however, has a different interface for the API. This API’s interface allows you to issue the following command:

> Increase the speed of the car by 5 mph

By calling this API seven times, for example, you can change your speed from zero to 35 mph.

However, every time you call the interface, the car changes speed by the specified amount. If you keep calling the car with the command “increase the speed of the car by 5 mph,” the car will keep going faster and faster with each call. In this case, it matters how many times you call the interface, so this is a non-idempotent interface.

With an idempotent interface, a “driver” of this smart car only has to tell the car how fast it should be going. If, for some reason, it believes the request to go 35 mph did not make it to the car, it can simply (and safely) resend the request until it is sure the car received it. The driver can then be assured that the car is, in fact, going 35 mph.

With a non-idempotent interface, if a “driver” of the car wants the car to go 35 mph, it sends a series of commands instructing the car to accelerate until it’s going 35 mph. If one or more of those commands fails to make it to the car, the driver needs some other mechanism to determine the current speed of the car and decide whether to reissue an “increase speed” command or not. It cannot simply retry an increase speed command—it must figure out whether it needs to send the command or not. This is a substantially more complicated—and error-prone—procedure.

Using idempotent interfaces lets the driver perform simpler operations that are less error prone than using a non-idempotent interface.
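As a sketch, the two car APIs might look like the following in code. The `SmartCar` class and its methods are hypothetical, invented purely to illustrate the distinction:

```python
class SmartCar:
    """Hypothetical smart-car API illustrating the two interface styles."""

    def __init__(self):
        self.speed = 0

    def set_speed(self, mph):
        """Idempotent: repeating the call leaves the car at the same speed."""
        self.speed = mph

    def increase_speed(self, mph):
        """Non-idempotent: every call changes the car's state."""
        self.speed += mph


car = SmartCar()

# Idempotent: safe to retry. Even if we resend the request because we
# are unsure it arrived, the car still ends up at exactly 35 mph.
for _ in range(3):
    car.set_speed(35)
assert car.speed == 35

# Non-idempotent: retries compound. Seven calls take the car from 0 to
# 35 mph, but a single accidental duplicate pushes it to 40 mph.
car.speed = 0
for _ in range(7):
    car.increase_speed(5)
assert car.speed == 35
car.increase_speed(5)  # a duplicate "retry" overshoots
assert car.speed == 40
```

The asserts make the point concrete: `set_speed` can be blindly retried, while a duplicated `increase_speed` changes the outcome.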

Redundancy Improvements That Increase Complexity

What are some examples of redundancy improvements that increase complexity? In fact, there are many that might seem useful, but their added complexity can cause more harm than good, at least for most applications.

Consider the example of building a parallel implementation of a system so that if one fails, the other one can be used to implement the necessary features. Although this might be necessary for some applications for which extremely high availability is important (such as the Space Shuttle example in Chapter 2), it often is overkill and results in increased complexity as well. Increased complexity means increased risk.

Another example is overtly separated activities. Using a microservice architecture is a great model to improve the quality of your application and hence reduce risk. Chapter 3 contains more information on using services and microservices. However, if taken to an extreme, building your systems into too finely decomposed microservices can result in an overall increase in application complexity, which increases risk.

Technique #2: Understand Independence

Multiple components utilizing shared capabilities or components may present themselves as independent components, but in fact they are all dependent on a common component, as shown in Figure 11-1.

Figure 11-1. Dependency on shared components reduces independence

If these shared components are small or unknown, they can inject single points of failure into your system.

Consider an application that is running on five independent servers.

You are using five servers to increase availability and reduce the risk of a single server failure causing your application to become unavailable. Figure 11-2 shows this application.

Figure 11-2. Independent servers…

But what happens if those five servers are actually five virtual servers all running on the same hardware server? Or if those servers are running in a single rack? What happens if the power supply to the rack fails? What happens if the shared hardware server fails?

As illustrated in Figure 11-3, your “independent servers” might not be as independent as you think.

Figure 11-3. …aren’t as independent as you think

Technique #3: Manage Security

Bad actors have always been a problem in software systems. Security and security monitoring have always been a part of building systems, even before large-scale web applications came about.

However, web applications have become larger and more complicated, storing larger quantities of data and handling larger quantities of traffic. Combined with the increased value of the data available within these applications, this has led to a huge increase in the number of bad actors attempting to compromise them. Compromises can be directed at acquiring highly sensitive private data, or at bringing down large applications and making them unavailable. Some bad actors do this for monetary gain, while others are simply in it for the thrill. Whatever the motivation, whatever the result, bad actors are becoming a bigger problem.

Web application security is well beyond the purview of this book, but implementing high-quality security is imperative to both ensuring high availability and mitigating risk for highly scaled applications. The point here is that you should include the security aspects of your application in your risk analysis and mitigation, as well as in your application development process; the specifics of how to do so are beyond the scope of this book.

Technique #4: Encourage Simplicity

Complexity is the enemy of stability. The more complex a system becomes, the less stable it is. The less stable a system is, the riskier it becomes, and the lower the availability it is likely to have.

Although our applications are becoming larger and significantly more complicated, keeping simplicity in the forefront as you architect and build your application is critical to keeping the application maintainable, secure, and low risk.

One common place where modern software construction principles tend to increase complexity more than is perhaps necessary is in microservice-based architectures. Microservice architectures substantially reduce the complexity of individual components, making it possible for individual services to be easily understood and built using simpler techniques and designs. However, although they reduce the complexity of each microservice, they increase the number of independent modules (microservices) needed to build a large-scale application. A larger number of independent modules working together increases the interdependence among the modules and the overall complexity of the application.

It is important as you build your microservice-based application that you manage the trade-off between simpler individual services and more complex overall system design.

Technique #5: Build in Self-Repair

Building self-righting and self-repairing processes into our applications can reduce the risk of availability outages.

As discussed in Chapter 1, if you strive for 5 nines of availability, you can afford no more than 26 seconds of downtime every month. Even if you strive for only 3 nines of availability, you can afford only 43 minutes of downtime every month. If a failure of a service requires someone to be paged in the middle of the night to find, diagnose, and fix the problem, those 43 minutes are eaten up very quickly. A single outage can result in missing your monthly 3 nines goal. And to maintain 4 nines or 5 nines, you have to be able to fix problems without any human intervention at all.

This is where self-repairing systems come into play. Self-repairing systems sound like high-end, complex systems, but they don’t have to be. A self-repairing system can be nothing more than a load balancer in front of several servers that quickly reroutes a request to a new server if the server originally handling the request fails.
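That load-balancer style of self-repair can be sketched in a few lines. The server callables here are hypothetical stand-ins for real backends:

```python
def handle_request(request, servers):
    """Try each server in turn; reroute to the next one if a server fails.

    `servers` is a list of callables; a failed server raises an exception.
    This is self-repair in its simplest form: the client never sees the
    failure as long as at least one server is healthy.
    """
    last_error = None
    for server in servers:
        try:
            return server(request)
        except Exception as err:
            last_error = err  # server failed; reroute to the next one
    raise RuntimeError("all servers failed") from last_error


def failed_server(request):
    raise ConnectionError("server down")

def healthy_server(request):
    return f"handled {request}"

print(handle_request("GET /", [failed_server, healthy_server]))
# → handled GET /
```

The caller sees a successful response even though the first server failed; no human had to intervene.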

There are many levels of self-repairing systems, ranging from simple to complex. Here are a few examples:

· A “hot standby” database that is kept up to date with the main production database. If the main production database fails or goes offline for any reason, the hot standby automatically picks up the “master” role and begins processing requests.

· A service that retries a request if it gets an error, anticipating that perhaps the original request suffered a transient problem and that the new request will succeed.

· A queuing system that keeps track of pending work so that if a request fails, it can be rescheduled to a new worker later, increasing the likelihood of its completion and reducing the chance that the work is lost.

· A background process (for example, something like Netflix’s Chaos Monkey) that goes around and introduces faults into the system, and then the system is checked to make sure it recovers correctly on its own.

· A service that requests multiple, independently developed and managed services to perform the same calculation. If all services return the same result, the result is used. If one or more services return a different result than the majority, that result is thrown away and the faulty service(s) are shut down for repairs.

These are just some examples. Note that the more involved systems at the end of the list also add much more complexity to the system. Be careful of this. Use self-repairing systems where you can to provide significant improvement in risk reduction for a minimal cost in complexity. But avoid complicated systems and architectures designed for self-repair that provide a level of reliability higher than you really require, at the cost of increasing the risk and failures that the self-repair system itself can introduce.
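The retry-on-transient-error pattern from the list above can be sketched as follows. Names like `flaky_operation` are invented for illustration, and note that the pattern is only safe with idempotent operations (see Technique #1):

```python
import time

def retry(operation, attempts=3, base_delay=0.1):
    """Retry a failing operation, assuming failures may be transient.

    Only safe when `operation` is idempotent: retrying a non-idempotent
    request could perform the action more than once.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # retry budget exhausted; surface the error
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff


calls = {"count": 0}

def flaky_operation():
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("transient failure")
    return "ok"

print(retry(flaky_operation))  # → ok (after two transient failures)
```

The exponential backoff gives a struggling dependency time to recover instead of hammering it with immediate retries.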

Technique #6: Standardize on Operational Processes

Humans are involved in our software systems, and humans make mistakes. By using solid operational processes, you can minimize the impact of human involvement in your system; reducing human access to areas where interaction is not required reduces the likelihood of mistakes.

Use documented, repeatable processes to reduce one significant aspect of the human involvement problem—human forgetfulness: forgetting steps, executing steps out of order, or making a mistake in the execution of a step.

But documented, repeatable processes reduce only that one aspect of the human involvement problem. Humans introduce other problems as well: they make mistakes, they fat-finger the keyboard, and they think they know what they are doing when they really don’t. They perform unrepeatable and unauditable actions, and they can take harmful actions in emotional states.

The more you can automate the processes that humans normally perform in your production systems, the fewer mistakes that can be introduced, and the higher the likelihood that the tasks will work.


Suppose that you regularly reboot a server (or series of servers) for a specific purpose (we won’t provide commentary on whether this is a good idea operationally).

You could simply have the user log in to the server, become a superuser, and execute the “reboot” command. However, this introduces several problems:

· You now have to give the ability to log in to your production servers to anyone who might need to perform that command. Further, they must have superuser permission to execute the reboot command.

· While someone is logged in as a superuser to the server, they could accidentally execute another command, one that causes the server to fail.

· While someone is logged in as a superuser to the server, they could act as a bad actor and execute something that would intentionally bring harm to the server, such as running rm -rf / on Linux.

· You will likely have no record that the action occurred, and no record of who did the reboot and why.

Instead of using the manual process to reboot the server, you could implement an automated process that performs the reboot. In addition to doing the reboot, it could provide the following benefits:

· It would reduce the need to give login credentials to your production servers, eliminating both the likelihood of mistakes as well as the likelihood of bad actors doing bad things.

· It could log all actions taken to perform the reboot.

· It could log who requested the reboot.

· It could validate that the person who requested the reboot has permissions to do the reboot (fine-grained permissions—you could grant access to reboot the server to a group of people without giving them any additional access rights).

· It could make sure that any other necessary actions occur before the server is rebooted—for instance, temporarily removing the server from the load balancer, shutting down the running applications gracefully, and so on.

You can see that by automating this process, you avoid mistakes and gain more control over who performs the operation and how it is performed.
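A minimal sketch of such an automated reboot process might look like the following. The permission list, load-balancer hooks, and command runner are all hypothetical stand-ins for your real infrastructure:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("reboot")

# Hypothetical fine-grained permission list: these users may request a
# reboot without holding any broader production access.
REBOOT_ALLOWED = {"alice", "bob"}

def automated_reboot(server, requested_by, load_balancer, run_command):
    """Sketch of an automated reboot: permission check, audit log,
    and pre-reboot steps, with no direct login to the server."""
    if requested_by not in REBOOT_ALLOWED:
        log.warning("reboot of %s DENIED for %s", server, requested_by)
        raise PermissionError(f"{requested_by} may not reboot servers")

    log.info("reboot of %s requested by %s", server, requested_by)
    load_balancer.remove(server)           # stop sending it traffic first
    run_command(server, "shutdown-apps")   # drain/stop apps gracefully
    run_command(server, "reboot")
    load_balancer.add(server)              # restore once it is back
    log.info("reboot of %s completed", server)


class FakeLoadBalancer:
    """Stand-in for a real load balancer API (hypothetical)."""
    def __init__(self):
        self.pool = {"web-1", "web-2"}
    def remove(self, server):
        self.pool.discard(server)
    def add(self, server):
        self.pool.add(server)


lb = FakeLoadBalancer()
actions = []
automated_reboot("web-1", "alice", lb, lambda s, cmd: actions.append((s, cmd)))
assert actions == [("web-1", "shutdown-apps"), ("web-1", "reboot")]
assert "web-1" in lb.pool
```

Every benefit listed above maps to a line here: the permission check, the audit log entries, and the pre-reboot drain all happen automatically, every time.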


Automated processes are repeatable processes. Repeatable processes are tested processes. Tested processes tend to have fewer errors than ad hoc processes. It truly is that simple.

Reducing risk in systems you are building involves implementing standard techniques that are designed to reduce risk. These techniques are simple but effective ways of reducing risk and hence increasing the availability of your application.

1 Bad actors are individuals who attempt to harm or compromise a system for illicit purposes.
