Architecting for Scale is about modernization. It’s about building and updating your critical applications to meet the needs of your increasingly demanding digital customers. It’s about high availability. It’s about architecting your applications using modern development and operations techniques. It’s about organizing your development teams to help your applications, and your business, succeed. It’s about scaling to your biggest days. And it’s about utilizing the resources available to you in the cloud to meet these challenges.
The process of architecting for scale is so much more than handling a large volume of traffic.
Who Should Read This Book
This book is intended for architects, managers, and directors who build and operate large-scale applications and systems, whether in an engineering or an operations organization. If you manage software developers, system reliability engineers, or operations teams, or you run an organization that contains large-scale applications and systems, the suggestions and guidance provided in this book will help you make your applications run smoother and more reliably.
If your application started small and has seen incredible growth (and is now experiencing some of the growing pains associated with that growth), you might be suffering from reduced reliability and reduced availability. If you struggle with managing technical debt and associated application failures, this book will provide guidance in reducing that technical debt to make your application able to handle larger scale more easily.
Why I Wrote This Book
After spending seven years working at Amazon building highly scaled applications in both the retail and the Amazon Web Services (AWS) worlds, I moved to New Relic, which was in the midst of hypergrowth. The company felt the pain of needing the systems and processes required to manage highly scaled applications, but it hadn’t yet fully developed the processes and disciplines to scale its application.
At New Relic, I saw firsthand the struggles of a company going through the process of trying to scale, and I realized that there were many other companies experiencing the same struggles every day.
Now I travel all over the world, talking to customers and other people just like you about the cloud, about scaling, about availability, and about the critical process of building modern applications. I give presentations, panel discussions, classes, and seminars. I talk one on one with engineering leaders and executives to both help them achieve their goals and learn from them what works and what doesn’t work. I write articles. I give interviews. I participate in podcasts.
My intent with this book is to help others working with high-growth applications to learn processes and best practices that can assist them in avoiding the pitfalls awaiting them as they scale.
Whether your application is growing tenfold or just 10% each year, and whether the growth is in number of users, number of transactions, amount of data stored, or code complexity, this book can help you build and maintain your application to handle that growth, while maintaining a high level of availability.
A Word on Scale Today
As applications grow, two things begin to happen: they become significantly more complicated, and they handle significantly larger traffic volumes.
Increased complexity means increased brittleness. More traffic means more novel and complex mechanisms to manage the traffic.
Application developers seldom build scalability into their applications from the beginning. We often think we have built in scalability, and we believe we’ve done what was necessary to let our application scale to the highest levels we can imagine. But more often than not, we find faults in our logic and in our applications. These faults appear only after we begin to see scaling problems, and that makes scaling to larger traffic volumes and larger datasets more difficult.
This leads to even greater complexity and even more brittleness.
Ultimately, this scale/brittleness/scale/complexity cycle turns into a death spiral for an application, as it experiences brownouts, blackouts, and other quality-of-service and availability problems.
But these are your problems. Your customers don’t care about these issues. They just want to use your application to do the job they expect it to do. If your application is down, slow, or inconsistent, customers will simply abandon it and seek out competitors that can handle their business.
How can we improve the scalability of our applications, even when we begin to reach these limits? Obviously, the sooner we consider scalability in the lifecycle of an application, the easier it will be to scale. Yet we don’t want to overarchitect our applications for scalability beyond what is required. At any point during the lifecycle, there are many techniques you can use to improve the scalability of your application.
But before you can consider techniques for scaling your application, you must get your application availability in shape. Nothing else matters until you make this leap and make these improvements. If you do not implement these changes now, up front, you will find that as your application scales, you will begin to lose sight of how it’s working, and random, unexpected problems will begin occurring. These problems will create outages and data loss and will significantly affect your ability to build and improve your application. Furthermore, as traffic and data increase, these problems simply become worse. Before doing anything else, get your availability and risk management in order.
What’s New in the Second Edition
While most of the concepts discussed in this book are timeless, some (such as serverless computing) have had to be updated to reflect industry changes over the last four years.
Additionally, I’ve spent the last several years traveling around the world talking and speaking about these topics. I’ve learned a lot from various interactions with customers and other experts, and I’ve incorporated much of what I’ve learned into this edition.
An extensive update on cloud utilization has also been added to this book.
Finally, the content has been significantly restructured and reorganized from the first edition to make the information more accessible and relevant.
Using the Cloud
Cloud-based services are growing and expanding at extremely high speeds. Software as a Service (SaaS) is becoming the norm for application development, primarily because of the need for providing these cloud-based services. SaaS applications are particularly sensitive to scaling issues due to their multi-tenant nature.
As our world changes and we focus more and more on SaaS services, cloud-based services, and high-volume applications, scaling becomes increasingly important. There does not seem to be an end in sight to the size and complexity to which our cloud applications can grow.
The very mechanisms that are state of the art today for managing scale will be nothing more than basic tenets tomorrow, and the solutions to tomorrow’s scaling issues will make today’s solutions look simplistic and minimalistic. Our industry will demand more and more complex systems and architectures to handle the scale of tomorrow.
Naturally, as time goes on, some material in this book will become dated. My intent is to provide as much content as possible that stands the test of time.
Services Versus Microservices
There is much controversy in the industry about use of the terms service and microservice. I personally do not like the term microservice because it implies a specific sizing of a service that is not necessarily a healthy assumption. Many services are small, and some are truly “micro,” but many are much larger too. The appropriate size determination is based on context and is subject to many concerns and criteria,[1] and in my mind the use of the term microservices biases this discussion. However, I recognize that the term microservice has gained strong popularity in the industry.
There are also people who pigeonhole use of the term service as part of the term SOA and further pigeonhole these terms to refer to a particular type of architecture offering that was popular a decade or more ago. I find these comparisons inaccurate and confusing.
My personal preference is to use the term service, but I recognize many people use the term microservice. So I tend to use both terms in my discussions with other companies, depending on context. In my mind, both terms mean the exact same thing.
There is another use of the word service, though, that is worth discussing. This is when you refer to an external service, such as when you say “Amazon offers the Amazon S3 service.” This use of the word service seems distinct, but in reality it is the same thing. A “service” is a software module that provides a very specific piece of functionality and the data that supports that functionality. Whether the service is written by your developers or by engineers over at Amazon is irrelevant. I do recognize that sometimes it is important to distinguish between these two types of services, however.
So this is how I will use these terms in this book. I’ll use both terms interchangeably, depending on context. You will definitely see my bias toward the word service in this book. You should assume both terms mean the exact same thing. When I am referring to a specific type of service provided by another company, such as a cloud service, I will so indicate. In these cases, you will see the use of terms such as “AWS service” or “cloud service” or “SaaS service.”
Modern Digital Customer Experiences
In our modern digital world, software applications become the face of our brand and our company. The way our customers interact with us is through our software. Our applications aren’t just part of the customer experience. In many cases, they are the entire customer experience. Software is critical to our success, and modern customers expect our applications to also be modern. How our customers perceive our brands and our company depends greatly on how they perceive our software.
A NON-MODERN APPLICATION
Consider this example: my son has an application on his smartphone that he has to use to get some of his medical benefits. It is a government application, built and run by the US government.
This application doesn’t work all the time. When you launch the application at an odd time of day, you get an error message. The error message says, “This application is only available to use between the hours of 9–5, Monday–Friday, Eastern Time.”
Yep, that’s right. This is a mobile software application on his smartphone, and the software is disabled except during East Coast business hours.
Can your business operate with an application such as this? Can it operate with this type of restriction on its use? Can any commercial business put limits like this on its customers and stay in business?
No, I bet there isn’t a single commercial enterprise out there that can survive and treat its customers this way. Instead, we have to provide our customers with memorable customer experiences. Our applications must work whenever our customers want to use them. Everything needs to work 100% of the time, 24 hours a day, 7 days a week. If not, we disappoint our customers, and disappointed customers go away.
Navigating This Book
Managing scale is not only about managing traffic volume—it also involves managing risk and availability. Often, all these things are different ways of describing the same problem, and they all go hand in hand. Thus, to properly discuss scale, we must also consider availability, risk management, team/organization processes, and modern architecture paradigms such as microservices and cloud computing.
As such, this book is organized into five major parts, each representing a major tenet of architecting for scale. Let’s take a look at each of these.
Tenet 1. Availability: Maintaining Availability in Modern Applications
Modern software must maintain a high level of availability. Customers will not tolerate outages. If your application does not function when your customer needs it, they will not remain a customer for long.
Part I discusses the importance of application availability to our customers, and how it is impacted by application scaling. Understanding, measuring, and improving availability are the focus of these chapters.
Chapters in this part include:
· Chapter 1, Understanding, Measuring, and Improving Your Availability
· Chapter 2, Two Mistakes High—Having Room to Recover from Mistakes
Tenet 2. Modern Application Architecture: Using Services
Modern software requires the use of modern application architectures. Modern application architectures require moving away from monolithic applications and embracing service-based architectures.
Monolithic applications are extremely hard to scale, both from a traffic scaling standpoint and from the standpoint of your ability to scale the size of your organization to work on the application. The larger the monolith, the slower it is to make changes to the application, the fewer the people who can work on it and manage it effectively, and the greater the likelihood that traffic variations and growth will negatively impact availability.
Service-oriented architectures solve these problems by providing greater flexibility in scaling based on traffic needs. In addition, they provide a scalable framework to allow larger development organizations to work on the application, allowing the applications themselves to get larger and more complex.
Chapters in Part II include:
· Chapter 3, Using Services
· Chapter 4, Services and Data
· Chapter 5, Dealing with Service Failures
Tenet 3. Organization: Scaling Your Organization for Modern Applications
You cannot build modern software unless your development organization makes use of modern processes and procedures. This includes service ownership responsibilities and development processes.
It doesn’t matter how scalable your application is; you cannot scale your application if your development organization isn’t structured to support it, or if your organization does not have the right culture to drive higher availability and greater scalability.
Organizing your teams to better support your scalability needs will create a culture that supports your application’s scaling needs.
Chapters in Part III include:
· Chapter 6, Service Ownership—STOSA
· Chapter 7, Service Tiers
· Chapter 8, Service-Level Agreements
Tenet 4. Risk: Risk Management for Modern Applications
You cannot remove all risk from a system. It just isn’t possible. All complex systems have inherent risk. Instead, we must learn to manage the risk and use risk as a tool for evaluating technical debt and making decisions on application improvements.
Understanding risk, measuring risk, and prioritizing activities based on measured risk are important tools for building highly scaled, high-availability applications.
Chapters in Part IV include:
· Chapter 9, Using Risk Management When Architecting for Scale
· Chapter 10, Game Days
· Chapter 11, Building Systems with Reduced Risk
Tenet 5. Cloud: Utilizing the Cloud
High availability in a modern application requires nimble scaling. We can no longer afford to have excess infrastructure capacity lying around to meet the peak needs of our application. Instead, we must dynamically allocate and consume infrastructure resources, on demand, based on our current needs.
Dynamic infrastructures, and applications that can support and optimize dynamic infrastructures, are a critical architectural component to building highly scaled, highly available applications.
Dynamic infrastructures are the cornerstone benefit of the public cloud. Utilizing the public cloud is essential to keeping your application highly available at scale.
Chapters in Part V include:
· Chapter 12, Getting Started Architecting for Scale with the Cloud
· Chapter 13, Five Industry Trends Changed by the Cloud
· Chapter 14, Types of SaaS and Tenancy
· Chapter 15, Distributing Your Application in the AWS Cloud
· Chapter 16, Managed Infrastructure
· Chapter 17, Cloud Resource Allocation
· Chapter 18, Serverless and Functions as a Service
· Chapter 19, Edge Computing
· Chapter 20, Geographic Impact on Using the Cloud
These are the five critical tenets to building applications that meet the modern needs of our customers. These tenets form the basis of Architecting for Scale.
The Architecting for Scale website (www.architectingforscale.com) offers additional information about this book, including links to supplementary material. You can find more information about me on my website at www.leeatchison.com, and you can also follow my blog at www.leeatscale.com.
Conventions Used in This Book
The following typographical conventions are used in this book:
This element signifies a tip or suggestion.
This element signifies a general note.
O’Reilly Online Learning
For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
· O’Reilly Media, Inc.
· 1005 Gravenstein Highway North
· Sebastopol, CA 95472
· 800-998-9938 (in the United States or Canada)
· 707-829-0515 (international or local)
· 707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/architecting-for-scale-2e.
Email firstname.lastname@example.org to comment or ask technical questions about this book.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
While there are more people who helped make this book possible than I could possibly ever list here, I do want to mention several people who were particularly helpful to me:
· Ken Gavranovic, the word friend is not sufficient to describe you. Always trust the power of monkeys.
· Bjorn Freeman-Benson, who supported me significantly in the early stages of developing the first edition of this book, and who gave me opportunities at New Relic that helped provide me the insights I needed for this book. I am so glad our friendship has continued past those days we directly worked together.
· Kevin McGuire, who has been a friend and confidant. We started at New Relic together, and it was your foresight and imagination that has helped give my career the needed focus and direction that guides me today.
· Abner Germanow, Darren Cunningham, Jay Fry, Bharath Gowda, and Robson Grieve, who took a chance on me and fought to get me my thought leadership role at New Relic. The days I worked with you all were by far the most fun, rewarding, and personally fulfilling I have ever had. I miss those times greatly. Abner in particular, without you I would not have the career I have today. You guided me into this new role and helped me grow from an engineer and architect into a strategist, pundit, and thought leader. Thank you for believing in me and mentoring me along that path.
· Jim Gochee, who introduced me to the magic that was New Relic, both as a product and eventually as a career.
· Lew Cirne, whose vision has given us New Relic, and me a career and a home. The joy and driven enthusiasm you get after meeting with Lew one on one is highly infectious and hugely empowering. No wonder New Relic is so successful.
· Kevin Downs, my friend and cloud buddy. Say hi to the mouse for me. Oh, and by the way, containers rule.
· Brandon SanGiovanni, my friend. From MLB to Marvel and Mickey, you’ve dealt firsthand with many of the challenges I discuss in this book, and you are still alive and smiling! Thank you for your support, your knowledge, and, most importantly, your friendship.
· Abbas Haider Ali, who is someone I greatly respect. We both have roles as industry thought leaders, and it’s great to have someone to bounce ideas off of and get suggestions from. Your input in early drafts of this book has made it substantially better. Thank you!
· Kurt Kufeld, who mentored me and helped me fit into the weird, chaotic, challenging, draining, and ultimately hugely rewarding world called Amazon.
· Greg Hart, Scott Green, Patrick Franklin, Suresh Kumar, Colin Bodell, and Andy Jassy, who gave me opportunities at Amazon and AWS I could not have ever imagined.
· Brian Anderson, my original editor, and Kathleen Carr, my current editor at O’Reilly. Together, they are responsible for making this book and many other projects at O’Reilly Media happen. Brian made the first edition of this book possible. Kathleen encouraged and enabled me to build the much expanded second edition, along with several courses, trainings, and knowledge sessions.
· Amelia Blevins from O’Reilly, who made substantial editorial suggestions to the format, layout, and content of this expanded second edition of the book. These suggestions made a huge difference in the quality and readability of the book. If you like the new structure of this second edition, you have Amelia to thank for it.
To all of those people who reached out to me after reading the first edition of this book, giving me your praise, encouragement, and suggestions, I thank you for helping to keep me motivated and to give me ideas for the improvements that went into this second edition.
To my family, and especially my wife, Beth, who is my constant light and guide through this life we have together. My days are brighter, and my path is clearer, because she is with me.
To all these people, and all the people I did not mention, my heartfelt thank you.
I can’t end without also mentioning the furry ones: Issie, the snoring spaniel, and Abbey, the joyful corgi. And finally, Budha, the krazy kitty, who contributed more than his share of typos to this book.
[1] I talk about service sizing in greater detail in “Dividing into Services”.