Google has made its book Site Reliability Engineering available for free online, and it’s a terrific resource. Reading it was a nice supplement to some of the lessons I learned from The DevOps Handbook and The Phoenix Project.
The ideas from the book I most want to apply in my own work are error budgets and Service Level Objectives (SLOs). (What most people call SLAs, or Service Level Agreements, are better described as SLOs, since an agreement implies an actual contract with penalties for poor performance.) An error budget is the rate at which an SLO can be missed: the level of errors, latency, and so on that a service can tolerate. 100% availability is impossible, and increasing availability often comes at the price of slower feature development. The role of an SRE is often to manage this trade-off between reliability and new feature development. Going over the error budget means a team needs to slow down and make its systems more reliable. Not using enough of the error budget may mean a team is too cautious and should be more aggressive about releasing new features.
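To make the error budget idea concrete for myself, here is a minimal Python sketch of the arithmetic for a request-based availability SLO. It is my own illustration, not code from the book: the function names `error_budget` and `budget_remaining` and the sample numbers are all made up, and it assumes the SLO is measured as the fraction of successful requests over some window.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of requests that may fail in the window without missing the SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0 <= slo_target <= 1:
        raise ValueError("slo_target must be between 0 and 1")
    return round((1 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no room for any failure at all.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / budget


if __name__ == "__main__":
    # Assumed numbers: a 99.9% SLO over one million requests.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

With these assumed numbers, the budget is 1,000 failed requests, and 250 failures leave 75% of it unspent. A negative result would be the signal to slow down releases and work on reliability, while a budget that is barely touched suggests there is room to ship faster.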
Another portion of the book I particularly enjoyed was the chapter on eliminating toil. The most common attributes of toil are that it is manual, repetitive, automatable, tactical (i.e. reactive and interrupt-driven instead of proactive and strategic), lacking in enduring value, and that it scales O(n), that is, linearly, as the service grows. I can think of a few such tasks in my own work and already have a couple of ideas for how to automate them away.
This is a handy book for anyone interested in learning more about DevOps or SRE. The section on practices is particularly useful because of its case studies, which apply the SRE mindset to a variety of technical problems. Although most engineering teams will not face challenges on the same scale as Google, the principles and practices discussed in this book still apply.