Site Reliability Engineering: How Google Runs Production Systems

Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff

Key Insights from the Book:

  1. Site Reliability Engineering (SRE) is Google's innovative approach to IT operations, aiming to keep systems up and running while allowing for constant updates and improvements.
  2. At its core, SRE is about balancing risk — the risk of system instability against the risk of stifling innovation.
  3. The book introduces the error budget, the amount of unreliability a service is allowed before it misses its reliability target, as a way to measure reliability and to decide when it is safe to push new changes.
  4. The 'Four Golden Signals' — Latency, Traffic, Errors, and Saturation — are key metrics in monitoring system health.
  5. SRE emphasizes automation to eliminate toil and improve...
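
The error budget in point 3 comes down to simple arithmetic, which a short sketch may make easier to follow. The Python below is only an illustration under stated assumptions: the class name `ErrorBudget`, its fields, and the rule "release only while budget remains" are hypothetical choices made for this example, not code or policy quoted from the book. It assumes reliability is measured as the fraction of successful requests in a fixed window.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one measurement window."""

    slo_target: float      # assumed target, e.g. 0.999 means 99.9% of requests succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that failed in the same window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests permitted to fail: (1 - SLO) * total.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Positive: budget left to spend. Negative: the target has been missed.
        return self.allowed_failures - self.failed_requests

    def can_release(self) -> bool:
        # Assumed policy for illustration: push changes only while budget remains.
        return self.remaining > 0


if __name__ == "__main__":
    healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    exhausted = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)
    for name, budget in (("healthy", healthy), ("exhausted", exhausted)):
        print(f"{name}: remaining={budget.remaining:.0f}, can_release={budget.can_release()}")
```

With a 99.9% target over one million requests, about 1,000 failures are allowed. A service that has failed 400 requests still has budget to spend on risky changes, while one that has failed 1,500 has overspent and, under the assumed policy, would hold releases until reliability recovers.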


Vinay Hegde

Senior DevOps Engineer, Drip Capital

Enterprise Architect, ING