Confidential ©2008–17 New Relic, Inc. All rights reserved 1
Welcome
©2008–18 New Relic, Inc. All rights reserved
SRE-iously!
Defining the Principles, Habits, and Practices
of Site Reliability Engineering
Tori Wieldt, Developer Advocate, 08.06.2018
Tori Wieldt
Developer Advocate
@ToriWieldt
linkedin.com/in/toriwieldt
At the booth
Howdy Campers!
The SRE Handbook
But what if you’re not Google?
A Friendly Aside
The new Google SRE workbook
Free until August 23
landing.google.com/sre/book.html
A Little Background
From
Ruby Monolith
Siloed teams
Infrequent Releases
To
200+ Microservices
50+ Engineering Teams with embedded SREs
20-70 Deploys a Day
How it was
[Architecture diagram; labels: Customers, Browser, Mobile Apps, API, Microservices, Public Cloud, NoSQL Data Store, On-Premises, Relational Data]
We Asked Our Stakeholders
Why do we have SREs at New Relic?
What’s the vision for our SRE team?
How can SREs most effectively contribute to the future of our platform?
One Goal
Continuously improve the reliability of systems in the New Relic platform
Two Roles
“Pure” SRE
Build and support our core internal platform:
Container Fabric
Networking Systems
Embedded SRE
Partner with Eng Teams
Domain Experts in:
Reliability
Tooling
Scaling
Three Spheres
Stability
Reliability
Engineering
What SREs Do
Champion reliability best practices.
Guide designs and processes with an eye toward resilience and low toil.
Reduce technical complexity and sprawl.
Drive the usage of tooling and common components.
Implement software and tooling to improve resilience and automate operations.
SRE Tasks
Work with teams to adopt operational best practices
● Work with teams to update their risk matrices; audit for missing or outdated runbooks; influence teams to prioritize the most important reliability work.
● Work with teams to hold “game days” to test the resilience of their systems against injected fault conditions.
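One way to picture a game day is a small fault-injection harness. The sketch below is an illustration only: `fetch_profile`, `get_profile_with_fallback`, and the 30% failure rate are hypothetical names and values, not New Relic tooling. It wraps a dependency so that some calls fail on purpose, then checks that the caller degrades instead of crashing.

```python
import random


class InjectedFault(Exception):
    """Raised by the injector in place of a real dependency failure."""


def inject_faults(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_profile(user_id):
    """Hypothetical downstream dependency."""
    return {"user_id": user_id, "plan": "pro"}


def get_profile_with_fallback(fetch, user_id):
    """Caller under test: return a degraded default when the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFault:
        return {"user_id": user_id, "plan": "unknown", "degraded": True}


def run_game_day(calls=1000, failure_rate=0.3, seed=42):
    """Drive the caller through injected faults and count the outcomes."""
    rng = random.Random(seed)
    flaky_fetch = inject_faults(fetch_profile, failure_rate, rng)
    degraded = unhandled = 0
    for user_id in range(calls):
        try:
            result = get_profile_with_fallback(flaky_fetch, user_id)
        except Exception:
            unhandled += 1  # the caller failed to contain the fault
            continue
        if result.get("degraded"):
            degraded += 1
    return {"calls": calls, "degraded": degraded, "unhandled": unhandled}
```

A run passes when `unhandled` stays at zero; the `degraded` count shows how often the fallback path was exercised.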
Stay current on our pipeline and build process, and know the top risks for their team(s)
● Meet with architects and SREs on other teams to discuss concerns and changes.
● Use state-of-production knowledge to guide team risk matrices, operational processes, and priorities.
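A risk matrix can be as simple as a likelihood and impact score per risk. The sketch below assumes a 1 to 5 scale for both and a product score; the scale, the field names, and the sample risks are illustrative assumptions rather than New Relic's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int    # 1 (rare) to 5 (frequent); assumed scale
    impact: int        # 1 (minor) to 5 (severe); assumed scale
    has_runbook: bool

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def prioritize(risks):
    """Highest score first; on ties, risks without a runbook come first."""
    return sorted(risks, key=lambda r: (-r.score, r.has_runbook))


# Hypothetical sample entries for illustration.
risks = [
    Risk("Primary database failover", likelihood=2, impact=5, has_runbook=True),
    Risk("Queue consumer lag", likelihood=4, impact=3, has_runbook=False),
    Risk("Expired TLS certificate", likelihood=2, impact=5, has_runbook=False),
]

for risk in prioritize(risks):
    note = "" if risk.has_runbook else " (runbook missing)"
    print(f"{risk.score:>2}  {risk.name}{note}")
```

Sorting by score and then by missing runbooks puts the two audits named above (risk matrices and runbooks) into one prioritized list.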
More SRE Tasks
Build, or help teams adopt, core shared internal platform components
● Work with teams to migrate systems into a new version of our shared deployment pipeline.
● Contribute code or tools to our container runtime platform.
● Limit technical sprawl by guiding teams to select appropriate existing tools rather than building new ones.
Improve the monitoring and observability of the New Relic platform
● Work with teams to clean up noisy or unused alerts and ensure that important problems are alerted on.
● Build an integration to our software to create new visibility into our platform.
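An alert cleanup can start from simple counts. The sketch below assumes a hypothetical `AlertStats` record per alert (how often it fired and how often someone acted on it in a review window) and assumed thresholds; it does not use any real alerting API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertStats:
    name: str
    fired: int     # times the alert fired in the review window
    actioned: int  # times a responder took action on it


def classify(alert, noisy_min_fires=20, min_action_ratio=0.1):
    """Label an alert 'unused', 'noisy', or 'ok' using assumed thresholds."""
    if alert.fired == 0:
        return "unused"  # never fired: candidate for removal or review
    action_ratio = alert.actioned / alert.fired
    if alert.fired >= noisy_min_fires and action_ratio < min_action_ratio:
        return "noisy"   # fires often but almost never leads to action
    return "ok"


def cleanup_candidates(alerts):
    """Return the names of alerts worth reviewing, grouped by label."""
    report = {"unused": [], "noisy": []}
    for alert in alerts:
        label = classify(alert)
        if label in report:
            report[label].append(alert.name)
    return report
```

The thresholds are the part a team would tune; the point is that "noisy" and "unused" become reviewable lists rather than impressions.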
Even More SRE Tasks
Implement automation, tooling, and application code to improve reliability and reduce toil.
● Identify a commonly used manual runbook and automate it with software.
● Identify a common failure pattern for new deployments and implement a system to automatically detect and roll back that type of failed deploy.
● Work with teams on the design of new services to ensure those services will be scalable and robust.
● Update an application’s DB connection pool to use a more reliable library.
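The detect-and-roll-back task can be sketched as a comparison of post-deploy error rates against a baseline. The thresholds, the three-sample rule, and the `rollback` callback below are assumptions for illustration, not a description of New Relic's deployment pipeline.

```python
def should_roll_back(baseline_error_rate, error_rates,
                     max_ratio=2.0, min_absolute=0.01, consecutive=3):
    """True when the last `consecutive` samples all exceed the threshold.

    The threshold is the larger of a multiple of the baseline and an
    absolute floor, so a near-zero baseline does not trigger on noise.
    """
    threshold = max(baseline_error_rate * max_ratio, min_absolute)
    recent = error_rates[-consecutive:]
    return len(recent) == consecutive and all(rate > threshold for rate in recent)


def watch_deploy(baseline_error_rate, samples, rollback):
    """Consume error-rate samples in order; call rollback() on the first breach."""
    seen = []
    for rate in samples:
        seen.append(rate)
        if should_roll_back(baseline_error_rate, seen):
            rollback()  # hypothetical hook into the deploy tooling
            return "rolled_back"
    return "healthy"
```

Requiring several consecutive bad samples trades a slower reaction for fewer false rollbacks; a real system would choose that balance per service.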
Mentor less senior SREs and grow the SRE community and practice at our company
● Have a meeting, or lunch, once a week with a less senior SRE to discuss work challenges and solutions.
● Pair with other SREs experiencing problems you’ve previously encountered or solved.
● Document and share novel solutions and other effective strategies.
And Lastly
Perform task-based operational work (toil)
● Unblock teams with operational needs where automated or self-service solutions do not yet exist.
● Track down hardware defects on servers.
● Provision new network endpoints.
● Run Ansible playbooks.
Keys to SRE Success
Reliability is a feature: Query your stakeholders
Reliability depends on shared understanding: Develop clear, specific guidelines
SRE is a challenging, cross-disciplinary practice: Build a strong SRE community
1. Determine Your Goal
Example: Continuously improve the reliability of the systems of our company’s platform.
2. Establish Roles
Examples: Pure SRE, Embedded SRE
3. Focus Areas
Examples: Stability, Reliability, Engineering
Thank You
@ToriWieldt

Editor's Notes

  • #5  @ToriWieldt
  • #6 The Site Reliability Engineering book serves as a fantastic point of reference.
  • #8 New Relic went from a Ruby monolith, to a Ruby front-end and a Java backend, to microservices.
  • #10 Reliability is ultimately about the customer experience.
  • #11 All SREs at New Relic have one common goal
  • #13 "Stability" refers to the operational aspect of systems. "Reliability" moves away from "keeping things alive" and towards "iteratively improving reliability." "Engineering" takes us into the realm of supporting the entire software lifecycle, and here we clearly diverge from the traditional ops role.