TRACK: SITE RELIABILITY ENGINEERING
NOVEMBER 12, 2020
Deepak Ramchandani, Contino
Driving Digital Transformation through CloudOps and SRE
Deepak Vensi
Account Principal @ Contino
linkedin.com/in/deepakrv/
● Multi-Cloud adoption within Regulated Industries
● Advocate & build in-house mature Engineering practices with a focus on SRE
● Changing the Operating Model with a focus on FinOps, GitOps & DevSecOps
● Help develop Cloud Native Products and Services for Enterprises
● Build sustainable in-house digital capabilities for long term adoption

Agenda
• Core functions of IT Operations
• What does Operations mean in the Cloud?
• Why SRE?
• How can you bring Reliability / Docs / Code / Controls together?
• How to upskill a team towards SRE

Core Functions of Traditional IT Operations
• Incident Management
• Change Management
• Release Management
• Configuration Management
• Capacity Management
• Business Continuity & Backup
• Asset Management
• Demand Management
• Knowledge Management
• Risk Management

What does a “Pipeline” look like?
[Figure: a traditional delivery pipeline drawn as a chain of stages with a queue in front of each one. Stage labels: Prereqs, Request, Define, Qualify, Materiality & Risk, Data Assessment, BC/Resilience, Architecture, Secure @ Design, Build in DEV, Secure @ Build, Service Transition, Endorse, Promote, Build in PROD.]

What do IT Ops really want?
• Reliability
• Security
• Operability
• Predictability
• “Control”

The Gap to Bridge
[Figure labels: Traditional IT Ops; Public Cloud & Digital; New Modes of Operations]

SRE to the Rescue!
Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.[1] The main goals are to create scalable and highly reliable software systems. According to Ben Treynor, founder of Google's Site Reliability Team, SRE is "what happens when a software engineer is tasked with what used to be called operations."

The four key pillars to CloudOps
SRE | FinOps | GitOps | DevSecOps

SRE to the Rescue!
SRE Stepping stones: lay the foundation first
Reliability Hierarchy of Needs
• Review architecture for reliability
• SLx
• Monitoring & Alerting
• Testing & Release
• Capacity planning
• Post Mortem/RCA
• Development
• Product
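
The "SLx" layer of the hierarchy (SLIs, SLOs and the error budget they imply) can be made concrete with a little arithmetic. The following is a minimal sketch, not something taken from the talk: the `Slo` class, the function names and the request-based availability SLI are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A service level objective over a fixed window (hypothetical structure)."""
    target: float      # e.g. 0.999 means 99.9% of requests must succeed
    window_days: int   # length of the evaluation window in days


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return good_requests / total_requests


def error_budget_remaining(slo: Slo, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    actual_failures = total_requests - good_requests
    if actual_failures == 0:
        return 1.0
    allowed_failures = (1.0 - slo.target) * total_requests
    if allowed_failures == 0:
        return float("-inf")  # a 100% target leaves no budget at all
    return 1.0 - actual_failures / allowed_failures


def allowed_downtime_minutes(slo: Slo) -> float:
    """Total downtime the SLO tolerates over its window, in minutes."""
    return (1.0 - slo.target) * slo.window_days * 24 * 60


if __name__ == "__main__":
    slo = Slo(target=0.999, window_days=30)
    print(availability_sli(999_500, 1_000_000))
    print(error_budget_remaining(slo, 999_500, 1_000_000))
    print(allowed_downtime_minutes(slo))
```

With a 99.9% target over 30 days the budget is roughly 43 minutes of full downtime. Expressing the budget as a remaining fraction gives product and platform teams one shared number to decide whether to ship features or spend time on reliability.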

Fear not!
You Build it you Run it
Design - Build - Run
Act like a Developer, Think like a Systems Operator

Roles and Responsibilities
Site Reliability Engineer
“Act like a Developer, think like a Systems Operator”.
1. Consult with Consumers
2. Create Service Level Indicators & Objectives
3. Design, Build & Run Platforms / Services / Apps
4. Eliminate Toil via Automation
5. Monitor Distributed Systems & Develop Products
6. Postmortems & Learn from Failure
7. Manage Platform & Consumer Incidents
8. Track Platform Outages
9. Emergency Response & On-call Support
How to get started?

Team Topology
Platform Team(s): Cloud Platforms (Landing Zones | API Gateway | Patterns | Shared Services)
• Customers: Internal Product/App teams
• Reliability: Quite high!
Stream Aligned Team(s): Team 1 | Application 2 | Product 3 | Service 4
• Customers: External facing services
• Reliability: Service dependent
Developer Experience Enablement Team

Team Topology
Mandates for Success
● Define the “Team API” for Platform Teams
● Cloud Platforms enable the demand from the stream-aligned teams; they do not dictate it
● Platform Teams' roadmaps and priorities are public and open
● Platform Teams have wiki-based documentation available for consumption
● Team dependencies should be reduced to enable Flow
● Developer experience metrics should be tracked for Platform Teams, along with the 4 key metrics
● Technical Decisions have to be made based on Team Cognitive Load
● The Three key Interaction Models: Facilitate, Collaborate, X as a Service

Code-Docs-Controls-Reliability
The good stuff
• Architectural Decision Record (ADR)
• Documentation as code
• Dashboards
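
One way to treat an Architectural Decision Record as documentation as code is to generate it from structured data and commit it next to the source. The sketch below is an illustration only: the `Adr` class, its field names and the numbered file naming are assumptions, and the section headings follow the commonly used Status / Context / Decision / Consequences layout, not anything specified in the talk.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Adr:
    """An Architectural Decision Record held as data (hypothetical structure)."""
    number: int
    title: str
    status: str
    context: str
    decision: str
    consequences: str
    decided_on: date

    def filename(self) -> str:
        """Build a sortable file name such as 0007-use-error-budgets.md."""
        slug = "-".join(self.title.lower().split())
        return f"{self.number:04d}-{slug}.md"

    def render(self) -> str:
        """Render the record as Markdown so it can be reviewed like code."""
        return "\n".join([
            f"# {self.number}. {self.title}",
            "",
            f"Date: {self.decided_on.isoformat()}",
            "",
            "## Status", "", self.status, "",
            "## Context", "", self.context, "",
            "## Decision", "", self.decision, "",
            "## Consequences", "", self.consequences, "",
        ])


if __name__ == "__main__":
    adr = Adr(
        number=7,
        title="Use error budgets",
        status="Accepted",
        context="Release decisions are made without a shared reliability signal.",
        decision="Each service publishes an SLO and tracks its error budget.",
        consequences="Feature work pauses when the budget is exhausted.",
        decided_on=date(2020, 11, 12),
    )
    print(adr.filename())
    print(adr.render())
```

Because the record is plain text in the repository, changes to a decision go through the same review and history as changes to the code it governs.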

[Figure: three operating modes, Mode 1 “Traditional Operations”, Mode 2 “Distributed Ops” and Mode 3 “Decentralised Ops”. Each mode shows how Platform Operations, the Cloud Platform Team, Applications Engineering and Applications Operations are arranged. Two of the arrangements are labelled STRATEGIC and one TRANSITIONAL; the figure also carries a Cloud Managed Services label.]

How to get organised and started?
• Kitchen Sink, a.k.a. “Everything SRE”
• Infrastructure / Platforms
• Tools
• Product/application
• Embedded
• Consulting
https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started

What to Measure
• Infrastructure / Platforms
  • Developer Experience / Developer Velocity
  • Function - Stability - Ease of Use - Clarity
• Product/application
  • DORA Metrics:
  • Lead Time for Changes - Change Failure Rate - Time to Restore - Deployment Frequency - Availability
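
The four delivery metrics named above can be computed from a plain list of deployment records. This is a rough sketch under stated assumptions: the `Deployment` record and its fields are hypothetical, medians are used as the summary statistic, and time to restore is measured from the failing deployment to the restoration time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Summarise deployment frequency, lead time, failure rate and restore time."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    ]
    failures = [d for d in deployments if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "lead_time_hours_median": median(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        # Assumption: report 0.0 when nothing failed in the period.
        "time_to_restore_hours_median": median(restore_hours) if restore_hours else 0.0,
    }


if __name__ == "__main__":
    t0 = datetime(2020, 11, 1, 9, 0)
    history = [
        Deployment(t0, t0 + timedelta(hours=2)),
        Deployment(t0, t0 + timedelta(hours=4), failed=True,
                   restored_at=t0 + timedelta(hours=5)),
        Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=6)),
        Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=8)),
    ]
    for name, value in dora_metrics(history, period_days=7).items():
        print(f"{name}: {value:.2f}")
```

Availability is left out of this sketch because it is usually derived from request or uptime data, not from deployment records.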

How to upskill a team towards SRE

Get the basics right
• Find out what type of SRE team you are
• Start with a small product / platform
• Identify your customers
• Figure out what you need to measure
• Agree the metrics WITH your customers
• Measure - Improve - Fail - Improve - Measure

How to scale
• Align engineering OKRs to the organisation metrics
• Start running chaos engineering practices across teams for product improvements
• Start benchmarking your SRE maturity
• Use engineering mitosis to scale SRE practices

What is a Game Day?
A game day simulates a failure or event to test systems, processes, and team responses. The purpose is to actually perform the actions the team would perform if an exceptional event happened.
● To test how resilient the current systems are, and whether adequate processes are in place to support them.
● To build the "muscle memory" of the team(s) on how to respond if an exceptional event happens.
● To evaluate a team's ability to design, build and run systems, taking into consideration the well-architected pillars of operations, security, reliability, performance, and cost.
● To test all aspects of one’s business for Reliability Readiness; specifically operations, test, development, security, business operations, and business leaders.
● Game Days are run to improve the availability of a system, with the goal of increasing reliability by purposefully creating major failures on a regular basis.
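
At its smallest scale, the idea of purposefully creating failures can be rehearsed in code before it is rehearsed with people. The sketch below is an assumption-laden illustration, not a tool from the talk: `with_fault_injection`, `fetch_price_with_fallback` and `run_game_day` are hypothetical names, and the "dependency" is a stand-in function that always returns a live value.

```python
import random
from typing import Callable


class DependencyDown(Exception):
    """Raised by the injected fault to mimic an unavailable dependency."""


def with_fault_injection(
    call: Callable[[], str], failure_rate: float, rng: random.Random
) -> Callable[[], str]:
    """Wrap a dependency call so that it fails with the given probability."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise DependencyDown("injected failure")
        return call()
    return wrapped


def fetch_price_with_fallback(fetch: Callable[[], str], cached: str) -> str:
    """Behaviour under test: serve a cached value when the dependency fails."""
    try:
        return fetch()
    except DependencyDown:
        return cached


def run_game_day(requests: int, failure_rate: float, seed: int = 0) -> dict[str, int]:
    """Send requests through the faulty dependency and count how each was served."""
    rng = random.Random(seed)  # seeded so the exercise is repeatable
    flaky = with_fault_injection(lambda: "live", failure_rate, rng)
    results = {"live": 0, "cached": 0}
    for _ in range(requests):
        results[fetch_price_with_fallback(flaky, cached="cached")] += 1
    return results


if __name__ == "__main__":
    print(run_game_day(requests=100, failure_rate=0.3))
```

If every request is still answered, either live or from cache, the fallback path works. A real game day would add the human side: who is paged, which runbook is followed, and how long recovery takes.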

● https://landing.google.com/sre/
● https://www.gremlin.com/blog
● Reliability Engineering at the Core of Continuous Innovation
● Boost Your Apps With An SRE Approach to Development
