1
When metrics are not enough, and everyone is on-call
Name: Chris Riley
Title: Advocate | DevOps & DevRel
Organization: Splunk
Twitter: @hoardinginfo
Email: criley@splunk.com
2
Chris Riley
@hoardinginfo
1995 - 2003: Was an IT Pro
2003 - 2009: Tried to be a developer & product manager
2009 - now: Became an Advocate
If you can’t do it, talk about it
• Community engagement
• Increase understanding of market
3
Agenda
• The unicorn told me to do it
• Why should I care?
• What is “SRE” and “Observability”?
• On-Call for Modern Apps
4
The Unicorn Told Me To Do It
5
6
7
What Really Drives Change?
8
Gene Kim DOES 2019
The Unicorn Project
9
10
11
How Applications Are Being Built Is Changing
Monolithic Architecture
Monitored Environment
● Slow moving
● Infrequently changed
● Limited user transactions
Microservices Architecture
Monitored Environment
● Distributed services (10s to 100s)
● Many hosts, Multi-Cloud
● High transaction volume
● Frequent code-pushes (CI/CD)
© 2020 SPLUNK INC.
13
Aggressive Drive to Modernize
• The cost of downtime is going up
• Latent data is a huge opportunity cost
• Traditional infrastructure is impacting enterprises' ability to compete
• Organizations want confidence they can respond to future crises
• Technical talent requires it
14
I give you … “Monitoring” I mean “Observability”
15
Observability Is:
1. Development and deployment strategy
2. Approach to monitoring applications
3. Tooling to make the added complexity easier
16
Observability When:
1. Infra, Config, and Code are tied together
2. Metrics are not enough
3. Applications are increasingly distributed
4. Application components are stateless and ephemeral
17
The SRE is Observability’s best friend
18
“SRE is about being customer obsessed.”
19
Because latency is the new down.
20
1. Modernize the NOC
2. Keep pace with release velocity
3. Customers demand more
4. Development teams need an operational partner
21
Before and After the NOC
Network Operations Model
○ Spreadsheets managed who to call
○ 24x7 staffed operations centers
○ NOC abilities were limited to infra
○ IT focused with little dev experience
○ Spray and pray OR lazy mobilization
SRE Model
○ Automation is mandatory
○ Application layer is part of production support
○ “Anyone” can be on-call
○ Both a Strategy and a Role
22
Responsibility of SRE
• Strategy
• Metrics (RED, USE, etc.)
• Deploy Prep
• Stewardship
• Operations
• Owners of On-Call
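The RED metrics named above (Rate, Errors, Duration) can be sketched for a single time window as follows. This is a minimal illustration, not the speaker's tooling: the `Request` type, the `red_summary` function, and the sample data are all hypothetical, and the choice of a nearest-rank 95th percentile for duration is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    ok: bool            # False if the request failed


def red_summary(requests, window_seconds):
    """Summarize one time window as Rate, Errors, Duration (RED)."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math for ceil(0.95 * n).
    p95_index = (95 * len(durations) + 99) // 100 - 1
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_ms": durations[p95_index],
    }


# Hypothetical sample: ten requests in a 5 second window, one of them failed.
sample = [Request(duration_ms=10.0 * i, ok=(i != 7)) for i in range(1, 11)]
summary = red_summary(sample, window_seconds=5)
print(summary)
```

Tracking duration next to the error ratio is what lets a team treat a slow service as an incident even when no request is failing outright.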
23
Your app just called… it wants its resources back
24
Observability: Alert & Context
Incident Response: Mobilization and Action
Incident Management: Record and Track
25
Alert → Incident Response → Incident Management
26
[Diagram: alert flow across three stages: Monitoring / Observability Tool, Incident Response, Notifications]
Alert Fired
Page (alert payload)
Rules Engine
Routing Key
Incident Created
Escalation Policy
Rotation
App User
Paging Policy
ITSM
Collaboration
Webhook / Automation
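The flow in this diagram (alert payload, rules engine, routing key, escalation policy, rotation, page) can be sketched roughly as below. Everything here is an assumption made for illustration: the `EscalationPolicy` class, the `ROUTES` table, the team and user names, and the rule that falls back to the service name are hypothetical and do not represent any vendor's API.

```python
from dataclasses import dataclass


@dataclass
class EscalationPolicy:
    name: str
    rotation: list  # ordered list of on-call users (hypothetical names)

    def on_call(self, shift_index):
        # The rotation wraps around once every user has taken a shift.
        return self.rotation[shift_index % len(self.rotation)]


# Hypothetical routing table: routing key -> escalation policy.
ROUTES = {
    "checkout": EscalationPolicy("checkout-team", ["ana", "bo"]),
    "search": EscalationPolicy("search-team", ["cy"]),
}
DEFAULT_POLICY = EscalationPolicy("ops-catchall", ["sam"])


def apply_rules(alert):
    """Rules engine step: enrich the payload before routing."""
    enriched = dict(alert)
    # Assumed rule: if no routing key was sent, fall back to the service name.
    enriched.setdefault("routing_key", enriched.get("service", "default"))
    return enriched


def create_incident(alert, shift_index=0):
    """Turn an alert payload into an incident and pick who gets paged."""
    alert = apply_rules(alert)
    policy = ROUTES.get(alert["routing_key"], DEFAULT_POLICY)
    return {
        "summary": alert.get("message", ""),
        "policy": policy.name,
        "page": policy.on_call(shift_index),
    }


incident = create_incident({"service": "checkout", "message": "p95 latency high"}, shift_index=1)
print(incident)
```

Keeping the routing table separate from the rotation is one way to let "anyone" be on-call: teams change who is in a rotation without touching how alerts are routed.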
27
The Typical Incident Lifecycle Is Confusing and Slow
Response (25-45 min): NOC notices problem, NOC pages on-call user, page is “acked”
Remediation and Resolution (6 hours / 5 re-routes / 8 people): code is deployed, service is restored
Simplify Incident Response
Response (<2 min; before: 25-45 min): monitoring tool alert, on-call user paged, page is “acked”
Remediation and Resolution (2 hours / 0 re-routes / 3 people; before: 6 hours / 5 re-routes / 8 people): code is deployed, service is restored
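As a quick check on the before and after figures on this slide, the relative improvement can be worked out directly. This is only arithmetic on the slide's own numbers; the `reduction` helper is hypothetical, and "<2 min" is treated as an upper bound of 2 minutes, so the response figures are lower bounds on the improvement.

```python
def reduction(before, after):
    """Fractional reduction when a value drops from `before` to `after`."""
    return (before - after) / before


# Response time: 25-45 min before, under 2 min after.
response_low = reduction(25, 2)   # measured against the best "before" case
response_high = reduction(45, 2)  # measured against the worst "before" case

# Remediation and resolution: 6 hours and 8 people before, 2 hours and 3 people after.
remediation = reduction(6, 2)
people = reduction(8, 3)

print(f"response time: {response_low:.0%} to {response_high:.0%} shorter")
print(f"remediation time: {remediation:.0%} shorter")
print(f"people involved: {people:.1%} fewer")
```

On these numbers, response time drops by at least 92 percent, remediation time by two thirds, and the number of people pulled in by more than 60 percent.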
28
• How Splunk Does SRE: https://www.splunk.com/en_us/blog/it/the-sre-dogfood-series-signalfx-sre-team.html
• Modernize The NOC: https://devops.com/moving-from-noc-to-the-sre-model/
• SRE Strategy Webinar: https://victorops.hubs.vidyard.com/watch/bqyuTmgC48kj9wQizSZ91K
• Developers Eating the World: www.sweetcode.io/detw
• OpenTelemetry Project: https://opentelemetry.io/
29
THANK YOU!
Meet Me in the Network Chat Lounge for Questions

Shift Remote: DevOps: When metrics are not enough, and everyone is on-call - Chris Riley (splunk)
