Shift Remote: DevOps: When metrics are not enough, and everyone is on-call - Chris Riley (splunk)

1
When metrics are not enough, and everyone is on-call
Name: Chris Riley
Title: Advocate | DevOps & DevRel
Organization: Splunk
Twitter: @hoardinginfo
Email: criley@splunk.com

2
Chris Riley (@hoardinginfo)
1995 - 2003: Was an IT Pro
2003 - 2009: Tried to be a developer & product manager
2009 - now: Became an Advocate
If you can’t do it, talk about it
• Community engagement
• Increase understanding of market

3
Agenda
• The unicorn told me to do it
• Why should I care?
• What is “SRE” and “Observability”?
• On-Call for Modern Apps

4
The Unicorn Told Me To Do It

8
Gene Kim DOES 2019
The Unicorn Project

11
How Applications Are Being Built Is Changing
Monolithic Architecture
Monitored Environment
● Slow moving
● Infrequently changed
● Limited user transactions
Microservices Architecture
Monitored Environment
● Distributed services (10s to 100s)
● Many hosts, Multi-Cloud
● High transaction volume
● Frequent code-pushes (CI/CD)

13
Aggressive Drive to Modernize
• The cost of downtime is going up
• Latent data is a huge opportunity cost
• Traditional infrastructure is impacting enterprises' ability to compete
• Organizations want confidence they can respond to future crises
• Technical talent requires it

14
I give you … “Monitoring” I mean “Observability”

15
Observability Is:
1. Development and deployment strategy
2. Approach to monitoring applications
3. Tooling to make the added complexity easier

16
Observability When:
1. Infra, Config, and Code are tied together
2. Metrics are not enough
3. Applications are increasingly distributed
4. Application components are stateless and ephemeral

17
The SRE is Observability’s best friend

18
“SRE is about being customer obsessed.”

19
Because latency is the new down.
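One way to read "latency is the new down" is to count a slow request as a failed one when measuring availability. The sketch below is only an illustration of that idea: the `Request` type, the `availability` function, the 500 ms budget, and the sample data are all assumptions, not something the talk specifies.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request


def availability(requests, latency_budget_ms=None):
    """Return the fraction of requests that count as "good".

    A request is good if it did not return a 5xx status and, when a
    latency budget is supplied, it also finished within that budget.
    """
    if not requests:
        return 1.0
    good = 0
    for r in requests:
        ok = r.status < 500
        if latency_budget_ms is not None:
            ok = ok and r.latency_ms <= latency_budget_ms
        good += ok  # True counts as 1, False as 0
    return good / len(requests)


# Hypothetical sample: one request is slow, one fails outright.
sample = [
    Request(200, 120),
    Request(200, 2400),
    Request(200, 90),
    Request(503, 30),
]

print(availability(sample))       # status codes only
print(availability(sample, 500))  # slow requests also count as "down"
```

With the hypothetical sample, the status-only view reports 75% availability, while the latency-aware view drops to 50% because the 2.4 second request is treated as an outage from the customer's point of view.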

20
1. Modernize the NOC
2. Keep pace with release velocity
3. Customers demand more
4. Development teams need an operational partner

21
Before and After the NOC
Network Operations Model
○ Spreadsheets managed who to call
○ 24x7 staffed operations centers
○ NOC's abilities were limited to infra
○ IT focused with little dev experience
○ Spray and pray OR lazy mobilization
SRE Model
○ Automation is mandatory
○ Application layer is part of production support
○ “Anyone” can be on-call
○ Both a Strategy and a Role

22
Responsibility of SRE
• Strategy
• Metrics (RED, USE, etc.)
• Deploy Prep
• Stewardship
• Operations
• Owners of On-Call
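RED is commonly expanded as Rate, Errors, Duration (and USE as Utilization, Saturation, Errors). As a rough sketch of what computing RED for one service over one time window might look like, the example below uses a hypothetical `Request` record and a nearest-rank percentile; the field names and sample values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    is_error: bool      # whether the request failed


def red_metrics(requests, window_s):
    """Compute Rate, Errors, and Duration for one observation window."""
    n = len(requests)
    if n == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0,
                "p50_ms": None, "p95_ms": None}

    durations = sorted(r.duration_ms for r in requests)

    def percentile(p):
        # Nearest-rank percentile using integer math: ceil(p * n / 100).
        rank = max(1, (p * n + 99) // 100)
        return durations[rank - 1]

    return {
        "rate_per_s": n / window_s,                           # Rate
        "error_ratio": sum(r.is_error for r in requests) / n,  # Errors
        "p50_ms": percentile(50),                             # Duration
        "p95_ms": percentile(95),
    }


# Hypothetical window: 10 requests in 10 seconds, one of them failed.
window = [Request(duration_ms=10 * (i + 1), is_error=(i == 3))
          for i in range(10)]
metrics = red_metrics(window, window_s=10)
print(metrics)
```

Reporting percentiles rather than a mean keeps the slowest requests visible, which is usually where customers feel the pain first.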

23
Your app just called… it wants its resources back

24
Observability: Alert & Context
Incident Response: Mobilization and Action
Incident Management: Record and Track

25
Alert → Incident Response → Incident Management

26
Diagram columns: Monitoring / Observability Tool | Incident Response | Notifications
Diagram elements: Alert Fired, Rules Engine, Routing Key, Incident Created, Escalation Policy, Rotation, App User, Paging Policy, Page (alert payload), ITSM, Collaboration, Webhook / Automation
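The elements on this slide (routing key, incident, escalation policy, rotation) can be sketched as a small routing function. This is a hedged illustration, not any vendor's API: `Rotation`, `EscalationPolicy`, `create_incident`, the `"default"` key, the team names, and the 15 minute escalation delay are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rotation:
    members: list  # on-call users, in shift order

    def on_call(self, shift_index):
        # Shifts wrap around the member list.
        return self.members[shift_index % len(self.members)]


@dataclass
class EscalationPolicy:
    steps: list  # (minutes after incident creation, Rotation) pairs

    def who_to_page(self, minutes_unacked, shift_index):
        # Every step whose delay has elapsed gets paged.
        return [rotation.on_call(shift_index)
                for delay, rotation in self.steps
                if minutes_unacked >= delay]


def create_incident(alert, policies, default_key):
    """Use the alert's routing key to pick an escalation policy."""
    key = alert.get("routing_key")
    if key not in policies:
        key = default_key  # unknown or missing keys fall back
    return {"routing_key": key,
            "policy": policies[key],
            "summary": alert.get("message", "")}


# Hypothetical teams and policies.
backend = Rotation(["ana", "bo", "cy"])
leads = Rotation(["dee"])
policies = {
    "checkout": EscalationPolicy([(0, backend), (15, leads)]),
    "default": EscalationPolicy([(0, leads)]),
}

alert = {"routing_key": "checkout", "message": "p95 latency above 2s"}
incident = create_incident(alert, policies, "default")
print(incident["policy"].who_to_page(minutes_unacked=0, shift_index=4))
print(incident["policy"].who_to_page(minutes_unacked=20, shift_index=4))
```

In this sketch the alert payload carries the routing key, so the monitoring tool decides which team hears about a problem and nobody needs a spreadsheet to find out who to call.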

27
Response → Remediation → Resolution
The Typical Incident Lifecycle Is Confusing and Slow
NOC notices problem → NOC pages on-call user → Page is “acked” → Code is deployed → Service is restored
25-45 min | 6 hours / 5 re-routes / 8 people
Simplify Incident Response
Monitoring tool alert → On-call user paged → Page is “acked” → Code is deployed → Service is restored
<2 min (before: 25-45 min) | 2 hours / 0 re-routes / 3 people (before: 6 hours / 5 re-routes / 8 people)
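Figures like the ones on this slide are typically derived from incident timestamps. The sketch below shows one way to compute mean time to acknowledge and mean time to restore; the `mean_durations` function, the dictionary keys, and the two sample incidents are assumptions made up for illustration, not data from the talk.

```python
from datetime import datetime, timedelta


def mean_durations(incidents):
    """Return (mean time to acknowledge, mean time to restore).

    Each incident is a dict with 'alerted', 'acked', and 'restored'
    datetime values. Both durations are measured from the alert.
    """
    n = len(incidents)
    if n == 0:
        raise ValueError("need at least one incident")
    ack = sum((i["acked"] - i["alerted"] for i in incidents), timedelta())
    restore = sum((i["restored"] - i["alerted"] for i in incidents),
                  timedelta())
    return ack / n, restore / n


# Hypothetical incidents.
incidents = [
    {"alerted": datetime(2020, 5, 1, 10, 0),
     "acked": datetime(2020, 5, 1, 10, 1),
     "restored": datetime(2020, 5, 1, 12, 0)},
    {"alerted": datetime(2020, 5, 2, 14, 0),
     "acked": datetime(2020, 5, 2, 14, 2),
     "restored": datetime(2020, 5, 2, 15, 0)},
]

mtta, mttr = mean_durations(incidents)
print("mean time to acknowledge:", mtta)
print("mean time to restore:", mttr)
```

Tracking both numbers separately makes it clear whether time is being lost in mobilizing people or in the remediation itself.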

28
• How Splunk Does SRE: https://www.splunk.com/en_us/blog/it/the-sre-dogfood-series-signalfx-sre-team.html
• Modernize The NOC: https://devops.com/moving-from-noc-to-the-sre-model/
• SRE Strategy Webinar: https://victorops.hubs.vidyard.com/watch/bqyuTmgC48kj9wQizSZ91K
• Developers Eating the World: www.sweetcode.io/detw
• OpenTelemetry Project: https://opentelemetry.io/

29
THANK YOU!
Meet Me in the Network Chat Lounge for Questions
