© 2020 SPLUNK INC.
10 Murphy’s Laws of
Observability
And related guests
Dave McAllister
Senior Technical Evangelist
Dave McAllister
“Whatever can go wrong, will go wrong”
“Whatever can go wrong, will go wrong
at the worst possible time”
There are lots of Murphy’s categories
On Cooking
On Cars
On Physics
On Measurements
On Vacations
• Murphy’s Technology Laws
• Murphy’s Military Laws
• Murphy’s Laws on Love and Sex
And spin-offs
• Abbott’s Admonitions
• Allen’s Axioms
Murphy’s for Observability #1
If you perceive that there are
four possible ways in which a
procedure can go wrong, and
circumvent these, then a fifth
way, unprepared for, will
promptly develop.
A Brief View of Observability
TL;DR: Observability is a quality of software, services, platforms, or products that
allows us to understand how systems are behaving.
For Engineering purposes: Designing / defining the exposure of state variables in
a manner to allow inference of internal behavior
Observability is a Data Problem
The more observable a system, the quicker we can understand why it’s acting up and fix it
Metrics: Do I have a problem? (DETECT)
Traces: Where is the problem? (TROUBLESHOOT)
Logs: Why is the problem happening? (ROOT CAUSE)
Full-Stack Visibility & Context-Rich Insights
Murphy’s for Observability #2
DATA
Every Solution Breeds
New Problems
Cynefin Framework
Complex (Emergent): Probe, Sense, Respond
Complicated (Good Practice): Sense, Analyze, Respond
Chaotic (Novel): Act, Sense, Respond
Simple (Best Practice): Sense, Categorize, Respond
Disorder
Observability Challenges
• Microservices create complex interactions.
• Failures don't exactly repeat.
• Debugging multi-tenancy is painful.
• Monitoring alone can no longer save us.
Microservices
Elastic and Ephemeral
Murphy’s for Observability #3
You can never run out of
things that can go wrong
Observability Allows Us to Monitor For the Unknown Unknowns
Today’s knowns are yesterday’s unknowns
Known knowns: Things we are aware of AND understand
Known unknowns: Things we are aware of but DON’T understand
Unknown knowns: Things we are NOT aware of but understand
Unknown unknowns: Things we are NOT aware of and DON’T understand
Monitoring
Observability
Murphy’s for Observability #4
Nothing is as easy as it
looks
EXAMPLE MICROSERVICE ARCHITECTURE
Complexity
Drift and Skew
Ephemeral Behavior
Cloud-compute Elasticity
Murphy’s for Observability #5
Things get worse under
pressure
All about scale
• Kubernetes objects
• Backend services
• Deployed microservices
• Frequency of deployments
• Dimensions (e.g. pod labels) and high-cardinality
• Streaming vs batch & query analytics
• Alerting on multiple metric time series
Image source: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
The Scalability Envelope
System scale is multi-dimensional
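To make the multi-dimensional point concrete, here is a minimal sketch of how dimensions multiply into time series. The label names and counts are hypothetical, and the calculation assumes the worst case in which every label combination actually occurs.

```python
from math import prod

def max_time_series(metric_count: int, label_cardinalities: dict) -> int:
    """Upper bound on distinct time series: metrics x product of label cardinalities."""
    return metric_count * prod(label_cardinalities.values())

# Hypothetical dimensions for one Kubernetes workload (illustrative numbers only).
labels = {"pod": 50, "container": 2, "endpoint": 20, "status_code": 5}

baseline = max_time_series(10, labels)

# Adding one high-cardinality dimension multiplies the total instead of adding to it.
with_customer = max_time_series(10, {**labels, "customer_id": 1000})

print(f"baseline series:         {baseline:,}")
print(f"with customer_id label:  {with_customer:,}")
```

One extra dimension turns a manageable series count into one that stresses storage, queries, and alerting at the same time.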
Murphy’s for Observability #6
If it is not in the
computer, it doesn’t
exist
Sampling vs. No Sampling
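A small sketch of why the comparison matters, assuming a simple "keep every Nth trace" sampler and synthetic trace data. Neither the sampler nor the error positions come from the talk; they are chosen only to show how rare events can vanish from sampled data.

```python
def sample_every_nth(traces, n):
    """Keep every n-th trace (a simple, deterministic stand-in for head sampling)."""
    return [t for i, t in enumerate(traces) if i % n == 0]

# Synthetic data: 1,000 traces, 3 of them errors at hypothetical positions.
error_positions = {137, 512, 904}
traces = [{"id": i, "error": i in error_positions} for i in range(1000)]

sampled = sample_every_nth(traces, 10)  # keep 10% of traces

errors_all = sum(t["error"] for t in traces)
errors_sampled = sum(t["error"] for t in sampled)

print(f"no sampling: {errors_all} errors in {len(traces)} traces")
print(f"sampling:    {errors_sampled} errors in {len(sampled)} traces")
```

With these positions the sampled set contains no errors at all, so as far as the tooling is concerned they never happened.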
Murphy’s for Observability #7
Availability is a function
of time
The resolution and speed of the
data directly impact the insights
you gain
Discussing accuracy and precision
Interchangeable?
• Accuracy means the measurement is close to the true value
• Precision means repeated measurements are consistent with one another
Observability depends on both
But aggregation and analysis can skew both
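A minimal sketch of the distinction, using made-up readings against a hypothetical true value. Bias (distance of the mean from the truth) stands in for accuracy, and spread (standard deviation) stands in for precision.

```python
from statistics import mean, pstdev

TRUE_VALUE = 100.0  # hypothetical true latency in ms

def describe(readings, true_value=TRUE_VALUE):
    """Return (bias, spread): bias reflects accuracy, spread reflects precision."""
    return mean(readings) - true_value, pstdev(readings)

accurate_not_precise = [90.0, 110.0, 95.0, 105.0]    # centered on the truth, scattered
precise_not_accurate = [119.0, 121.0, 120.0, 120.0]  # tightly grouped, but offset

bias_a, spread_a = describe(accurate_not_precise)
bias_p, spread_p = describe(precise_not_accurate)

print(f"accurate, not precise: bias={bias_a:.1f}, spread={spread_a:.2f}")
print(f"precise, not accurate: bias={bias_p:.1f}, spread={spread_p:.2f}")
```

The two terms are not interchangeable: the first set averages to the right answer but scatters widely, while the second agrees with itself and is consistently wrong.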
Missing the point
10 sec: average = 13.9, 95th percentile = 27.05
First 5 sec: average = 16.4, 95th percentile = 29.2
Second 5 sec: average = 11.4, 95th percentile = 19.4
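The same effect can be reproduced with a short sketch. The readings below are hypothetical, not the data behind the slide, and the percentile uses linear interpolation, which is only one of several common definitions.

```python
from statistics import mean

def percentile(values, p):
    """Linearly interpolated percentile (one of several common definitions)."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

# Hypothetical latencies, one reading per second.
first_half = [10, 12, 40, 14, 14]
second_half = [10, 11, 9, 10, 10]
full_window = first_half + second_half

windows = [
    ("10 sec", full_window),
    ("first 5 sec", first_half),
    ("second 5 sec", second_half),
]
for name, window in windows:
    print(f"{name}: average = {mean(window):.1f}, "
          f"95th percentile = {percentile(window, 95):.2f}")
```

The 10-second numbers blend a rough first half with a calm second half, so neither half is described well by the longer window.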
Data resolution ≠ Reporting resolution
• But both can be problematic
• Always deliver all data points regardless of reporting
• Finer granularity means more potential precision
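One way to picture the difference is a rollup that reports at a coarser resolution while the raw points stay available. This is a sketch with hypothetical 1-second readings and an assumed 5-second reporting window; the `rollup` helper is illustrative, not any product's API.

```python
def rollup(points, window):
    """Group raw points into fixed-size windows, keeping several aggregates per window."""
    out = []
    for start in range(0, len(points), window):
        chunk = points[start:start + window]
        out.append({
            "count": len(chunk),
            "min": min(chunk),
            "max": max(chunk),
            "mean": sum(chunk) / len(chunk),
        })
    return out

raw = [10, 12, 40, 14, 14, 10, 11, 9, 10, 10]  # hypothetical data resolution: 1 second

report = rollup(raw, 5)  # reporting resolution: 5 seconds; raw is left untouched

for i, row in enumerate(report):
    print(f"window {i}: {row}")
```

Keeping min, max, and count next to the mean preserves the spike at reporting time, and retaining `raw` means a finer view can still be recomputed later.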
Murphy’s for Observability #8
If anything cannot go
wrong, it will anyway
Facets of Technology
Web User, Mobile User
Frontend: Synthetic Monitoring, RUM, Synthetics, User Monitoring, Endpoint Monitoring
Backend: Infra. Monitoring, Incident Response, APM, Code Profiling, Dashboards, Events and Logs
Environments: On-prem servers, Cloud, Network, VM, Container, Serverless
Packaged Apps, Microservices
Supply Chain, Online Services, Digital Experience
Network Performance Monitoring
Aggregation, Analysis, Visualization, Response
Murphy’s for Observability #9
Whenever you set out to
do something,
something else must be
done first.
From Observability 1.0 to 2.0
Thanks to Kevin Brockhoff
What is OpenTelemetry?
OpenTracing + OpenCensus = OpenTelemetry
OpenTelemetry: the next major version of both OpenTracing and OpenCensus
Ashley-Perry Statistical Axiom
Numbers are tools, not
rules
Predictive behavior
Sometimes you want to know what’s
coming
• Prediction is only as good as the data
precision and accuracy
• Historic versus Sudden Change
• (Trend) Stationary
• Expect false positives (and negatives)
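A minimal sketch of trend-based prediction, assuming the series is trend-stationary (a straight line plus noise). The history values, the `k` threshold, and the helper names are all hypothetical; this is a least-squares line with a residual-based band, not a recommended production detector.

```python
from statistics import mean, pstdev

def fit_trend(ys):
    """Ordinary least-squares line through (0, ys[0]), (1, ys[1]), ..."""
    xs = range(len(ys))
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def is_anomaly(history, actual, k=3.0):
    """Flag `actual` if it is more than k residual standard deviations off the trend."""
    slope, intercept = fit_trend(history)
    residuals = [y - (slope * x + intercept) for x, y in enumerate(history)]
    predicted = slope * len(history) + intercept
    return abs(actual - predicted) > k * pstdev(residuals), predicted

# Hypothetical history: steady growth of roughly 2 units per step, with small noise.
history = [100, 102, 104, 107, 108, 110, 112, 113, 116, 118]

flag_normal, predicted = is_anomaly(history, 120)
flag_spike, _ = is_anomaly(history, 135)

print(f"predicted next value: {predicted:.2f}")
print(f"120 flagged: {flag_normal}, 135 flagged: {flag_spike}")
```

The band is only as trustworthy as the history behind it: noisy or coarse data widens it and hides real problems, while a sudden legitimate shift in level breaks the trend assumption and shows up as false positives until the history catches up.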
Baker’s Law
Misery no longer loves
company.
Now it insists on it
Hill’s Commentaries
• If we lose much by having things go
wrong, take all possible care
• If we have nothing to lose by change,
relax
• If we have everything to gain by
change, relax
• If it doesn’t matter, it does not matter
McAllister Corollary: Until it does
Murphy’s for Observability #10
All’s well that ends
Thank You
https://www.linkedin.com/in/davemc
