THAT’S NOT A METRIC!
DATA FOR CLOUD-NATIVE SUCCESS
GORDON HAFF
Technology Evangelist, Red Hat
LC3 China 2017
@ghaff
PRINCIPLES
“Without data you’re just a person with an
opinion.”
- W. Edwards Deming
“Implicit in the phrase ‘big data,’ as well as the
concept of data as gold, is that more is better. But
in the case of analytics, a legitimate question
worth considering: Is more data really better?”
- Bob O’Donnell
“You can’t pick your data, but you must pick your
metrics.”
- Jeff Bladt and Bob Filbin
“A familiar phrase on the turf is ‘horses for
courses.’”
- Unknown British writer, 1898
“Human beings adjust behavior
based on the metrics they’re
held against. Anything you
measure will impel a person to
optimize his score on that
metric. What you measure is
what you’ll get. Period.”
THE PRINCIPLES
● You need to measure
● You need to choose relevant metrics
● Quantity may not lead to quality
● Different measurements serve different purposes
● Measurements drive behaviors
LENSES
AUDIENCE
BUSINESS
● Customer satisfaction
● Shopping cart abandonment
● Employee turnover
OPERATIONS
● Cluster health
● Utilization
● Outages
DEVELOPERS
● “Productivity”
● Test coverage
● Time to deploy
PEOPLE, PROCESS, AND TECHNOLOGY
PEOPLE
● Turnover
● Capability
● Response time
PROCESS
● Effectiveness
● Efficiency
● Deployment frequency
TECHNOLOGY
● Performance
● Failure rate
● Uptime
Hat tip to Chris Riley on DevOps.com
FUNCTIONAL GOALS (NEW RELIC)
BUSINESS SUCCESS
● Churn
● Conversion rates
● Average revenue per user
CUSTOMER EXPERIENCE
● Customer satisfaction
● Frequency of visits
● A/B test results
APPLICATION PERFORMANCE
● Application response
● Database query time
● Uptime
SPEED
● Lead time for changes
● Code release frequency
● Mean time to resolution
QUALITY
● Deployment success rate
● Incident severity
● Outstanding bugs
DATA
4 RULES FOR DATA
● Instrument (many/most of) the things
● Root cause analysis (reactive)
● Detect patterns/trends (proactive)
● Context and distributions matter (see the sketch below)
WHAT DO WE MEASURE AND STORE?
● Most things
● Unexamined data has negative ROI
● General trend toward keeping data
“forever”
“Give it two years and everything will be stored.”
—Harel Kodesh, GE Digital CTO
Example: 300GB of data per engine, per flight
SOME DIRECTIONS
● Increased use of statistics and machine learning
(eyeballing dashboards doesn’t scale)
● Better understand how data interacts (latency
affects page load, which affects customer
conversion, which affects revenue)
● Context (seasonal patterns are OK)
● Bottom line: Find patterns that don’t conform to
expected behavior (anomalies 101; see the sketch below)
LOGGING: EFK STACK
● Elasticsearch, Fluentd, Kibana
● Collect, index, search, and visualize log data
● Good for ad hoc analytics
● Good for post-mortem forensics because of
extensive log information
● Fluentd can serve as an integration point between
cloud-native software like Kubernetes and
Prometheus (see the sketch below)
MONITORING: PROMETHEUS
● Time-series data model; each series is identified
by a metric name and key/value pairs (labels)
● Collection happens via a pull model over HTTP
(see the sketch below)
● Prioritizes reliability under failure conditions
over 100% accuracy
● Most associated with web-scale DevSecOps
MONITORING: HAWKULAR
● REST API to store and retrieve availability,
counter, and gauge measurements (see the
sketch below)
● Visualization and alerting
● Application performance management
● Integration with ManageIQ (cloud management)
● Most associated with large-scale central IT
teams with lots of apps
ALARMS
4 RULES FOR ALARMS
● Exciting, not routine
● Something needs to be fixed. Now.
● No ambers!
● Must reach the right people (see the routing sketch below)
ALARM FATIGUE IS A THING
WHICH OF THE FOLLOWING SHOULD WAKE UP AN
EXPENSIVE ENGINEER AT 2AM?
A: Based on current trends, we need to add additional
capacity within 2 weeks
B: A hardware failure led to a successful cluster failover
C: Response time has increased by 20%
D: Our customer support site is down because of an
AWS-East outage
The answer: D. Our customer support site is down
because of an AWS-East outage
MEASUREMENTS AREN’T METRICS
4 RULES FOR METRICS
● What’s important to you? (Success criteria)
● Tied to business outcomes
● Traceable to root cause(s)
● Not too many!
SELECTED PAYPAL METRICS
WHAT                      | WHY
% of failed deployments   | Dysfunction in deployment pipeline
Customer ticket volume    | Basic customer satisfaction measure
Response time             | Service operating within thresholds
Deployment frequency      | Faster iterations for new code
Change volume             | User stories/new lines of code
PUPPET LABS METRICS
● Deployment (or change) frequency
● Change lead time
● Change failure rate
● Mean time to recover (all four are computed in the sketch below)
RED HAT OPENSHIFT ONLINE METRICS
● Number of applications
● Efficiency (cost)
● Response time (various measures)
● Uptime
GARTNER: DEVOPS METRICS
Source: Gartner
Data-Driven DevOps: Use Metrics to Help Guide Your Journey
May 2014
METRICS ANTI-PATTERNS
ANTI-PATTERN WARNING SIGNS
● Easy to collect but don’t really
mean anything
● Drive lack of cooperation
● Not observable or not
actionable
● Not aligned with business
objectives
WHAT MATTERS TO YOU?
What do you want to optimize for?
Customers, cost, speed…?
SUMMARY
● Measurements matter
● They’re not metrics
● Metrics are about your success factors
● Do you need to wake someone up?
● New open source tooling (but early)
THANK YOU
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews
CREDITS
Lens porn: Ash https://www.flickr.com/photos/neothermic/3485301339
Piggy bank: https://www.flickr.com/photos/marcmos/3644751092
Horse racing: https://www.flickr.com/photos/rogerbarker/2881596967
Report card: https://www.flickr.com/photos/richardgiles/3835758300
Traffic light: https://www.flickr.com/photos/96dpi/3124912138/
Air traffic: NATS - UK air traffic control
Sleeping: https://www.flickr.com/photos/barkbud/4126277314/
