THAT’S NOT A METRIC!
DATA FOR CLOUD-NATIVE SUCCESS
GORDON HAFF
Technology Evangelist, Red Hat
LC3 China 2017
@ghaff
PRINCIPLES
“Without data you’re just a person with an
opinion.”
- W. Edwards Deming
“Implicit in the phrase ‘big data,’ as well as the
concept of data as gold, is that more is better. But
in the case of analytics, a legitimate question
worth considering: Is more data really better?”
- Bob O’Donnell
“You can’t pick your data, but you must pick your
metrics.”
- Jeff Bladt and Bob Filbin
“A familiar phrase on the turf is ‘horses for
courses.’”
- Unknown British writer, 1898
“Human beings adjust behavior
based on the metrics they’re
held against. Anything you
measure will impel a person to
optimize his score on that
metric. What you measure is
what you’ll get. Period.”
THE PRINCIPLES
● You need to measure
● You need to choose relevant metrics
● Quantity may not lead to quality
● Different measurements serve different purposes
● Measurements drive behaviors
LENSES
AUDIENCE
BUSINESS
● Customer satisfaction
● Shopping cart abandonment
● Employee turnover
OPERATIONS
● Cluster health
● Utilization
● Outages
DEVELOPERS
● “Productivity”
● Test coverage
● Time to deploy
PEOPLE, PROCESS, AND TECHNOLOGY
PEOPLE
● Turnover
● Capability
● Response time
PROCESS
● Effectiveness
● Efficiency
● Deployment frequency
TECHNOLOGY
● Performance
● Failure rate
● Uptime
Hat tip to Chris Riley on DevOps.com
FUNCTIONAL GOALS (NEW RELIC)
BUSINESS SUCCESS
● Churn
● Conversion rates
● Average revenue per user
CUSTOMER EXPERIENCE
● Customer satisfaction
● Frequency of visits
● A/B test results
APPLICATION PERFORMANCE
● Application response
● Database query time
● Uptime
SPEED
● Lead time for changes
● Code release frequency
● Mean time to resolution
QUALITY
● Deployment success rate
● Incident severity
● Outstanding bugs
DATA
4 RULES FOR DATA
● Instrument (many/most of) the things
● Root cause analysis (reactive)
● Detect patterns/trends (proactive)
● Context and distributions matter (see the sketch below)
WHAT DO WE MEASURE AND STORE?
● Most things
● Unexamined data has negative ROI
● General trend toward keeping data
“forever”
“Give it two years and everything will be stored.”
—Harel Kodesh, GE Digital CTO
Example: 300GB of data per engine, per flight
SOME DIRECTIONS
● Increased use of statistics and machine learning
(eyeballing dashboards doesn’t scale)
● Better understand how data interacts (latency
affects page load, which affects customer
conversion, which affects revenue)
● Context (seasonal patterns are OK)
● Bottom line: Find patterns that don’t conform to
expected behavior (anomalies 101; see the sketch below)
LOGGING: EFK STACK
● Elasticsearch, Fluentd, Kibana
● Collect, index, search, and visualize log data
● Good for ad hoc analytics
● Good for post-mortem forensics because of
extensive log information
● Fluentd can serve as an integration point between
cloud-native software like Kubernetes and
Prometheus (see the sketch below)
MONITORING: PROMETHEUS
● Time-series data model; each series is identified
by a metric name and key/value pairs (labels)
● Collection happens via a pull model over HTTP
(see the sketch below)
● Prioritizes reliability under failure conditions
over 100% accuracy
● Most associated with web-scale DevSecOps
MONITORING: HAWKULAR
● REST API to store and retrieve availability,
counter, and gauge measurements (see the
sketch below)
● Visualization and alerting
● Application performance management
● Integration with ManageIQ (cloud management)
● Most associated with large-scale central IT
teams with lots of apps
ALARMS
4 RULES FOR ALARMS
● Exciting, not routine
● Something needs to be fixed. Now.
● No ambers!
● Must reach the right people (see the routing sketch below)
ALARM FATIGUE IS A THING
WHICH OF THE FOLLOWING SHOULD WAKE UP AN
EXPENSIVE ENGINEER AT 2AM?
A: Based on current trends, we need to add additional
capacity within 2 weeks
B: A hardware failure led to a successful cluster failover
C: Response time has increased by 20%
D: Our customer support site is down because of an
AWS-East outage
The answer: D. Our customer support site is down
because of an AWS-East outage
MEASUREMENTS AREN’T METRICS
4 RULES FOR METRICS
● What’s important to you? (Success criteria)
● Tied to business outcomes
● Traceable to root cause(s)
● Not too many!
SELECTED PAYPAL METRICS
WHAT                      | WHY
% of failed deployments   | Dysfunction in deployment pipeline
Customer ticket volume    | Basic customer satisfaction measure
Response time             | Service operating within thresholds
Deployment frequency      | Faster iterations for new code
Change volume             | User stories/new lines of code
PUPPET LABS METRICS
● Deployment (or change) frequency
● Change lead time
● Change failure rate
● Mean time to recover (all four are computed in the sketch below)
RED HAT OPENSHIFT ONLINE METRICS
● Number of applications
● Efficiency (cost)
● Response time (various measures)
● Uptime
GARTNER: DEVOPS METRICS
Source: Gartner
Data-Driven DevOps: Use Metrics to Help Guide Your Journey
May 2014
METRICS ANTI-PATTERNS
ANTI-PATTERN WARNING SIGNS
● Easy to collect but don’t really
mean anything
● Drive lack of cooperation
● Not observable or not
actionable
● Not aligned with business
objectives
WHAT MATTERS TO YOU?
What do you want to optimize for?
Customers, cost, speed…?
SUMMARY
● Measurements matter
● They’re not metrics
● Metrics are about your success factors
● Do you need to wake someone up?
● New open source tooling (but early)
THANK YOU
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews
CREDITS
Lens porn: Ash https://www.flickr.com/photos/neothermic/3485301339
Piggy bank: https://www.flickr.com/photos/marcmos/3644751092
Horse racing: https://www.flickr.com/photos/rogerbarker/2881596967
Report card: https://www.flickr.com/photos/richardgiles/3835758300
Traffic light: https://www.flickr.com/photos/96dpi/3124912138/
Air traffic: NATS - UK air traffic control
Sleeping: https://www.flickr.com/photos/barkbud/4126277314/
