© 2018 SPLUNK INC.
How to Move From Monitoring
to Observability
Observability: the disingenuous rebranding of monitoring?
Dr. Siyka Andreeva | IT Operations Analytics Specialist
Marc Serieys | Staff Sales Engineer
June 2019
Forward Looking Statements
During the course of this presentation, we may make forward-looking statements regarding future events or
the expected performance of the company. We caution you that such statements reflect our current
expectations and estimates based on factors currently known to us and that actual events or results could
differ materially. For important factors that may cause actual results to differ from those contained in our
forward-looking statements, please review our filings with the SEC.
The forward-looking statements made in this presentation are being made as of the time and date of its live
presentation. If reviewed after its live presentation, this presentation may not contain current or accurate
information. We do not assume any obligation to update any forward-looking statements we may make. In
addition, any information about our roadmap outlines our general product direction and is subject to change
at any time without notice. It is for informational purposes only and shall not be incorporated into any contract
or other commitment. Splunk undertakes no obligation either to develop the features or functionality
described or to include any such feature or functionality in a future release.
Splunk, Splunk>, Listen to Your Data, The Engine for Machine Data, Splunk Cloud, Splunk Light and SPL are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other
brand names, product names, or trademarks belong to their respective owners. © 2017 Splunk Inc. All rights reserved.
Agenda
What is observability, and how does it differ from monitoring?
Why is observability an even bigger challenge in a multi-cloud and containerized world?
How can Splunk help?
What is observability?
The disingenuous rebranding of monitoring?
Monitoring on steroids?
DevOpsifying monitoring?
Observability: the word is spreading
because failure is shifting to application code and to in-production system behavior
Why is the word spreading?
IT Operations monitoring challenges are getting worse in a distributed world:
• IT teams know that something is not working -- but not exactly why it’s not working
• Repetitive, manual processes for reactive troubleshooting
• Inability to get to root cause quickly
• Siloed analysis of logs, traces, and metrics
Management Expectations:
• Reduce financial impact through fewer system outages
• Accelerate investigation of application performance and system incidents with real-time log and metric analysis
• Consolidate operational tools and/or external services into one observability tool
• Improve collaboration across teams with targeted alerting and tailored visualization
Same for Dev teams:
• Gap between perception and reality
• Dev teams spend too much time observing the dev and pre-production environments and not production
Why observability (in IT)?
Source: Wikipedia
Survivorship bias or survival bias is the logical error of concentrating on the people or things that made it
past some selection process and overlooking those that did not, typically because of their lack of visibility. This
can lead to false conclusions in several different ways.
Shot-down aircraft don’t externalize their state
What is observability?
Input → Output
In industrial systems (physical telemetry): flow, valve, purity; velocity, direction, quality
In software systems (logging, metrics, functions): customer ID, success/fail, $ spend; add to cart, checkout, bill/ship
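To make the software-systems side concrete, here is a minimal sketch of a service externalizing its state as structured events. The field names (`customer_id`, `step`, `success`, `spend_usd`) and the helper functions are illustrative assumptions drawn from the labels on this slide, not part of any Splunk API.

```python
import json
import time


def checkout_event(customer_id, step, success, spend):
    """Build one structured event describing a single step of a purchase."""
    return {
        "timestamp": time.time(),
        "customer_id": customer_id,   # who was affected
        "step": step,                 # e.g. "add_to_cart", "checkout", "bill_ship"
        "success": success,           # outcome of the step
        "spend_usd": spend,           # business value attached to the step
    }


def emit(event):
    """Serialize the event as one JSON line that a log collector could ingest."""
    line = json.dumps(event, sort_keys=True)
    print(line)
    return line


# Hypothetical failed checkout for an example customer.
line = emit(checkout_event("c-1001", "checkout", False, 42.50))
```

An event like this answers both the operational question (did the step fail?) and the business question (for which customer, at what value?) from the same record.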
From monitoring to the three (only three?) pillars of observability
Inspired by @copyconstruct
Monitoring: symptoms (what’s broken?)
• Alerting
• Service health overview
• Investigation
• All the time, passive, Ops
Observability: causes (why?)
• Debugging
• Profiling (system behavior)
• Dependency analysis (distributed systems tracing infrastructure)
• On the fly, reactive, Dev
Pillars A to D: LOGS, METRICS, TRACES; also shown: events, profiles
Why is that important in a multi-cloud environment?
2019 trends
[Figure: monolithic architecture (users and drivers reach one API in front of a single block of business logic: user mgmt, driver mgmt, trip mgmt, billing, payment, notification) versus microservices architecture (users and drivers reach an API gateway in front of separate containers for user mgmt, driver mgmt, trip mgmt, billing, payment and notification)]
[Figure: multi-cloud: microservices, business intelligence, legacy systems, frontend, storage, compute, security, ?]
[Figure: deployment stacks: bare metal (hardware, OS, libraries, app), virtual machines (hardware, hypervisor, then OS, lib and app per VM), containers (hardware, OS, container manager, then lib and app per container), serverless functions (hardware, OS, libraries, app manager, then apps)]
Containers / Kubernetes / Serverless
Observability in the distributed (and ephemeral) systems/cloud space is non-negotiable
Distributed location / responsibilities; distributed systems/code
What happens when we stack them? How does this apply to you and your Ops teams?
Customer experience???
• SaaS
• On premises: legacy systems (mainframe…), facilities, dev/pre-prod, storage, backup, archive, DR, security, VMs, containers, microservices
• AWS (Application 1): access / security, database, storage, dev, compute, containers
• GCP (Big Data project 1): App Engine, Dataflow
• AWS (Archive)
• Azure (Application 1): VMs, database, VM sets, traffic manager
The consequence: only green lights in the war room
Customer experience???
[Figure: the same environment stack as the previous slide (SaaS, on premises, AWS, GCP, Azure), with question marks over the stakeholders: CxO, BLO, SaaS, CISO, Dev, SysAdmin, MKT]
Splunk for IT Operations
How do we help with observability everywhere?
A market leader
OBSERVE: ITOM (IT operations management). Tools to manage provisioning, capacity, performance and availability of IT
DECIDE: ITOA (IT operations analytics). Practice of monitoring systems, and gathering, processing, analyzing & interpreting data from ITOps sources to guide decisions & predict issues
ACCELERATE: AIOps. AIOps platforms enhance IT operations through greater insights by combining big data, machine learning and visualization.
PROTECT: SIEM (security information and event management)
Rankings shown across SECURITY and IT OPERATIONS: #1, #2, #1, #2
Sources: IDC and/or Gartner
We reached the limits of the traditional approach
Traditional Data Types
Not future proof
Complex
Never Change!
Untapped IT-generated
machine data
(logs, metrics, wire data…)
Machine data is messy and unpredictable
Requires massive scale
You don’t always know which questions to ask
80%
Industry Leading Platform For Machine Data
Not consumable by humans → consumable by humans
Data sources: online services, networks, security, call detail records, web services, telecoms, web clickstreams, tracing, online shopping cart, smartphones and devices, custom applications, energy meters, storage, public cloud, private cloud, containers, on-premises servers, GPS location, RFID, packaged applications, databases, messaging, firewall
Data types: logs, wire data, DB, mobile, IoT, API, metrics
DATA: any amount, any location, any source
• No need to “adapt or structure” the data
• No database
• No need to filter data
SPLUNK PLATFORM (on-prem or cloud): custom dashboards, report & analyze, monitor and alert, developer platform, ad hoc search
SPLUNKBASE: 1600+ free apps/add-ons
PREMIUM APPS (“data scientist in a box”): IT Ops, DevOps, Security, Business Analytics, IoT
Different people asking different questions on the same data, in real time
3rd party: Phantom (orchestration), VictorOps (collaboration), CMDB, SNOW…, data lake, APM traces, APM tracing
Structuring machine data
= fighting a losing battle
How to find a needle in multiple haystacks?
(choose your tool)
Network? Database? Middleware? Hardware? Wrong command? Connection? Apache? VM? Mainframe? Load balancer? Wrong code released?
Collect ALL data
• Collect from all silos
• Data in original raw format
• Add open-source apps to ingest data on the fly
• Schema on the fly
• Dynamic thresholding
• Real-time correlation
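As an illustration of "data in original raw format" combined with "schema on the fly", the sketch below keeps events as raw text and derives fields only when a question is asked. The sample log lines, the `key=value` convention and the function names are assumptions made for this example; it is not how Splunk implements search-time field extraction.

```python
import re

# Hypothetical raw events, stored exactly as they were produced.
RAW_EVENTS = [
    "2019-06-01T10:00:01Z host=web01 status=200 latency_ms=35",
    "2019-06-01T10:00:02Z host=web02 status=500 latency_ms=910 error=timeout",
]

KEY_VALUE = re.compile(r"(\w+)=(\S+)")


def extract_fields(raw):
    """Derive key=value fields from a raw line at read time, not at ingest."""
    fields = dict(KEY_VALUE.findall(raw))
    fields["_raw"] = raw  # the original text is always kept
    return fields


def search(events, **criteria):
    """Return the events whose extracted fields match every criterion."""
    return [
        fields
        for fields in map(extract_fields, events)
        if all(fields.get(key) == value for key, value in criteria.items())
    ]


errors = search(RAW_EVENTS, status="500")
```

Because no schema is fixed at ingest, a field such as `error`, which appears in only some events, can be queried later without reloading the data.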
Clustering & aggregation
• Real-time event clustering/correlation
• Reduce alert noise
• Behavioural analytics
• Deduplication
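One simple way to picture deduplication and alert-noise reduction is to collapse repeated alerts for the same host and check that arrive close together. The alert shape (`time`, `host`, `check`) and the five-minute window are assumptions chosen for this sketch.

```python
def deduplicate(alerts, window_seconds=300):
    """Keep an alert only if the same (host, check) has been quiet for the window."""
    last_seen = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        key = (alert["host"], alert["check"])
        previous = last_seen.get(key)
        if previous is None or alert["time"] - previous > window_seconds:
            kept.append(alert)
        # Every repeat extends the quiet period that must pass before re-alerting.
        last_seen[key] = alert["time"]
    return kept


# Hypothetical alert stream: three repeats of one problem, one unrelated alert,
# and a later recurrence of the first problem.
alerts = [
    {"time": 0, "host": "db01", "check": "disk"},
    {"time": 60, "host": "db01", "check": "disk"},
    {"time": 90, "host": "web01", "check": "cpu"},
    {"time": 120, "host": "db01", "check": "disk"},
    {"time": 1000, "host": "db01", "check": "disk"},
]
kept = deduplicate(alerts)
```

Here five raw alerts reduce to three notifications: the first disk alert, the CPU alert, and the later recurrence.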
Add context
• Measure / report on indicators that matter
• Add service / business context
• Add actionable information to detection
Sales, SSO, Claims
Anomaly detection
• Catch issues that thresholds cannot
• Reduce event clutter
• Deviation from past behaviour
• Deviation from peers
• Unusual change in features
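"Deviation from past behaviour" can be sketched with a basic standard-deviation test: a value is flagged when it sits far from the mean of its own history. The three-sigma rule, the sample latencies and the function name are assumptions for illustration; production anomaly detection typically also accounts for seasonality and trend.

```python
from statistics import mean, stdev


def is_anomalous(history, value, sigmas=3.0):
    """Flag a value that deviates from its own history by more than N sigmas."""
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        # A perfectly flat history makes any change a deviation.
        return value != mu
    return abs(value - mu) > sigmas * sd


# Hypothetical response times in milliseconds for one service.
history = [100, 102, 98, 101, 99, 100, 103, 97]

spike = is_anomalous(history, 120)    # far outside normal variation
normal = is_anomalous(history, 104)   # within normal variation
```

A static threshold set at, say, 150 ms would miss the 120 ms reading, while the history-based test flags it; this is the sense in which anomaly detection can catch issues that thresholds cannot.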
Assisted deep dive investigation
• Root cause analysis
• Powerful & easy to use search & investigate language
Predictive analytics
• Predict service health
• Predict events
• Trend forecasting
• Detect influencing entities
• Early warning of failure
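Trend forecasting and early warning of failure can be illustrated with a least-squares line fitted to recent measurements and extrapolated to a limit. The disk-usage figures, the 90% limit and the function names are assumptions for this sketch, and a straight line is only one of many possible forecasting models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until(limit, xs, ys):
    """Hours after the last sample until the fitted trend reaches the limit."""
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None  # usage is flat or shrinking, so no breach is forecast
    return (limit - intercept) / slope - xs[-1]


# Hypothetical hourly disk usage, in percent.
hours = [0, 1, 2, 3, 4]
disk_used_pct = [50.0, 52.0, 54.0, 56.0, 58.0]

remaining = hours_until(90.0, hours, disk_used_pct)
```

With usage growing two points per hour from 50%, the fitted trend reaches 90% sixteen hours after the last sample, which gives operations time to act before the failure occurs.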
70% to 90% reduction in investigation time
15% to 45% reduction in high priority incidents
67% to 82% reduction in business impact
Observability with Splunk
Axes: awareness / data available (known, unknown) versus understanding (knowns, unknowns)
Known Knowns (known problem & solution)
Unknown Knowns (didn’t realize but clear solution)
Known Unknowns (we see the problem, not the solution)
Unknown Unknowns (no idea it’ll happen)
Improve the Known-Knowns
Dynamic thresholding, automation, schema on the fly, real-time dashboards…
Provide auto correlations, real-time searches, analytics, business process mining…
See the Known-Unknowns
Discover the Unknown-Knowns
Anomaly detection, predictive IT…
Ingest any data, ask any question, get answers in real time…
Explore the Unknown-Unknowns
Answer new questions, find new unknowns
Observe | Monitor | Analyze | Act
It’s a journey
Search & Monitor
• (Any) data collection
• Real-time monitoring/observability
• Centralized machine data search
Business Insights
• Business KPIs
• Insights to drive experience
Operational visibility
• Service-oriented view
• Root cause analysis
• Stabilize IT
Predict & Improve
• Predict issues
• Recommend actions based on prior behaviors
• Increase MTBF
Thank you

How to Move from Monitoring to Observability, On-Premises and in a Multi-Cloud Environment
