© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Engineering Netflix Global Operations in the Cloud
Josh Evans - Director of Operations Engineering
Our Journey
• Two Operational Challenges
• Operational Excellence
• Operations Engineering
Product Innovation
winning moments of truth
● Every facet of the product
● 1400 AB tests in the last year & accelerating
Continuous Innovation
Challenge #1:
Accelerate Innovation and Rate of Change
Scale & Complexity
100,000s of requests per second
1000s of Global Starts per Second
Approaching Global Reach
October - Spain, Portugal, Italy
Early 2016 - Korea, Taiwan, Singapore, Hong Kong
65m members → 100m
~60 countries → 200
EU-West, US-East, US-West
Multi-Zone, Multi-Region
Netflix CDN
(Open Connect)
Cloud
Control Plane
Internet
The Bigger Picture
Service
Partners
Service
Partners
Challenge #2:
Sustain & Improve Quality
in the face of ever growing scale & complexity
Our Journey
• Two Operational Challenges
• Operational Excellence
• Operations Engineering
Operational Excellence
Quality Velocity
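[Figure: Availability vs. Rate of Change. x-axis: rate of change (1, 10, 100, 1000); y-axis: availability in nines (0 to 6)]
Nines | Availability | Downtime per year
6 | 99.9999% | 31.5 seconds
5 | 99.999% | 5.26 minutes
4 | 99.99% | 52.56 minutes
3 | 99.9% | 8.76 hours
2 | 99% | 3.65 days
1 | 90% | 36.5 days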
Quality vs. Velocity
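The downtime that goes with each availability level is plain arithmetic: the unavailable fraction multiplied by the length of a year. A minimal sketch, assuming a 365-day year; the function names are illustrative and not from the talk:

```python
# Assumption: a non-leap year of 365 days (31,536,000 seconds).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 3 nines -> 0.001
    return SECONDS_PER_YEAR * unavailability


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that keeps the value >= 1."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for n in range(1, 7):
        print(f"{n} nines: {humanize(downtime_seconds_per_year(n))}")
```

Each extra nine divides the downtime budget by ten, which is why the vertical axis of the chart is effectively logarithmic.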
[Figure: Availability (nines) vs. Rate of Change]
The Zero Sum Game
[Figure: Availability (nines) vs. Rate of Change]
The Zero Sum Game
[Figure: Availability (nines) vs. Rate of Change]
Shifting the Curve
Operational Excellence is the continuous improvement
of the management, design, and function of operational
environments to achieve greater quality, velocity, and
competitive advantage.
Our Journey
• Two Operational Challenges
• Operational Excellence
• Operations Engineering
Build It
design
code
build
bake
test
deploy
Run It
operate
configure
monitor
respond
You build it, you run it…
…globally
Undifferentiated
Heavy Lifting
Operations Engineering is the application of software
engineering practices and principles to achieve and sustain
operational excellence.
• automation
• modular components
• tools & services
• best practices
Our Journey – Operations Engineering
• Engineering Tools
• Insight & Real-time Analytics
• Performance & Reliability
• Leverage
Our Journey
• Engineering Tools
• Insight & Real-time Analytics
• Performance & Reliability
• Leverage
Data Center
● Delayed provisioning
● Hand-crafted servers
● Variations and complexity
Our Artisanal Past
Delivery
● Late night, manual deployments
● Repeated mistakes
● Painful delays to production fixes
• productivity
• velocity
• quality
Engineering Tools
• cloud management
• delivery engine
• automation platform
Global Cloud Management
Delivery Pipelines
Automated Global Delivery
The Paved Road
• Stash
• Gradle
• Ubuntu
• Jenkins
• Spinnaker
Our Journey
• Engineering Tools
• Insight & Real-time Analytics
• Performance & Reliability
• Leverage
Insight & Real-Time Analytics
OODA loop
An outage may not be life or death but…
• Double exponential smoothing (DES) on time series data
• Predict the future based on history
• Favor recent history
• Threshold-based alerts
• 6-8 minute delay
Anomaly Detection
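A minimal sketch of the idea on this slide: double exponential smoothing (Holt's linear method) forecasts each point from recent history, and an alert fires when the observed value strays too far from the forecast. The smoothing factors `alpha` and `beta`, the fixed `band`, and the function names are assumptions for illustration, not the values or code used at Netflix.

```python
def des_predictions(series, alpha=0.5, beta=0.3):
    """One-step-ahead forecasts via double exponential smoothing.

    Returns one forecast for each of series[1:], plus a final forecast
    for the next, not yet observed, point.
    """
    if len(series) < 2:
        raise ValueError("need at least two points to seed level and trend")
    level = series[0]
    trend = series[1] - series[0]
    preds = []
    for x in series[1:]:
        preds.append(level + trend)  # forecast made before seeing x
        # Higher alpha/beta weight recent observations more heavily.
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    preds.append(level + trend)
    return preds


def des_alerts(series, band, alpha=0.5, beta=0.3):
    """Indices whose observed value is more than `band` from its forecast."""
    preds = des_predictions(series, alpha, beta)
    return [
        i + 1
        for i, (p, x) in enumerate(zip(preds, series[1:]))
        if abs(x - p) > band
    ]


if __name__ == "__main__":
    requests_per_second = [10, 12, 14, 16, 40, 20]
    print(des_alerts(requests_per_second, band=5))
```

Because the forecast adapts to recent values, a spike also shifts the predictions that follow it, so points right after an anomaly can be flagged as well.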
Alert!
Finer Granularity, Shorter Time Windows
Ensemble Learning
Median Absolute Deviation
IQR
Least Squares
HDI
Voting
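The ensemble idea can be sketched as several independent detectors that each vote on whether a new value is anomalous, with a majority deciding. The sketch below covers three of the listed techniques (median absolute deviation, IQR, least squares) and leaves out HDI; the multipliers, the vote threshold, and all function names are assumptions chosen for illustration.

```python
import statistics


def mad_flag(history, x, k=3.0):
    """Flag x if it is more than k scaled MADs away from the median."""
    med = statistics.median(history)
    mad = statistics.median([abs(v - med) for v in history])
    return mad > 0 and abs(x - med) > k * 1.4826 * mad


def iqr_flag(history, x, k=1.5):
    """Flag x if it falls outside the usual IQR fences."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return x < q1 - k * iqr or x > q3 + k * iqr


def lsq_flag(history, x, k=3.0):
    """Fit a least-squares line to history and flag a large residual at the next step."""
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * i) for i, y in enumerate(history)]
    spread = statistics.pstdev(residuals)
    predicted = intercept + slope * n
    return spread > 0 and abs(x - predicted) > k * spread


def is_anomaly(history, x, min_votes=2):
    """Majority vote across the three detectors."""
    votes = [mad_flag(history, x), iqr_flag(history, x), lsq_flag(history, x)]
    return sum(votes) >= min_votes


if __name__ == "__main__":
    history = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98]
    print(is_anomaly(history, 100), is_anomaly(history, 200))
```

Voting trades a little sensitivity for fewer false alarms, since a single noisy detector cannot raise an alert on its own.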
observe, orient, decide, act
Alert!
From 6-8 minutes to < 1 minute
observe, orient…
…decide, act
How do we take humans out of the equation?
Outlier Detection & Remediation
• Unsupervised machine learning
• Density-based clustering algorithm
• Actions
  • Email, page
  • OOS, detach, terminate
Kepler
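As an illustration of density-based clustering for outlier detection, the sketch below uses DBSCAN, a common algorithm of that family; the slide does not name the specific algorithm, so that choice is an assumption. Each server is a point of metric values, and servers that do not belong to any dense cluster are reported as outliers. The metric values, `eps`, and `min_pts` are made up for the example.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (>= 0) or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(nbrs)
    return labels


def find_outliers(servers, eps, min_pts):
    """Return ids of servers that DBSCAN marks as noise."""
    ids = list(servers)
    labels = dbscan([servers[s] for s in ids], eps, min_pts)
    return [s for s, label in zip(ids, labels) if label == -1]


if __name__ == "__main__":
    # Hypothetical (error rate, normalized latency) pairs per instance.
    servers = {
        "i-a": (1.0, 1.0),
        "i-b": (1.1, 0.9),
        "i-c": (0.9, 1.1),
        "i-d": (1.0, 1.05),
        "i-e": (5.0, 5.0),
    }
    print(find_outliers(servers, eps=0.5, min_pts=3))
```

The returned ids would then feed whichever action is configured, from an email or page up to taking the instance out of service, detaching it, or terminating it.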
An ounce of prevention…
Customers -> Load Balancer
95% -> Old Version (v1.0), 100 Servers
5% -> New Version (v1.1), 5 Servers
Metrics
Canary Release Process
Customers -> Load Balancer
100% -> New Version (v1.1), 100 Servers
Old Version (v1.0), 0 Servers
Metrics
Canary Release Process
Define
• Metrics
• A threshold
Every n minutes
● Classify metrics
● Compute score
● Make a decision
Automatic Canary Analysis
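The classify, score, decide loop can be sketched as follows. The classification rule (relative difference between canary and baseline means), the 10% tolerance, the 90-point threshold, and all names are assumptions for illustration; the slides do not describe the actual scoring method.

```python
def classify(baseline, canary, tolerance=0.10):
    """'pass' if the canary mean is within `tolerance` of the baseline mean."""
    base = sum(baseline) / len(baseline)
    can = sum(canary) / len(canary)
    if base == 0:
        return "pass" if can == 0 else "fail"
    return "pass" if abs(can - base) / abs(base) <= tolerance else "fail"


def canary_score(metrics, tolerance=0.10):
    """Score = percentage of metrics classified as 'pass'."""
    results = {
        name: classify(baseline, canary, tolerance)
        for name, (baseline, canary) in metrics.items()
    }
    passed = sum(1 for r in results.values() if r == "pass")
    return 100.0 * passed / len(results), results


def decide(score, threshold=90.0):
    """Continue the rollout only if the score meets the threshold."""
    return "continue" if score >= threshold else "rollback"


if __name__ == "__main__":
    # Hypothetical samples gathered during one n-minute interval:
    # metric name -> (baseline samples, canary samples).
    metrics = {
        "error_rate": ([1.0, 1.1, 0.9], [1.0, 1.05, 0.95]),
        "latency_ms": ([100, 110, 90], [150, 160, 155]),
    }
    score, results = canary_score(metrics)
    print(score, results, decide(score))
```

In practice this would run on a schedule (every n minutes) for as long as the canary receives traffic, with each run producing a fresh score and decision.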
• Systematic observation of facets & permutations
• Unsupervised monitoring & decision-making
• Automated tuning & recovery
• Alerts with analysis
Thinking Globally
Our Journey
• Engineering Tools
• Insight & Real-time Analytics
• Performance & Reliability
• Leverage
Performance & Reliability
Internet
Zuul
API
NCC
P
Playback
History
Playback Sessions
MAP
Chaos Engineering is the discipline of experimenting on
a distributed system in order to build confidence in the
system's capability to withstand turbulent conditions in
production.
Cluster A Cluster D
Edge Cluster
Cluster B
Cluster C
Imagine a monkey loose in your data center…
Xen Hypervisor vulnerability – 9/25/14
218 out of 2700+ Cassandra nodes rebooted
22 did not reboot successfully
Automation handled the rest
A State of Xen – Chaos Monkey & Cassandra
[Diagram: Device, Internet, ELB, Edge (Zuul), FIT, Service A, Service B, Service C]
Fault-Injection Testing (FIT)
• Simulate service failures
• Override by device or account
• % of member traffic
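A minimal sketch of how the three options above could combine into one injection decision at the edge. It is not the FIT implementation: the `FaultRule` type, its field names, and the hashing scheme for selecting a percentage of traffic are all hypothetical.

```python
import zlib
from dataclasses import dataclass


class InjectedFailure(Exception):
    """Raised in place of a real call when a fault is being simulated."""


@dataclass(frozen=True)
class FaultRule:
    service: str                          # service whose failure is simulated
    percent: float = 0.0                  # share of member traffic, 0 to 100
    device_ids: frozenset = frozenset()   # explicit device overrides
    account_ids: frozenset = frozenset()  # explicit account overrides


def should_inject(rule, service, device_id, account_id):
    """Decide whether this request should see a simulated failure."""
    if service != rule.service:
        return False
    if device_id in rule.device_ids or account_id in rule.account_ids:
        return True
    # Stable bucket in 0..9999 so a given account is consistently in or out.
    bucket = zlib.crc32(account_id.encode("utf-8")) % 10000
    return bucket < rule.percent * 100


def call_with_fit(rule, service, device_id, account_id, real_call, fallback):
    """Run real_call unless a fault is injected, then exercise the fallback."""
    try:
        if should_inject(rule, service, device_id, account_id):
            raise InjectedFailure(service)
        return real_call()
    except InjectedFailure:
        return fallback()


if __name__ == "__main__":
    rule = FaultRule("ratings", device_ids=frozenset({"dev-1"}))
    print(call_with_fit(rule, "ratings", "dev-1", "acct-9",
                        real_call=lambda: "live ratings",
                        fallback=lambda: "cached ratings"))
```

Hashing the account id keeps the selection deterministic, so the same member sees consistent behavior for the duration of an experiment.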
Global Traffic Management
[Diagram: The Internet; DNS-based Routing; Zuul Proxy; Back Channel; regions US-East, US-West, EU-West; AZ1]
• Alerting and Monitoring
• Apache & Tomcat Hardening
• Automated Canary Analysis
• Autoscaling
• Chaos Participation
• Consistent Naming
• ELB Configuration
• Healthcheck Configured
• Red-Black Pipeline
• Squeeze Testing
• Timeout & Fallback Tuning
• Workload Reliability
Production Ready?
Our Journey
• Engineering Tools
• Insight & Real-time Analytics
• Performance & Reliability
• Leverage
● A federation of tools
● Common UI elements
● Deep linking
Operational Tools as a Product
Canary Analysis
Conformity
Integration Tests
Citrus
Chaos
Static
Unit Tests
Deep Integration
Modular Components
Functional Testing
RTA auto-tuning
• Alerts
• Apache/Tomcat
• Auto-scaling
• Hystrix fallbacks
RTA decision support
• ACA
• Citrus
• Flow
Conformity checks
• Consistent names
• ELBs
• Health check
• Red/black deployment
Delivery integration
• ACA
• Citrus
• FIT
Production Ready – Automation & Integration
Internet
Our Journey Ends
https://netflix.github.io/
Talk | Speaker | When? | Where?
Engineering Netflix Global Operations in the Cloud | Josh Evans | Wed @ 11am | Palazzo N
Efficient Innovation: High-Velocity Cost Management at Netflix | Andrew Park | Wed @ 2:45pm | Palazzo C
Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second | Peter Bakas | Wed @ 2:45pm | San Polo 3501B
A Day in the Life of a Netflix Engineer Using 37% of the Internet | Dave Hahn | Wed @ 4:15pm | Venetian H
Availability: The New Kind of Innovator’s Dilemma | Coburn Watson | Wed @ 4:15pm | Marcello 4501B
Real-Time Analytics In Service of Self-Healing Ecosystems | Roy Rapoport, Chris Sanden | Wed @ 4:15pm | Lido 3001B
Running Spark and Presto on the Netflix Big Data Platform | Daniel Weeks | Thu @ 11am | Palazzo F
Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud | Jason Chan | Thu @ 11am | Marcello 4501B
Thank you!
Josh Evans
jevans@netflix.com
@josh_evans_nflx

Engineering Netflix Global Operations in the Cloud