Cobus Bernard
Sr Developer Advocate
Amazon Web Services
Getting Started with AWS:
Understanding Disaster Recovery
@cobusbernard
cobusbernard
cobusbernard
Agenda
Define requirements & SPOFs
Choosing recovery method
Backups
Testing your plan
Resiliency and self-healing systems
Using DR as a migration strategy
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Initial questions to answer
How important are the applications to your business?
What is the associated recovery point and time for these applications?
How are you storing the data?
Where are you storing the data?
How are you restoring the application?
Why do we back up data?
Minimize data loss
[Diagram: timeline of data changing over time. Data up to t1 is protected; the span between t1 and the current time is labeled RPO.]
Why do we back up data?
Balance cost with liability
[Diagram: liability weighed against cost.]
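To make the RPO idea concrete, here is a minimal Python sketch that measures how much data would be lost if a failure happened between backups and checks that against an RPO. The function names, the 6-hourly schedule, and the timestamps are illustrative assumptions, not part of the talk.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, failure_time):
    """Time between the last backup taken at or before the failure and the failure itself."""
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        raise ValueError("no backup exists before the failure")
    return failure_time - max(usable)


def meets_rpo(backup_times, failure_time, rpo):
    """True if the data lost at failure_time stays within the recovery point objective."""
    return data_loss_window(backup_times, failure_time) <= rpo


# Hypothetical schedule: one backup every 6 hours (00:00, 06:00, 12:00, 18:00).
start = datetime(2020, 1, 1, 0, 0)
backups = [start + timedelta(hours=6 * i) for i in range(4)]
failure = datetime(2020, 1, 1, 17, 30)

print(data_loss_window(backups, failure))                # 5:30:00
print(meets_rpo(backups, failure, timedelta(hours=4)))   # False
print(meets_rpo(backups, failure, timedelta(hours=6)))   # True
```

With backups every 6 hours, the worst case loss approaches 6 hours, so a 4 hour RPO would need a more frequent schedule or continuous replication.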
AWS offers four levels of backup and DR support
across a spectrum of complexity and time (low to high)
Backup & Restore (RPO/RTO: Hours)
• Lower priority use cases
• Solutions: Amazon S3, AWS Storage Gateway
• Cost: $
Pilot light (RPO/RTO: 10s of minutes)
• Meeting lower RTO & RPO requirements
• Core services
• Scale AWS resources in response to a DR event
• Cost: $$
Warm standby in AWS (RPO/RTO: Minutes)
• Solutions that require RTO & RPO in minutes
• Business critical services
• Cost: $$$
Hot standby (with multi-site) (RPO/RTO: Real-time)
• Auto-failover of your environment in AWS
• Cost: $$$$
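One way to read the four levels is as a lookup from the tightest objective to the cheapest level that can meet it. The sketch below does that in Python. The numeric cutoffs (1 hour, 10 minutes, 1 minute) are assumptions chosen to match the "Hours", "10s of minutes", "Minutes" and "Real-time" labels; they are not figures from AWS.

```python
from datetime import timedelta

# (level, relative cost, smallest objective this level is assumed to suit).
# The cutoffs are illustrative assumptions, ordered from cheapest to most expensive.
TIERS = [
    ("Backup & Restore", "$", timedelta(hours=1)),
    ("Pilot light", "$$", timedelta(minutes=10)),
    ("Warm standby", "$$$", timedelta(minutes=1)),
    ("Hot standby (multi-site)", "$$$$", timedelta(0)),
]


def choose_strategy(rto, rpo):
    """Return the cheapest level whose assumed cutoff fits the tighter of RTO and RPO."""
    tightest = min(rto, rpo)
    for name, cost, floor in TIERS:
        if tightest >= floor:
            return name, cost
    raise ValueError("objectives must be non-negative")


print(choose_strategy(timedelta(hours=4), timedelta(hours=24)))      # ('Backup & Restore', '$')
print(choose_strategy(timedelta(minutes=30), timedelta(minutes=15)))  # ('Pilot light', '$$')
print(choose_strategy(timedelta(seconds=5), timedelta(seconds=1)))    # ('Hot standby (multi-site)', '$$$$')
```

Taking the minimum of RTO and RPO is a simplification; in practice the two objectives can point to different levels and the stricter one usually decides.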
AWS Storage Gateway
AWS Backup: centralize compliance, automate backup, work across services
Services shown: Amazon EFS, Amazon EBS, Amazon RDS, Amazon DynamoDB, AWS Storage Gateway
1. Simplified backup scheduling and lifecycle management across AWS services
2. Centrally manage backup activities, security, and reporting
3. Achieve consistency and meet compliance requirements
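As an illustration of "backup scheduling and lifecycle management", the following Python sketch builds the kind of plan document that the AWS Backup `create_backup_plan` API accepts. The plan name, vault name, schedule, and retention numbers are assumptions for the example, and the helper `build_backup_plan` is hypothetical. No AWS call is made here.

```python
def build_backup_plan(name, vault, schedule, cold_after_days, delete_after_days):
    """Build a backup plan document with one scheduled rule and a lifecycle policy."""
    # AWS Backup expects backups moved to cold storage to stay there at least 90 days,
    # so the retention period has to exceed the transition by 90 days or more.
    if delete_after_days < cold_after_days + 90:
        raise ValueError("delete_after_days must be at least cold_after_days + 90")
    return {
        "BackupPlanName": name,
        "Rules": [
            {
                "RuleName": f"{name}-rule",
                "TargetBackupVaultName": vault,
                "ScheduleExpression": schedule,
                "Lifecycle": {
                    "MoveToColdStorageAfterDays": cold_after_days,
                    "DeleteAfterDays": delete_after_days,
                },
            }
        ],
    }


# Hypothetical values: daily backup at 05:00 UTC into a vault named "Default".
plan = build_backup_plan("daily-backups", "Default", "cron(0 5 ? * * *)", 30, 120)
print(plan["Rules"][0]["ScheduleExpression"])  # cron(0 5 ? * * *)

# With boto3 installed and credentials configured, the document could be submitted as:
#   boto3.client("backup").create_backup_plan(BackupPlan=plan)
```

Keeping the plan as plain data makes it easy to review and version before anything is created in an account.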
Pilot light prep
[Diagram: www.example.com is served from the corporate data center (reverse proxy/caching server, application server, primary database server, data volume). Data mirroring/replication feeds a subordinate database server in the AWS Cloud, labeled as the pilot light system. The reverse proxy/caching server and application server in AWS are labeled "Not running".]
Pilot light recovery
[Diagram: www.example.com is served from the AWS Cloud. The reverse proxy/caching server and application server in AWS are labeled "Start in minutes" and "Add additional capacity, if needed", alongside the database server in AWS. The corporate data center (primary database server, data volume, application server, reverse proxy/caching server) is shown as the failed site.]
Warm standby prep
[Diagram: Route 53 in front of www.example.com, with the corporate data center (reverse proxy/caching server, application server, database server, data volume) handling production. The AWS Region holds a scaled down standby (Elastic Load Balancing, reverse proxy/caching server, application server, database server) labeled "Not active for production traffic". Labels: mirroring/replication between the primary and subordinate database servers; application data source cut over.]
Warm standby recover
[Diagram: Route 53 and Elastic Load Balancing send www.example.com traffic to the AWS Region, where the reverse proxy/caching server, application server, and database server are labeled "Active" and "Scaled up production". The corporate data center (primary database server, data volume, application server, reverse proxy/caching server) is shown as the failed site.]
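The cut over between the warm standby prep and recover diagrams can be thought of as a failover routing decision. Below is a minimal Python sketch of that decision, assuming one health flag per site. The `Endpoint` type, the site names, and `resolve` are hypothetical, and the code does not call Route 53; raising an error when both sites are down is a simplification of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    healthy: bool


def resolve(primary, standby):
    """Return the site that should receive traffic for the domain."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    raise RuntimeError("no healthy endpoint available")


data_center = Endpoint("corporate-data-center", healthy=True)
aws_standby = Endpoint("aws-warm-standby", healthy=True)

print(resolve(data_center, aws_standby).name)  # corporate-data-center

# Simulate a disaster at the primary site: traffic moves to the standby,
# which would then be scaled up to production size.
data_center.healthy = False
print(resolve(data_center, aws_standby).name)  # aws-warm-standby
```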
Hot site prep
[Diagram: Route 53 and Elastic Load Balancing in front of www.example.com, with a full stack (reverse proxy/caching server, application server, database server) in both the corporate data center and the AWS Region. The AWS side is labeled "Active". Labels: mirroring/replication between the database servers; application data source cut over.]
Hot site recovery
[Diagram: Route 53 and Elastic Load Balancing send www.example.com traffic to the AWS Region, where the reverse proxy/caching server, application server, and primary database server are labeled "Active" and "Scaled up for production use". The corporate data center (data volume, application server, database server, reverse proxy/caching server) is shown as the failed site.]
Fire Drills
“Chaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the system’s
capability to withstand turbulent conditions in
production.”
http://principlesofchaos.org
Phases of Chaos Engineering
1. Steady state
2. Hypothesis
3. Run experiment
4. Verify
5. Fix!
Chaos engineering
https://github.com/Netflix/SimianArmy
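The phases (steady state, hypothesis, run experiment, verify, fix) can be sketched as a small Python loop. This is an illustration only: the `system` dictionary, the steady state check, and the instance-killing fault are assumptions modeled loosely on what Chaos Monkey does, not code from SimianArmy.

```python
import random


def run_experiment(system, steady_state, inject_failure):
    """Run one chaos experiment and report which phase it ended in."""
    # Steady state: only experiment on a system that is healthy to begin with.
    if not steady_state(system):
        return "aborted: not in steady state"
    # Hypothesis: the steady state check will still pass after the fault.
    inject_failure(system)
    # Verify: compare the outcome with the hypothesis.
    if steady_state(system):
        return "verified: hypothesis held"
    return "fix: hypothesis disproved"


def kill_random_instance(system):
    """Fault injection in the style of Chaos Monkey: remove one instance at random."""
    victim = random.choice(system["instances"])
    system["instances"].remove(victim)


def at_least_one_instance(system):
    return len(system["instances"]) >= 1


redundant = {"instances": ["i-a", "i-b", "i-c"]}
single = {"instances": ["i-a"]}

print(run_experiment(redundant, at_least_one_instance, kill_random_instance))  # verified: hypothesis held
print(run_experiment(single, at_least_one_instance, kill_random_instance))     # fix: hypothesis disproved
```

The second run is the useful one: it exposes a single point of failure before a real outage does.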
Resiliency: Ability for a system to handle and
eventually recover from unexpected conditions
Partial failure mode
Multi-AZ architecture
[Diagram, shown in four steps: a Region with Availability Zones a, b and c, instances in each zone behind Elastic Load Balancing (ELB), and a DB instance with a standby DB instance. In the final step the standby DB instance is labeled "new master".]
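The Multi-AZ failure sequence (the load balancer stops sending traffic to the failed zone, then the standby database is promoted) can be simulated in a few lines of Python. The `Region` class and its fields are hypothetical names for this sketch; real ELB and RDS behavior involves health checks and timing that are not modeled here.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Region:
    healthy_zones: Set[str]
    db_primary: str             # zone that hosts the primary DB instance
    db_standby: Optional[str]   # zone that hosts the standby, if any


def route_targets(region):
    """Zones the load balancer still sends traffic to."""
    return sorted(region.healthy_zones)


def fail_zone(region, zone):
    """Simulate the loss of one Availability Zone."""
    # Step 1: the load balancer stops routing to the failed zone.
    region.healthy_zones.discard(zone)
    # Step 2: if the primary DB was in that zone, promote the standby.
    if region.db_primary == zone and region.db_standby in region.healthy_zones:
        region.db_primary = region.db_standby
        region.db_standby = None


region = Region(healthy_zones={"a", "b", "c"}, db_primary="a", db_standby="b")
fail_zone(region, "a")
print(route_targets(region))  # ['b', 'c']
print(region.db_primary)      # b
```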
Auto-scaling for self-healing
[Diagram: Elastic Load Balancing (ELB) in front of an Auto Scaling group that spans Availability zone 1 and Availability zone 2 in an AWS Region; an X marks a failure.]
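Self-healing with a fixed-size group amounts to a reconcile step: drop unhealthy instances and launch replacements until the desired capacity is met again. The Python sketch below simulates that step. The instance dictionaries, `launch_instance`, and `reconcile` are hypothetical stand-ins, not the Auto Scaling API.

```python
import itertools

_ids = itertools.count(1)


def launch_instance():
    """Stand-in for launching a replacement instance."""
    return {"id": f"i-replacement-{next(_ids)}", "healthy": True}


def reconcile(instances, desired_capacity):
    """Keep healthy instances and launch replacements up to the desired capacity."""
    healthy = [i for i in instances if i["healthy"]]
    while len(healthy) < desired_capacity:
        healthy.append(launch_instance())
    return healthy


group = [{"id": "i-a", "healthy": True}, {"id": "i-b", "healthy": False}]
group = reconcile(group, desired_capacity=2)
print([i["id"] for i in group])  # ['i-a', 'i-replacement-1']
```

Because the group size is fixed, the loop never scales for load; it only restores the capacity that was lost.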
[Diagrams repeated from earlier slides: Warm standby prep, Warm standby recover, Hot site prep, Hot site recovery.]
Availability concepts
• High availability: Keep your applications running 24x7
• Backup: Make sure your data is safe
• Disaster recovery: Get your applications and data back after a major disaster
Learn storage with AWS Training and Certification
Resources created by the experts at AWS to help you build cloud storage skills
45+ free digital courses cover topics related to cloud storage, including:
• Amazon S3
• AWS Storage Gateway
• Amazon S3 Glacier
• Amazon Elastic File System (Amazon EFS)
• Amazon Elastic Block Store (Amazon EBS)
Classroom offerings, like Architecting on AWS, feature AWS expert instructors and hands-on activities
Visit aws.amazon.com/training/path-storage/
Thank you!
Cobus Bernard
Sr Developer Advocate
Amazon Web Services
@cobusbernard
cobusbernard
cobusbernard

AWS Webinar 24 - Getting Started with AWS - Understanding DR


Editor's Notes

  • #11 RPO: Recovery Point Objective. RTO: Recovery Time Objective.
  • #25 Chaos Monkey: kills instances randomly. Latency Monkey: induces latency in services. Chaos Gorilla: simulates AZ and Region failures. Conformity Monkey: makes sure instances follow good practices.
  • #29 Reduce possibility of correlated failure
  • #31 First the load balancer will detect the zone failure and stop sending traffic to that failed AZ
  • #32 RDS will then also detect the issue, fail over to the healthy AZ, and automatically promote the standby to become the new master. After a few seconds, the application will be up and running again.
  • #33 The most common purpose of an Auto Scaling group is resiliency: instances are put into a fixed-size Auto Scaling group so that if an instance fails, it is automatically replaced. The simplest use case is an Auto Scaling group with a minimum size greater than 1.