Copyright 1995-2018 Arm Limited (or its affiliates). All rights reserved.
Big Data Conference in Vilnius 2018
Kai Sasaki
Infrastructure for Auto Scaling Distributed System
Bio
Kai Sasaki (佐々木 海)
• Senior Software Engineer at Arm Treasure Data since 2015
• Hadoop, Presto, Spark, TensorFlow.js, Apache Hivemall
• Books
– Available as paperback and ebook.
• Twitter
– @Lewuathe
Agenda
• Who is Treasure Data?
• What is distributed data analysis?
• What kind of challenges do we have?
– Operational Cost
– Stability and Scalability
• Our Approach
– AWS CodeDeploy & Auto Scaling Group
– Query Simulation
– Graceful/Force Shutdown
Who is Treasure Data?
Treasure Data
Founded in December 2011 in Silicon Valley
• Mountain View, CA
• DMP, eCDP, IoT, Cloud
• We joined Arm in October 2018
Treasure Data
We provide an end-to-end integrated data analysis platform.
• Data Ingestion
– Mobile Device, Automotive, IoT
• Enterprise Customer Data Platform
• Service Integration
– BI tool (e.g. Tableau)
– Marketing tool
Treasure Data
Open Source Lover
• Fluentd
• Embulk
• Digdag
• Apache Hivemall
Enterprise Data Analysis
• Scalable processing
• Reliable platform
• Secure data protection
Arm Pelion Platform
Treasure Data is part of the Arm Pelion IoT Platform
• Flexibility in connectivity management
• Efficient data processing
Distributed Data Analysis
Distributed Data Analysis
A service component that enables us to process huge datasets
Scalability
• Easy to do horizontal scaling
• Flexible to the business requirement
– Interface (e.g. SQL)
– Data Format
Throughput
• Impossible to scale with a single-node machine
• Business requirement for batch processing (e.g. daily batch)
Data Consistency
• Write-side operations are possible
– INSERT, DELETE, UPDATE
• Correct measurement is the key for data analysis
Distributed Processing Engines
A number of open source software projects are available for distributed processing
• Hadoop
• Presto
• Spark
• Kafka
Typical Architecture
Master-Worker Model
https://www.tutorialspoint.com/apache_presto/apache_presto_architecture.htm
Distributed Plan
select
t1.class,
t2.features,
count(1)
from iris t1
join iris t2
on t1.class = t2.class
group by 1, 2;
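The plan diagram for this query did not survive in the text, so the following is only a rough sketch of how a master-worker engine might split such a query, not the plan Presto actually produces. It assumes rows are hash-partitioned on the join key `class`, each worker joins and counts its own partition, and the coordinator merges the partial results. The function names and the in-memory row format are hypothetical.

```python
from collections import Counter, defaultdict


def hash_partition(rows, key, num_workers):
    """Exchange stage: route each row to a worker by hashing the join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % num_workers].append(row)
    return partitions


def local_join_and_count(rows):
    """Worker stage: self-join on class, then count per (class, features)."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row["class"]].append(row)
    counts = Counter()
    for group in by_class.values():
        for t1 in group:          # left side of the self-join
            for t2 in group:      # right side, same class by construction
                counts[(t1["class"], t2["features"])] += 1
    return counts


def run_distributed(rows, num_workers=3):
    """Coordinator stage: merge the per-worker partial counts."""
    total = Counter()
    for worker_rows in hash_partition(rows, "class", num_workers).values():
        total.update(local_join_and_count(worker_rows))
    return dict(total)


iris = [
    {"class": "setosa", "features": "a"},
    {"class": "setosa", "features": "b"},
    {"class": "virginica", "features": "c"},
]
result = run_distributed(iris)
```

Because the partition key is also part of the group-by key, every group lives on exactly one worker and the merge step is a plain union of counters.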
Challenges
Challenges for Distributed Data Analysis
Maintaining a distributed data analysis platform in the real world is not easy.
• Operation
– Deployment
– Log Investigation
– Monitoring
• Money
– Large Scale Cluster
– Network Cost
• Stability
– Capacity Sufficiency
Challenges for Distributed Data Analysis
Manual launch/termination?
Is the capacity estimation correct?
Which version is deployed?
What kind of metrics do we need to monitor?
How much does it cost?
MANUALLY
Our Approach
Our Approach
Practical solutions that take full advantage of public cloud services
• AWS CodeDeploy
– Integration with Auto Scaling Group
• EC2 Auto Scaling Group
– Load test by Query Simulation
– Metric Based Capacity Estimation
– Graceful/Force Instance Termination
CodeDeploy
Deployment Service in AWS
• Easy to Integrate with Auto Scaling Group
• Available Everywhere
– Supporting On-Premises Instances
• Scalable for distributed system use cases
• https://docs.aws.amazon.com/codedeploy/index.html
Auto Scaling System
The system should scale automatically, without any manual operation
• Load test by Query Simulation
• Metric Based Capacity Estimation
• Graceful Termination & Force Termination
Query Simulation
Load tests should be based on real-world workloads.
• Get the query list from our customers' past query history
• Cluster the queries by query signature
• Construct the data set and the query list based on that history
• This lets us run load tests easily against a production-like workload
Query Signature
Query signature represents a query in a shortened format.
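The signature format shown on the original slide is not preserved here, so the following is only an illustrative sketch of the idea. It assumes a signature is the query text with literals replaced by `?` and whitespace and case normalized, and that clustering means grouping queries that share a signature. `query_signature` and `cluster_by_signature` are hypothetical names, not Treasure Data's implementation.

```python
import re
from collections import defaultdict


def query_signature(sql: str) -> str:
    """Reduce a query to a shortened, literal-free form (assumed format)."""
    sig = re.sub(r"'(?:[^']|'')*'", "?", sql)      # string literals -> ?
    sig = re.sub(r"\b\d+(?:\.\d+)?\b", "?", sig)   # numeric literals -> ?
    sig = re.sub(r"\s+", " ", sig).strip().lower()  # normalize spacing/case
    return sig


def cluster_by_signature(queries):
    """Group raw queries so that each cluster shares one signature."""
    clusters = defaultdict(list)
    for q in queries:
        clusters[query_signature(q)].append(q)
    return dict(clusters)


history = [
    "SELECT * FROM logs WHERE user_id = 42 AND name = 'bob'",
    "select *  from logs where user_id = 7 and name = 'alice'",
    "SELECT count(1) FROM events",
]
clusters = cluster_by_signature(history)
```

With clusters like these, a load test can replay one representative query per signature, weighted by how often that signature appears in the history.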
Query Simulation
Conductor
c5.9xlarge
1. Get raw query list
2. Construct test data and query list
Metric Based Capacity Estimation
Designed to achieve target metric value by adjusting capacity
• Add/remove instances in proportion to how far the metric is from its target value
• e.g. Target average CPU usage = 40%
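A minimal sketch of the proportional rule, assuming the metric scales inversely with the number of instances (which holds only roughly for CPU usage). The function name and the `min_size`/`max_size` bounds are illustrative assumptions, not the policy the service actually runs.

```python
import math


def desired_capacity(current_instances, current_metric, target_metric,
                     min_size=1, max_size=100):
    """Return the instance count expected to bring the metric to its target.

    If 10 instances run at 80% CPU and the target is 40%, the same load
    spread over 20 instances should sit near 40%.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = current_instances * current_metric / target_metric
    # Round up so we never under-provision, then clamp to the group limits.
    return max(min_size, min(max_size, math.ceil(raw)))


scale_out = desired_capacity(10, 80.0, 40.0)   # load is double the target
scale_in = desired_capacity(10, 20.0, 40.0)    # load is half the target
```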
Metric Based Capacity Estimation
Designed to achieve target metric value by adjusting capacity
• 40% is the threshold that balances cost and performance
Graceful Termination
Terminating instances gracefully
• Avoid degrading the user experience
• Lifecycle hook in auto scaling group
• Cron job to check running tasks
– Number of tasks in the worker
– Send completion to lifecycle hook
https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroupLifecycle.html
Graceful Termination
Terminating instances gracefully
1. The instance is moved to the Terminating:Wait state
2. The cron job makes the state transition to Terminating:Proceed
– Sends the lifecycle hook completion
3. The instance is gracefully terminated
– The ASG terminates the instance
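The cron job in these steps could look roughly like the sketch below. It assumes the job is handed a client exposing `complete_lifecycle_action` with the keyword arguments used by the boto3 Auto Scaling client, and a callable that reports how many tasks the worker is still running. `check_and_complete`, `RecordingClient`, and the instance, group, and hook names are hypothetical.

```python
def check_and_complete(asg_client, count_running_tasks, *,
                       instance_id, asg_name, hook_name):
    """Run once per cron tick on an instance in Terminating:Wait.

    Returns True if the lifecycle action was completed, False if tasks
    are still running and the instance must keep waiting.
    """
    if count_running_tasks() > 0:
        return False
    # CONTINUE moves the instance from Terminating:Wait to Terminating:Proceed.
    asg_client.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return True


class RecordingClient:
    """Stand-in for boto3.client("autoscaling") so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def complete_lifecycle_action(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
names = dict(instance_id="i-0123456789abcdef0",
             asg_name="example-workers", hook_name="example-drain-hook")
still_busy = check_and_complete(client, lambda: 3, **names)
drained = check_and_complete(client, lambda: 0, **names)
```

In a real deployment the recording client would be replaced by the actual Auto Scaling client, and the task counter would query the worker itself.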
Force Termination
Long-running tasks can block graceful termination
• Put a “timeout” limit on graceful termination
• Simulate “how long it takes to terminate gracefully”
[Figure: chart with axes labeled Date and Time]
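One way to run such a simulation is sketched below, under the assumption that we know the remaining runtime of every task on a worker at the moment it enters the waiting state. `simulate_drain` and its result keys are hypothetical, and the 600-second timeout is only an example value.

```python
def simulate_drain(remaining_seconds, timeout_seconds):
    """Estimate how a worker drains under a graceful-termination timeout.

    remaining_seconds: remaining runtime of each task still on the worker.
    """
    longest = max(remaining_seconds, default=0)
    killed = [r for r in remaining_seconds if r > timeout_seconds]
    return {
        # Time a purely graceful termination would need.
        "graceful_seconds": longest,
        # Time actually spent waiting once the timeout is enforced.
        "actual_seconds": min(longest, timeout_seconds),
        # Whether force termination kicks in, and how many tasks it kills.
        "forced": bool(killed),
        "killed_tasks": len(killed),
    }


busy_worker = simulate_drain([120, 300, 900], timeout_seconds=600)
idle_worker = simulate_drain([], timeout_seconds=600)
```

Running this over historical task durations gives a feel for how many queries a given timeout would cut off, which is the trade-off the timeout has to balance.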
Instance Termination
Balance between customer experience and cost optimization.
Graceful Termination
Keeping queries running as long as possible satisfies customer expectations.
• Non-fault-tolerant systems such as Presto
• Distributed analysis workloads tend to be too long to be retried
Force Termination
Cost optimization is one of the primary goals of auto scaling.
• Scaling out/in within around 10 minutes does not lose agility for capacity adjustment
• Force termination that affects only queries running over 10 minutes is acceptable
Recap
• Who is Treasure Data?
• What is distributed data analysis?
• What kind of challenges do we have?
– Operational Cost
– Stability and Scalability
• Our Approach
– AWS CodeDeploy & Auto Scaling Group
– Query Simulation
– Graceful/Force Shutdown
Thank You!
Danke!
Merci!
谢谢!
Gracias!
Kiitos!