Continuous Optimization for Distributed Big Data Analysis
Kai Sasaki (Treasure Data)
Bio
Kai Sasaki
- Software Engineer at Treasure Data
- Hadoop, Presto
- Apache Hivemall
- Books
Design and Concept
Image credit: https://pixabay.com/en/desktop-tidy-clean-mockup-white-2325627/
Agenda
- Who is Treasure Data?
- What is distributed data analysis?
- What challenges do we have?
- Our approach
- Columnar Storage
- Partitioning
- Repartitioning
Treasure Data
• Founded in Dec 2011
• Mountain View, CA
• DMP, CDP, IoT, Cloud
• Joined Arm in Oct 2018
Treasure Data
• Open Source Lovers
Enterprise Data Analysis
Arm x Treasure Data
• Pelion: Device-to-Device Platform
Challenges Based on Our Experience
Image credit: https://pixabay.com/en/adventure-height-climbing-mountain-1807524/
Distributed Data Analysis?
• Large Scale Data
• High Throughput
• High Availability & Reliability
• Data Consistency
Distributed Processing Engines
• Hadoop
• Presto
• Spark
Typical Architecture
• Master-Worker model
Diagram: https://www.tutorialspoint.com/apache_presto/apache_presto_architecture.htm
Distributed Plan
select
  t1.class,
  t2.features,
  count(1)
from iris t1
join iris t2
  on t1.class = t2.class
group by 1, 2;
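To inspect how Presto fragments this query across workers, the distributed plan can be printed with EXPLAIN (TYPE DISTRIBUTED):

-- Prints the plan fragments and the exchanges between them
EXPLAIN (TYPE DISTRIBUTED)
select
  t1.class,
  t2.features,
  count(1)
from iris t1
join iris t2
  on t1.class = t2.class
group by 1, 2;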
Challenges
• Network Bandwidth
• Throughput
• Transactional Processing
• Data Consistency
• System Reliability
• Service Availability
Our Approach
• Columnar Storage
• MessagePack-based columnar format
• Time Index Pushdown
• Optimization of Partitioning Layout
Columnar Storage
• General design for OLAP workloads
• Saves I/O bandwidth
• Enables efficient compression and encoding
• e.g. Parquet, ORC (see the sketch below)
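As a point of reference only (this is not Treasure Data's format): Presto's Hive connector can materialize a table in a columnar file format via a table property, which is where the I/O and compression savings come from. The table names here are hypothetical:

-- Hypothetical example: write a table as ORC via the Presto Hive connector
CREATE TABLE hive.example.events_orc
WITH (format = 'ORC') AS
SELECT * FROM hive.example.events;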
MessagePack
• JSON-like binary serialization format
• Faster and smaller
• 100+ implementations
• https://msgpack.org
MessagePack x Columnar File
• Type-embedded file format
• Schema-on-Read
• -> Saves network bandwidth and storage space
Time Index Pushdown
• Read skipping by time range (example below)
• Fits typical analytical use cases
• Saves network bandwidth
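For example, a query bounded with TD_TIME_RANGE (the same predicate used in the Stella example later) only touches the partitions overlapping the range. The table name here is hypothetical:

-- Only partitions overlapping this one-day range are read from storage
SELECT count(1)
FROM events
WHERE TD_TIME_RANGE(time, '2018-10-01', '2018-10-02');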
Time Index Pushdown
• Indexed by PostgreSQL
• Transactional Update
• Data Consistency
• GiST index enables efficient multi-column indexing (sketch below)
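A minimal PostgreSQL sketch of the idea, assuming a partition-metadata table like the one below (not Treasure Data's actual schema): the btree_gist extension lets a scalar column share a GiST index with a range column, and the && operator finds overlapping partitions.

CREATE EXTENSION IF NOT EXISTS btree_gist;  -- GiST support for scalar columns

-- Hypothetical metadata table: one row per partition file
CREATE TABLE partitions (
  table_id   bigint    NOT NULL,
  time_range int8range NOT NULL,  -- [start, end) in unix time
  path       text      NOT NULL
);

-- Multi-column GiST index over table id and time range
CREATE INDEX partitions_time_idx
  ON partitions USING gist (table_id, time_range);

-- The planner uses the index to find partitions overlapping a query range
SELECT path
FROM partitions
WHERE table_id = 42
  AND time_range && int8range(1538352000, 1538438400);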
Time-Range Partitioning
Partition Size?
• Partition file size significantly affects performance
• e.g. 1,000,000 records / file
• e.g. 256MB / file
• But the optimal size depends on the workload
Auto Optimization
• Partitioning layout should fit the actual workload
• File size
• Time range
• Partitioning key
Repartitioning
• Many small partition files -> high I/O overhead
• A few large partition files -> high memory pressure
• A trade-off problem
Repartitioning
• The partitioning key determines throughput (example below)
• e.g. Customer segmentation by
• User ID
• Purchased item
• Home address
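For instance, when the data is already partitioned on user_id, an aggregation like the following (hypothetical table) can run partition-local on each worker instead of shuffling every row across the network:

-- Partitioned on user_id: each worker aggregates its own partitions
SELECT user_id, count(1) AS purchases
FROM purchases
GROUP BY user_id;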
User Defined Partitioning
• A custom partitioning scheme defined on the user side (or by ourselves); see the sketch below
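A sketch of what defining such a scheme can look like, using the Presto Hive connector's bucketing properties as a stand-in (Treasure Data's actual UDP syntax may differ; the names are hypothetical):

-- Bucket rows by user_id into a fixed number of buckets at write time
CREATE TABLE hive.cdp.user_events
WITH (
  bucketed_by  = ARRAY['user_id'],
  bucket_count = 512
) AS
SELECT * FROM hive.cdp.raw_events;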
Colocated Join
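When both sides of a join are partitioned on the join key with the same bucket count, matching rows already live on the same worker, so the join can run without redistributing data over the network. A hypothetical example:

-- Both tables bucketed on user_id with identical bucket counts:
-- the join executes locally on each worker, with no shuffle stage
SELECT e.user_id, count(1)
FROM user_events e
JOIN user_profiles p
  ON e.user_id = p.user_id
GROUP BY 1;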
User Defined Partitioning
• Granularity
• Partitioning Key Selection
Stella Connector
• Repartitioning & UDP are implemented as a Presto connector
• Leverages Presto's high scalability and reliability for this heavy workload
Stella Connector
CREATE TABLE remerged
WITH (max_file_size = '256MB', max_time_range = '48h') AS
SELECT *
FROM partition.sources
WHERE table_schema = 'tpch_s1'
  AND table_name = 'lineitem'
  AND TD_TIME_RANGE(time, '1998-10-11', '1998-10-20')
Stella Connector
• Scalable
• Reliable
• Easy to embed into a workflow
• Automatic Storage Optimization!
Recap
- Treasure Data Overview
- Architecture of Distributed Data Analysis
- Challenges
- Our Approach
- Columnar Storage
- Partitioning
- Repartitioning
Thanks!