Continuous Optimization for Distributed Big Data Analysis
Kai Sasaki (Treasure Data)
Bio
Kai Sasaki
- Software Engineer at Treasure Data
- Hadoop, Presto
- Apache Hivemall
- Books
Design and Concept
Image credit: https://pixabay.com/en/desktop-tidy-clean-mockup-white-2325627/
Agenda
- Who is Treasure Data?
- What is distributed data analysis?
- What challenges do we have?
- Our approach
- Columnar Storage
- Partitioning
- Repartitioning
Treasure Data
• Founded in Dec 2011
• Mountain View, CA
• DMP, CDP, IoT, Cloud
• Joined Arm in Oct 2018
Treasure Data
• Open Source Lovers
Enterprise Data Analysis
Arm x Treasure Data
• Pelion: Device-to-Device Platform
Challenges Based on Our Experience
Image credit: https://pixabay.com/en/adventure-height-climbing-mountain-1807524/
Distributed Data Analysis?
• Large Scale Data
• High Throughput
• High Availability & Reliability
• Data Consistency
Distributed Processing Engines
• Hadoop
• Presto
• Spark
Typical Architecture
• Master-Worker model
Diagram: https://www.tutorialspoint.com/apache_presto/apache_presto_architecture.htm
Distributed Plan
select
  t1.class,
  t2.features,
  count(1)
from iris t1
join iris t2
  on t1.class = t2.class
group by 1, 2;
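To inspect how Presto fragments this query across workers, the distributed plan can be printed with EXPLAIN (TYPE DISTRIBUTED):

-- Prints the plan fragments and the exchanges between them
EXPLAIN (TYPE DISTRIBUTED)
select
  t1.class,
  t2.features,
  count(1)
from iris t1
join iris t2
  on t1.class = t2.class
group by 1, 2;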
Challenges
• Network Bandwidth
• Throughput
• Transactional Processing
• Data Consistency
• System Reliability
• Service Availability
Our Approach
• Columnar Storage
• MessagePack-based columnar format
• Time Index Pushdown
• Optimization of Partitioning Layout
Columnar Storage
• General design for OLAP workloads
• Saves I/O bandwidth
• Enables efficient compression and encoding
• e.g. Parquet, ORC (see the sketch below)
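As a point of reference only (this is not Treasure Data's format): Presto's Hive connector can materialize a table in a columnar file format via a table property, which is where the I/O and compression savings come from. The table names here are hypothetical:

-- Hypothetical example: write a table as ORC via the Presto Hive connector
CREATE TABLE hive.example.events_orc
WITH (format = 'ORC') AS
SELECT * FROM hive.example.events;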
MessagePack
• JSON-like binary serialization format
• Faster and smaller
• 100+ implementations
• https://msgpack.org
MessagePack x Columnar File
• Type-embedded file format
• Schema-on-Read
• -> Saves network bandwidth and storage space
Time Index Pushdown
• Read skipping by time range (example below)
• Fits typical analytical use cases
• Saves network bandwidth
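For example, a query bounded with TD_TIME_RANGE (the same predicate used in the Stella example later) only touches the partitions overlapping the range. The table name here is hypothetical:

-- Only partitions overlapping this one-day range are read from storage
SELECT count(1)
FROM events
WHERE TD_TIME_RANGE(time, '2018-10-01', '2018-10-02');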
Time Index Pushdown
• Indexed by PostgreSQL
• Transactional Update
• Data Consistency
• GiST index enables efficient multi-column indexing (sketch below)
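A minimal PostgreSQL sketch of the idea, assuming a partition-metadata table like the one below (not Treasure Data's actual schema): the btree_gist extension lets a scalar column share a GiST index with a range column, and the && operator finds overlapping partitions.

CREATE EXTENSION IF NOT EXISTS btree_gist;  -- GiST support for scalar columns

-- Hypothetical metadata table: one row per partition file
CREATE TABLE partitions (
  table_id   bigint    NOT NULL,
  time_range int8range NOT NULL,  -- [start, end) in unix time
  path       text      NOT NULL
);

-- Multi-column GiST index over table id and time range
CREATE INDEX partitions_time_idx
  ON partitions USING gist (table_id, time_range);

-- The planner uses the index to find partitions overlapping a query range
SELECT path
FROM partitions
WHERE table_id = 42
  AND time_range && int8range(1538352000, 1538438400);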
Time-Range Partitioning
Partition Size?
• Partition file size significantly affects performance
• e.g. 1,000,000 records / file
• e.g. 256MB / file
• But the optimal size depends on the workload
Auto Optimization
• Partitioning layout should fit the actual workload
• File size
• Time range
• Partitioning key
Repartitioning
• Many small partition files -> high I/O overhead
• A few large partition files -> high memory pressure
• A trade-off problem
Repartitioning
• The partitioning key determines throughput (example below)
• e.g. Customer segmentation by
• User ID
• Purchased item
• Home address
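For instance, when the data is already partitioned on user_id, an aggregation like the following (hypothetical table) can run partition-local on each worker instead of shuffling every row across the network:

-- Partitioned on user_id: each worker aggregates its own partitions
SELECT user_id, count(1) AS purchases
FROM purchases
GROUP BY user_id;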
User Defined Partitioning
• A custom partitioning scheme defined on the user side (or by ourselves); see the sketch below
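A sketch of what defining such a scheme can look like, using the Presto Hive connector's bucketing properties as a stand-in (Treasure Data's actual UDP syntax may differ; the names are hypothetical):

-- Bucket rows by user_id into a fixed number of buckets at write time
CREATE TABLE hive.cdp.user_events
WITH (
  bucketed_by  = ARRAY['user_id'],
  bucket_count = 512
) AS
SELECT * FROM hive.cdp.raw_events;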
Colocated Join
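When both sides of a join are partitioned on the join key with the same bucket count, matching rows already live on the same worker, so the join can run without redistributing data over the network. A hypothetical example:

-- Both tables bucketed on user_id with identical bucket counts:
-- the join executes locally on each worker, with no shuffle stage
SELECT e.user_id, count(1)
FROM user_events e
JOIN user_profiles p
  ON e.user_id = p.user_id
GROUP BY 1;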
User Defined Partitioning
• Granularity
• Partitioning Key Selection
Stella Connector
• Repartitioning & UDP are implemented as a Presto connector
• Leverages Presto's high scalability and reliability for this heavy workload
Stella Connector
CREATE TABLE remerged
WITH (max_file_size = '256MB', max_time_range = '48h') AS
SELECT *
FROM partition.sources
WHERE table_schema = 'tpch_s1'
  AND table_name = 'lineitem'
  AND TD_TIME_RANGE(time, '1998-10-11', '1998-10-20')
Stella Connector
• Scalable
• Reliable
• Easy to embed into a workflow
• Automatic Storage Optimization!
Recap
- Treasure Data Overview
- Architecture of Distributed Data Analysis
- Challenges
- Our Approach
- Columnar Storage
- Partitioning
- Repartitioning
Thanks!