Scalable data pipeline

Be ready for big data challenges
by Leonid Sokolov
Big Data Architect at greenm.io

Agenda
• Data Lake Monsters Recap
• Scale Distributed Data Processing
• Technologies

Data Pipeline Challenges
• Complex workflow management
• AWS Athena doesn’t scale well

Scale Distributed Data Processing

Basic Data Actions
Extract
Transform
Load

Extract → Transform → Load
Each stage: Input → Process → Output

Extract
Database: Partition by Integer Key
• Key range: 12569933 to 136808009
• Step: 12569933 + 1000000 = 13569933

Database: Partition by Varchar Key
• Key range: 00000083645ec0c06727066d249cfd01 to ffffffcb586df761f63f561d946ac7c5
• Step: 00000083645ec0c06727066d249cfd01 + ??? = ???

Database: Varchar key
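
The integer-key arithmetic (12569933 + 1000000 = 13569933) extends to a full list of range predicates, one per parallel read. A varchar key has no obvious step to add. One possible approach, which is an assumption here and not necessarily what the slide showed, is to split on leading hex characters; this works when keys are lowercase hex strings spread roughly evenly, as the sample keys appear to be. The column name `Id`, the upper bound 136808009 as read from the key-range diagram, and all three helper functions are illustrative.

```python
def integer_ranges(lo, hi, step):
    """Split the closed key range [lo, hi] into half-open ranges [start, end)."""
    ranges = []
    start = lo
    while start <= hi:
        end = min(start + step, hi + 1)
        ranges.append((start, end))
        start = end
    return ranges


def hex_prefix_ranges(prefix_len=1):
    """Half-open string ranges covering all lowercase hex keys, one per prefix.

    The last range has no upper bound (None), so no key is missed.
    """
    count = 16 ** prefix_len
    prefixes = [format(i, "0{}x".format(prefix_len)) for i in range(count)]
    return [
        (prefixes[i], prefixes[i + 1] if i + 1 < count else None)
        for i in range(count)
    ]


def to_predicate(column, start, end, quote=False):
    """Render one range as a SQL WHERE fragment (column name is hypothetical)."""
    fmt = (lambda v: "'{}'".format(v)) if quote else str
    if end is None:
        return "{} >= {}".format(column, fmt(start))
    return "{0} >= {1} AND {0} < {2}".format(column, fmt(start), fmt(end))


# Integer key: bounds taken from the slide, step of 1,000,000 keys per read.
int_ranges = integer_ranges(12569933, 136808009, 1000000)
int_predicates = [to_predicate("Id", s, e) for s, e in int_ranges]

# Varchar key: 16 ranges, one per leading hex character.
hex_ranges = hex_prefix_ranges(1)
hex_predicates = [to_predicate("Id", s, e, quote=True) for s, e in hex_ranges]
```

A list like `hex_predicates` can be passed to Spark's JDBC reader through its `predicates` argument, which creates one partition per predicate.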

Files
• Use scalable storage: HDFS, S3
• Use multiple files
• Use splittable file formats and compression

Format    Splittable
CSV       Yes*
JSON      Yes**
Parquet   Yes

Compression   Splittable
gzip          No
bzip2         Yes
Snappy        No

• * CSV is splittable when it is a raw, uncompressed file or uses a splittable compression format such as bzip2.
• ** JSON has the same splittability conditions as CSV when compressed, with one difference: when the “wholeFile” option is set to true (re: SPARK-18352), JSON is NOT splittable.
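
The two tables and their footnotes can be folded into a single check, sketched below. The function name and its arguments are hypothetical. Parquet is treated as always splittable, as in the table, because its compression is applied inside the file and not to the file as a whole.

```python
# Splittability of whole-file compression, as listed in the table.
SPLITTABLE_COMPRESSION = {None: True, "bzip2": True, "gzip": False, "snappy": False}


def is_splittable(file_format, compression=None, whole_file=False):
    """Return True if a file can be read in parallel by several tasks."""
    fmt = file_format.lower()
    if fmt == "parquet":
        # Compression happens inside the Parquet file, so it stays splittable.
        return True
    if fmt == "json" and whole_file:
        # Footnote **: the wholeFile option makes JSON non-splittable.
        return False
    if fmt in ("csv", "json"):
        codec = compression.lower() if compression else None
        return SPLITTABLE_COMPRESSION[codec]
    raise ValueError("format not covered by the table: " + file_format)
```

For example, `is_splittable("csv", "gzip")` is false, so a single large gzip-compressed CSV is read by one task regardless of cluster size.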

• Use scalable storage: HDFS, S3
• Spark on S3: mapreduce.fileoutputcommitter.algorithm.version = 2
• EMR 5.20.0 or later
• Use multiple files (ideally the same number as in the input)
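
One way to apply the committer setting is to pass it to `spark-submit` with the `spark.hadoop.` prefix, which Spark forwards to the Hadoop configuration. The helper below is a hypothetical sketch that only builds the command-line arguments; how the job is actually launched is not shown in the slides.

```python
def spark_submit_conf_args(conf):
    """Turn a dict of Spark settings into spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", "{}={}".format(key, conf[key])])
    return args


conf = {
    # Hadoop property from the slide, forwarded via the spark.hadoop. prefix.
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": 2,
}
args = spark_submit_conf_args(conf)
```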

Extract Results
[Chart: Extract Time (minutes), Before vs. After]
• EMR 5.20.0 with 10 instances (c4.4xlarge)
• Input: 3 Databases (MS SQL), ~400GB (Raw Data)
• Output: Parquet (Snappy), ~100GB

Transform
Volume of Data
Partitions
Data Skew

Shuffle Join
SELECT * FROM Encounters e JOIN Providers p ON e.ProviderId = p.ProviderId
[Diagram: Map, Shuffle, Reduce]

Broadcast Join
SELECT * FROM Encounters e JOIN Providers p ON e.ProviderId = p.ProviderId
[Diagram: Collect, Broadcast, Map, Reduce]
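
The difference between the two join strategies can be sketched in plain Python, with lists of dicts standing in for distributed tables. This is a simplified single-process model, not Spark's implementation. The `EncounterId` and `Name` columns and the partition count are assumptions for illustration.

```python
from collections import defaultdict


def shuffle_join(encounters, providers, num_partitions=4):
    """Shuffle join: rows of BOTH tables are moved to a partition by key hash."""
    enc_parts = defaultdict(list)
    prov_parts = defaultdict(list)
    for e in encounters:
        enc_parts[hash(e["ProviderId"]) % num_partitions].append(e)
    for p in providers:
        prov_parts[hash(p["ProviderId"]) % num_partitions].append(p)

    joined = []
    for part in range(num_partitions):
        # Matching keys always land in the same partition, so join locally.
        lookup = defaultdict(list)
        for p in prov_parts.get(part, []):
            lookup[p["ProviderId"]].append(p)
        for e in enc_parts.get(part, []):
            for p in lookup.get(e["ProviderId"], []):
                joined.append(dict(e, **p))
    return joined


def broadcast_join(encounters, providers):
    """Broadcast join: the small table is copied whole; the big one never moves."""
    lookup = defaultdict(list)
    for p in providers:
        lookup[p["ProviderId"]].append(p)
    return [
        dict(e, **p)
        for e in encounters
        for p in lookup.get(e["ProviderId"], [])
    ]


encounters = [
    {"EncounterId": 1, "ProviderId": 10},
    {"EncounterId": 2, "ProviderId": 20},
    {"EncounterId": 3, "ProviderId": 10},
    {"EncounterId": 4, "ProviderId": 99},  # no matching provider
]
providers = [
    {"ProviderId": 10, "Name": "A"},
    {"ProviderId": 20, "Name": "B"},
]

by_id = lambda row: row["EncounterId"]
shuffled = sorted(shuffle_join(encounters, providers), key=by_id)
broadcast = sorted(broadcast_join(encounters, providers), key=by_id)
```

Both functions return the same rows. The broadcast version avoids redistributing the large `Encounters` table, which is why it pays off when `Providers` is small.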

Reduce Shuffles
• Use data partitioning, bucketing, sorting
• Broadcast small tables when joining them to a big table
• spark.sql.autoBroadcastJoinThreshold = 10485760 (10 MB, default)
• Use COUNT(key) instead of COUNT(DISTINCT key) if possible (for example, when the key is already unique)
• Drop unused data
• Filter/reduce before join
• Cache Datasets used multiple times
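
The "drop unused data" and "filter/reduce before join" advice can be illustrated with a small plain-Python sketch: pruning rows and columns first means less data has to reach the join. The `Year` and `Notes` columns, the sample rows, and both helper functions are hypothetical.

```python
def prune(rows, keep_columns, predicate):
    """Keep only the rows that pass the predicate and only the needed columns."""
    return [{c: row[c] for c in keep_columns} for row in rows if predicate(row)]


def hash_join(left, right, key):
    """Minimal in-memory equi-join used here to compare the two orderings."""
    lookup = {}
    for r in right:
        lookup.setdefault(r[key], []).append(r)
    return [dict(l, **r) for l in left for r in lookup.get(l[key], [])]


encounters = [
    {"EncounterId": 1, "ProviderId": 10, "Year": 2018, "Notes": "n1"},
    {"EncounterId": 2, "ProviderId": 20, "Year": 2019, "Notes": "n2"},
    {"EncounterId": 3, "ProviderId": 10, "Year": 2019, "Notes": "n3"},
    {"EncounterId": 4, "ProviderId": 30, "Year": 2019, "Notes": "n4"},
]
providers = [
    {"ProviderId": 10, "Name": "A"},
    {"ProviderId": 20, "Name": "B"},
]

# Filter and project first: fewer, narrower rows enter the join.
slim = prune(encounters, ["EncounterId", "ProviderId"], lambda r: r["Year"] == 2019)
early = hash_join(slim, providers, "ProviderId")

# Join first, filter afterwards: same matches, but every row and column is joined.
late = [r for r in hash_join(encounters, providers, "ProviderId") if r["Year"] == 2019]
```

Both orderings select the same encounters, but the first one joins three two-column rows where the second joins four four-column rows.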

Transform Results
[Chart: Transform Time (minutes), Before vs. After]
• EMR 5.20.0 with 10 instances (c4.8xlarge)
• Input: Parquet (Snappy), 50GB
• Output: ORC (ZLib), 19GB

Load
COPY dm1.FactTable
(
Column1,
Column2,
DateColumnTZ FILLER TIMESTAMPTZ,
DateColumn AS DateColumnTZ AT TIME ZONE 'UTC',
Column4,
...
)
FROM 's3://bucket/prod/datamarts/dm1/FactTable/snapshots/snapshotid=20190418/part-*.orc'
ORC
DIRECT
ABORT ON ERROR;
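
Since the load stage is listed as Python + SQL, a statement like the one above could be generated per snapshot. The sketch below only builds the SQL text; the function names are hypothetical, the columns elided by "..." in the slide are left out, and executing the statement against Vertica is not shown.

```python
def snapshot_glob(snapshot_id):
    """Build the S3 glob for one snapshot; the ID must be a yyyymmdd string."""
    if not (snapshot_id.isdigit() and len(snapshot_id) == 8):
        raise ValueError("snapshot_id must be 8 digits, got: " + repr(snapshot_id))
    return (
        "s3://bucket/prod/datamarts/dm1/FactTable/snapshots/"
        "snapshotid={}/part-*.orc".format(snapshot_id)
    )


def build_copy_statement(table, column_defs, s3_glob):
    """Assemble a COPY statement in the same shape as the slide's example."""
    columns = ",\n".join(column_defs)
    return "COPY {}\n(\n{}\n)\nFROM '{}'\nORC\nDIRECT\nABORT ON ERROR;".format(
        table, columns, s3_glob
    )


# Only the columns named on the slide; the elided ones are not reconstructed.
column_defs = [
    "Column1",
    "Column2",
    "DateColumnTZ FILLER TIMESTAMPTZ",
    "DateColumn AS DateColumnTZ AT TIME ZONE 'UTC'",
    "Column4",
]
statement = build_copy_statement(
    "dm1.FactTable", column_defs, snapshot_glob("20190418")
)
```

Validating the snapshot ID before formatting keeps unexpected input out of the generated SQL.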

Load Results
[Chart: Load Time (minutes), Before vs. After]
• Input: ORC (ZLib), 19GB
• Output: Vertica DB (7 nodes)

Summary
• Build architecture for scale
• Consider tomorrow’s data volume
• Build with failure in mind
• Understand the risks and be ready to respond

Technologies
                     Extract           Transform         Load
Environment          AWS, EMR 5.20.0   AWS, EMR 5.20.0   AWS Batch
Technology           Spark 2.4         Spark 2.4         Vertica
Languages            Scala + SQL       Scala + SQL       Python + SQL
Input                MS SQL, MySQL     S3                S3
Input format         Tables            Parquet           ORC, Parquet
Input compression    -                 Snappy            Zlib, Snappy
Output               S3                S3                Vertica
Output format        Parquet           ORC, Parquet      Tables
Output compression   Snappy            Zlib, Snappy      Native
