Towards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen Li

Towards Apache Flink 2.0
- Unified Data Processing and Beyond
Bowen Li
Committer@Apache Flink, Senior Engineer@Alibaba
Nov 14, 2019 @ Scale By the Bay

Agenda
● Flink use cases
● Flink in a nutshell - what makes Flink successful in stream processing
● Beyond stream processing
○ Batch
○ Hive and Data warehousing
○ AI/Machine Learning
○ Serverless

Flink at Alibaba
● Powers real time computations of all business units at Alibaba group
● Powers all search and recommendations, both online and offline
● Provided as a public cloud service on Alibaba Cloud

Singles' Day Global Shopping Festival on 11/11

Singles' Day Stats - Nov 11, 2019
Alibaba
○ GMV
■ $14 million in the first 21s
■ $1.4 billion in the first 96s
■ $38 billion in 24h
○ 982 PB data generated in total
○ 544,000 transactions/sec at peak
Flink @ Alibaba
○ 2 billion events/sec, 3 TB/sec - up 111% from 2018

Flink at Alibaba
Use Case 1: online ML

○ hundreds of millions of events
○ 100+ billion features
○ end-to-end latency in seconds
○ real-time training, feature and model updates

Flink at Alibaba
Use Case 2: Real Time GMV dashboard

Flink at Scale
Real Time AI / ML
Real Time Analytics
Real Time Fraud Detection/Risk Management
Real Time Dynamic Pricing

Flink at Scale
Real Time Compute Service on public cloud backed by Flink
Kinesis Analytics

Flink in a Nutshell
- key differentiators from other open source solutions
ancient squirrel from Ice Age!

Flink in a Nutshell
Stateful Computations …...

Why does (built-in) state matter?
● computation with context, rather than single record
● no network IO, lower latency
● no external dependency, full control by framework for
exactly-once semantics
● ...

Flink provides built-in state backends that support rich,
arbitrary data structures in a fully fault-tolerant way
○ In-memory, spillable backend
○ RocksDB backend
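
To make "computation with context" concrete, here is a minimal plain-Python sketch of a keyed running total. It is not Flink's API: the class `KeyedSum` and the sample events are hypothetical, and a dict stands in for a state backend.

```python
from collections import defaultdict


class KeyedSum:
    """Toy stand-in for keyed state: one running total per key, held locally."""

    def __init__(self):
        # Assumption: a plain dict plays the role of the state backend.
        self.state = defaultdict(float)

    def process(self, key, amount):
        # Update local state and emit the new total.
        # No external store is queried, so there is no network IO per record.
        self.state[key] += amount
        return key, self.state[key]


events = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]
op = KeyedSum()
results = [op.process(key, amount) for key, amount in events]
```

Each output depends on earlier records for the same key, which is the difference between stateful and record-at-a-time stateless processing.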

Stateful Computations over Event Streams…...
Flink in a Nutshell

It means a few things…
1. All your data is data streams!
○ batch vs. streaming - just execution models
○ bounded vs. unbounded data streams - key difference
○ technically all data processing is stream processing

2. Streaming-first, pipelined execution
○ records flow through the system one at a time
-> extremely high throughput, ultra low latency
○ fundamentally different from batch-first, staged-execution
frameworks
○ can't be achieved by a mini-batch workaround
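
The ordering implied by pipelined execution can be illustrated with Python generators. This is only a single-threaded sketch of the idea, not Flink's runtime; the function names and the `trace` list are hypothetical.

```python
trace = []  # records the order in which each step touches each record


def source(records):
    for r in records:
        trace.append(f"read {r}")
        yield r


def double(stream):
    for r in stream:
        trace.append(f"map {r}")
        yield r * 2


def sink(stream):
    out = []
    for r in stream:
        trace.append(f"emit {r}")
        out.append(r)
    return out


# Each record travels through every operator before the next record is read.
pipelined = sink(double(source([1, 2])))
```

In a staged execution the trace would instead show all reads, then all maps, then all emits, so the first result could not appear until the whole input had been read.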

3. Events come with time!
○ event time vs. processing time
○ windowed aggregation, sessionization, time-based joins
○ out-of-order and late data
Flink has natively supported comprehensive time semantics
from the beginning
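
A rough plain-Python sketch of event-time tumbling windows with a watermark follows. It is a simplification for illustration, not Flink's windowing API: the function name, the "watermark = largest timestamp seen minus a fixed bound" rule, and the sample events are all assumptions.

```python
from collections import defaultdict


def tumbling_window_sums(events, size, max_out_of_orderness):
    """Sum values per event-time window; fire a window once the watermark passes its end."""
    windows = defaultdict(int)      # window start -> running sum
    fired, late = [], []
    watermark = float("-inf")
    for ts, value in events:
        start = ts - ts % size
        if start + size <= watermark:
            # The window for this event has already been finalized: late data.
            late.append((ts, value))
            continue
        windows[start] += value
        # Assumed watermark rule: largest timestamp seen minus a fixed bound.
        watermark = max(watermark, ts - max_out_of_orderness)
        for s in sorted(windows):
            if s + size <= watermark:
                fired.append((s, windows.pop(s)))
    return fired, late, dict(windows)


# Timestamps arrive out of order; (3, 1) arrives after its window has fired.
events = [(1, 1), (12, 1), (4, 1), (15, 1), (3, 1), (27, 1)]
fired, late, pending = tumbling_window_sums(events, size=10, max_out_of_orderness=5)
```

The out-of-order event at timestamp 4 is still counted in the first window because the watermark had not yet passed that window's end, while the event at timestamp 3 is reported as late.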

Stateful Computations over Event Streams
in an Expressive …...
Flink in a Nutshell

in an Expressive …...
Layered APIs with the most comprehensive semantics

in an Expressive, Scalable …...
Flink in a Nutshell

in an Expressive, Scalable …...
● Horizontally scalable
● Battle tested
○ trillions of records per day
○ terabytes of state
○ run on thousands of cores

in an Expressive, Scalable, Operational-focused
…...
Flink in a Nutshell

in an Expressive, Scalable, Operational-focused
…...
● Deploy anywhere
○ kubernetes, yarn, mesos, standalone
● Deploy flexibly
○ per-job mode, session mode
● Highly available
○ with HA setup

in an Expressive, Scalable, Operational-focused,
Exactly-Once and Fault Tolerant way
Flink in a Nutshell

in an Expressive, Scalable, Operational-focused,
● Exactly-once state consistency
● End-to-end exactly-once consistency with connectors
that support transactions
● Checkpoint/Savepoint
○ taken on the fly, without sacrificing performance

in an Expressive, Scalable, Operational-focused,

Going Beyond Stream Processing
● Batch -> Unified Data Processing
● Data Warehousing and Notebook
● Machine Learning / AI / DL
● Serverless

Why Unified Streaming and Batch Data Processing?
Recap the lambda architecture ...
[Diagram: MQ / Pub-Sub, HDFS / S3, Stream Processing (online), Batch Processing (offline), Combine Results (serving)]
○ infra: high operation cost
○ dev: costly maintenance, and two or more stacks to learn
○ business: hard to keep both paths in sync and guarantee consistent logic

Why Flink?
Flink’s philosophy:
streaming first, with batch as a special case of streaming
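
One way to picture "batch as a special case of streaming" is a single operator that runs unchanged on a bounded and an unbounded input. The sketch below uses plain Python iterators, not Flink's API; `running_max` and the inputs are hypothetical.

```python
import itertools


def running_max(stream):
    """Emit the maximum seen so far after every record."""
    best = None
    for x in stream:
        if best is None or x > best:
            best = x
        yield best


# Bounded input: the stream ends, so the last emitted value is the batch result.
bounded = list(running_max([3, 1, 4, 1, 5]))
batch_result = bounded[-1]

# Unbounded input: the same operator, but results can only be consumed incrementally.
unbounded = running_max(itertools.count(1))
first_three = list(itertools.islice(unbounded, 3))
```

The operator logic is identical in both cases; only the boundedness of the input decides whether a final answer exists.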

State-of-the-Art Batch Processing on a Stream Processor
[Diagram: Flink 1.8 and earlier vs. Flink 1.9 onward]

* Blink is an internal fork of Flink at Alibaba. Blink has been part of Flink since release 1.9.
[Chart: performance of Blink versus Spark in the TPC-DS benchmark, aggregate time for all queries together.
From a presentation by Xiaowei Jiang at Flink Forward Beijing, 2018.]

Data Warehousing
Initial integration in Flink 1.9 for Hive 2.3.4 and 1.2.1
Full integration (read, write, UDF) in Flink 1.10 for all Hive 1.x, 2.x, and 3.x

Machine Learning & AI
Recap the “lambda” architecture, again, for ML
[Diagram: MQ / Pub-Sub, HDFS / S3, Preprocessing, Online Training, Offline Training, Dynamic Model, Static Model, Model Validation, Inference]

Flink is popular for online ML now
[Diagram: MQ / Pub-Sub, HDFS / S3, Preprocessing, Online Training, Offline Training, Dynamic Model, Static Model, Model Validation, Inference]

Streaming-Batch Unified ML
Use Flink everywhere to reduce maintenance and operation cost
[Diagram: MQ / Pub-Sub, HDFS / S3, Preprocessing, Online Training, Offline Training, Dynamic Model, Static Model, Model Validation, Inference]

Machine Learning
ML Stage: Data Acquisition, Preprocessing, Model Training, Model Validation & Serving, Inference
ML Flow: [Diagram: MQ / Pub-Sub, HDFS / S3, Preprocessing, Online Training, Offline Training, Dynamic Model, Static Model, Model Validation, Inference]
Efforts & Requirements:
● Rich Connectors
● Dataset Management
● Stream-Batch unification
● Strong API & SQL Support
● Enhanced Iteration
● Flink ML lib
● DL on Flink (TF, PyTorch)
● Model Serving
● Model Registry & Management
● Rollout/Rollback
● Online Evaluation
● Flink ML Pipeline
● Python API support

Flink ML Libs
● Completely rewritten
● Based on ML pipeline, powered by Table API
● Battle tested algorithms
○ K-means
○ Naive Bayes
○ Linear regression
○ GBDT
○ Decision tree
○ PCA
○ Random forest
○ Correlation
○ ...

Flink ML Pipeline
Two types of operators
● Transformer: data -> data
● Estimator: data -> model
training: input table 1 -> Estimator Pipeline -> Model Pipeline
pipeline.fit(input1)
inference: input table 2 -> Model Pipeline -> result table
pipeline.transform(input2)
[Diagram stages: Transformer, Estimator, Model]
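
The Transformer/Estimator split can be sketched in plain Python as below. This is not the Flink ML API: the classes `Scale`, `MeanCenter`, `MeanCenterModel` and this `Pipeline` are hypothetical, and lists of numbers stand in for tables.

```python
class Scale:
    """Transformer: data -> data."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [x * self.factor for x in rows]


class MeanCenterModel:
    """A fitted model, which is itself a Transformer."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [x - self.mean for x in rows]


class MeanCenter:
    """Estimator: data -> model."""

    def fit(self, rows):
        return MeanCenterModel(sum(rows) / len(rows))


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        # Training: replace each Estimator by the model it produces.
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return Pipeline(fitted)

    def transform(self, rows):
        # Inference: apply every stage in order.
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


model_pipeline = Pipeline([Scale(2), MeanCenter()]).fit([1, 2, 3])
result = model_pipeline.transform([4, 5])
```

Fitting turns an estimator pipeline into a model pipeline made only of transformers, so the same preprocessing is applied at training and at inference time.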

Deep Learning Pipeline
Data Acquisition, Data Process & Transformation, Model Training, Model Validation, Model Serving
Parameter Tuning

Deep Learning Pipeline
[Diagram: Flink Cluster (source 1, source 2, join, udtf), External MQ / FS, TensorFlow Cluster (3 workers, 2 PS)]

Flink Deep Learning Pipeline
Data Acquisition, Data Process & Transformation, Model Training, Model Validation, Model Serving
Parameter Tuning

Flink + TensorFlow Integration
[Diagram, separate clusters: Flink Cluster (source 1, source 2, join, udtf), External MQ / FS, TensorFlow Cluster (3 workers, 2 PS)]
[Diagram, a single Flink Cluster: source 1, source 2, join, udtf, 3 workers, 2 PS]

DL on Flink ML Pipeline
[Diagram: source 1, source 2, join, udtf as Transformer; 3 workers, 2 PS as Estimator]

Serverless
Event-Driven, Function as a Service
Benefits:
● elastic
● lightweight
Challenges:
● state
○ consistency
○ IO
○ capacity
● hard to build complex applications

Event Driven, State Management, Composable
isn't that ...
Stream Processing!

Check out the Stateful Functions project, announced in
Oct 2019! https://statefun.io/
It officially became part of Apache Flink last
week.

Q & A
Email: user@flink.apache.org, dev@flink.apache.org
Twitter: @ApacheFlink @Bowen__Li
