Till Rohrmann
Flink PMC member
trohrmann@apache.org
@stsffap
Interactive Data Analysis with Apache Flink
Data Analysis
Exploratory Data Analysis
§  Visualize data
§  Calculate main characteristics
§  Understand the data and find possibly new hypotheses
Data Analysts
Read-Evaluate-Print Loop
§  New Scala shell offers REPL
§  Interactive queries
§  Lets you explore data quickly
Scala Shell
Simple Scala Shell Example
Problems
§  No visualization
§  No saving or replaying of written code
§  No assistance → Bad IDE
Notebooks
§  Web-based interactive computation environment
§  Combines rich text, executable code, plots and rich media
§  Storytelling
Apache Zeppelin
§  Web-based REPL with pluggable interpreters
§  Since 2014 in the Apache Incubator
§  Supported interpreters:
•  Flink
•  Spark
•  Python
•  Markdown
•  Many more …
Word Count with Zeppelin
§  Find the 10 most frequent words with more than 4 letters in the King James Version of the Bible.
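A minimal sketch of the task above, written in plain Python (standard library only) rather than the Flink API. The function name `top_words` and the sample sentence are assumptions made for illustration; a real run would read the full text of the King James Version instead.

```python
import re
from collections import Counter


def top_words(text, n=10, min_exclusive_length=4):
    """Return the n most frequent words that have more than
    min_exclusive_length letters, as (word, count) pairs."""
    # Lowercase the text and split it into purely alphabetic tokens.
    words = re.findall(r"[a-z]+", text.lower())
    # Count only the words that are long enough.
    counts = Counter(w for w in words if len(w) > min_exclusive_length)
    # most_common returns the pairs sorted by descending count.
    return counts.most_common(n)


# Hypothetical sample input, standing in for the full text.
sample = (
    "In the beginning God created the heaven and the earth. "
    "And the earth was without form"
)
print(top_words(sample, n=3))
```

The same shape (tokenize, filter, count, take the top entries) carries over to a distributed setting, where the count becomes a grouped aggregation.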
Linear regression
§  Let’s predict the influence of advertising spending on sales
§  Input data set: http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv
§  Features:
•  TV advertisement money
•  Radio advertisement money
•  Newspaper advertisement money
§  Response:
•  Sales
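To make the setup above concrete, here is a small sketch of multiple linear regression with three features and one response, fitted by batch gradient descent in plain Python. It is not the Flink code from the talk. The rows below are made-up values generated from a known linear model, not taken from Advertising.csv, and the names `fit_linear_regression` and `predict` are hypothetical.

```python
def predict(weights, bias, x):
    """Linear model: weighted sum of the features plus an intercept."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def fit_linear_regression(rows, targets, learning_rate=0.1, iterations=5000):
    """Fit weights and bias by batch gradient descent on the mean squared error."""
    n_features = len(rows[0])
    m = len(rows)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, targets):
            error = predict(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        # Step against the averaged gradient.
        weights = [w - learning_rate * g / m for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / m
    return weights, bias


# Made-up (TV, radio, newspaper) spending rows, for illustration only.
rows = [
    (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1),
    (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1),
]
# Made-up sales, generated as 2*TV + 1*radio + 0.5*newspaper + 1.
sales = [2 * tv + 1 * radio + 0.5 * news + 1 for tv, radio, news in rows]

weights, bias = fit_linear_regression(rows, sales)
print(weights, bias)
```

On real data the fitted weights indicate how strongly each advertising channel is associated with sales; features on very different scales would normally be standardized first so that gradient descent converges.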
Classification
§  Let’s build a classifier for insult detection
§  Kaggle challenge: https://www.kaggle.com/c/detecting-insults-in-social-commentary
§  Label: 1 – Insult, 0 – No insult
§  Feature: Comment text
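As an illustration of the label and feature setup above, the following sketch trains a tiny bag-of-words Naive Bayes classifier in plain Python. Naive Bayes is a stand-in chosen for brevity; the slides do not say which algorithm the talk used. The training comments are invented for this example and are not from the Kaggle data set, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the comment text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(comments, labels):
    """Count words per class. Labels: 1 = insult, 0 = no insult."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(comments, labels):
        word_counts[label].update(tokenize(text))
    vocabulary = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocabulary


def classify(model, text):
    """Return the label with the higher log-probability score."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                # Laplace (add-one) smoothed word likelihood.
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training comments, for illustration only.
comments = [
    "you are a stupid idiot",
    "what an idiot you are",
    "thanks for the helpful answer",
    "great point, I agree with you",
]
labels = [1, 1, 0, 0]

model = train_naive_bayes(comments, labels)
print(classify(model, "you idiot"))
```

With only four training comments the model is a toy; the point is the pipeline shape: tokenize the comment text, turn it into word counts, and learn a mapping from those counts to the 0/1 label.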
Conclusion
§  Interactive data analysis is really easy with Apache Flink
§  Apache Zeppelin is a great interactive notebook
§  Zeppelin and Flink play well together to solve machine learning tasks and more
flink.apache.org
@ApacheFlink

Interactive Data Analysis with Apache Flink @ Flink Meetup in Berlin