A CD Framework For Data Pipelines
Yaniv Rodenski
@YRodenski
yaniv@apache.org
Archetypes of Data Pipeline Builders

Data People (Data Scientists/Analysts/BI Devs)
• Exploratory workloads
• Data centric
• Simple deployment

Software Developers
• Code centric
• Heavy on methodologies
• Heavy tooling
• Very complex deployment

"Scientists": data scientists deploying to production
Making Big Data Teams Scale
• Scaling teams is hard
• Scaling Big Data teams is harder
• Different mentalities between data professionals and engineers
• Mixture of technologies
• Data as the integration point
• Often schema-less
• Lack of tools
What Do We Need to Deploy Our Apps?
• A source control system: Git, Hg, etc.
• A CI process to integrate code, run tests and package the app
• A repository to store the packaged app
• A repository to store configuration
• An API/DSL to configure the underlying framework
• A mechanism to monitor the behaviour and performance of the app
How can we apply these techniques to Big Data applications?
Who are we?
Software developers with years of Big Data experience

What do we want?
A simple and robust way to deploy Big Data pipelines

How will we get it?
Write tens of thousands of lines of code in Scala
Amaterasu - Simple, Continuously Deployed Data Apps
• Big Data apps in multiple frameworks
• Multiple languages:
  • Scala
  • Python
  • SQL
• Pipeline deployments are defined as YAML
• Simple to write, easy to deploy
• Reliable execution
• Multiple environments
Amaterasu Repositories
• Jobs are defined in repositories
• Current implementation - git repositories
  • tarball support is planned for a future release
• Repo structure (see the sketch below):
  • maki.yml - the workflow definition
  • src - a folder containing the actions (Spark scripts, etc.) to be executed
  • env - a folder containing configuration per environment
  • deps - dependencies configuration
• Benefits of using git:
  • Tooling
  • Branching
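
Putting that structure together, a minimal job repository might look like the sketch below (the script names are illustrative, borrowed from the examples later in this deck):

amaterasu-job/
├── maki.yml              # workflow definition
├── src/
│   ├── file.scala        # action scripts
│   └── file2.py
├── env/
│   ├── production/
│   │   └── job.yml       # per-environment configuration
│   └── dev/
│       └── job.yml
└── deps/                 # dependencies configuration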
Pipeline DSL - maki.yml (Version 0.2.0)

---
job-name: amaterasu-test
flow:
  - name: start
    runner:
      group: spark
      type: scala
    file: file.scala
    exports:
      odd: parquet
  - name: step2
    runner:
      group: spark
      type: pyspark
    file: file2.py
    error:
      name: handle-error
      runner:
        group: spark
        type: scala
      file: cleanup.scala
…
• Actions are the components of the pipeline
• exports - data structures to be used in downstream actions
• error - error handling actions

Amaterasu is not a workflow engine, it's a deployment tool that understands that Big Data applications are rarely deployed independently of other Big Data applications.

Pipeline != Workflow
Pipeline DSL (Version 0.3.0)

---
job-name: amaterasu-test
type: long-running
def:
  - name: start
    type: long-running
    runner:
      group: spark
      type: scala
    file: file.scala
    exports:
      odd: parquet
  - name: step2
    type: scheduled
    schedule: 10 * * * *
    runner:
      group: spark
      type: pyspark
    artifact:
      groupId: io.shonto
      artifactId: mySparkStreaming
      version: 0.1.0
…
• Scheduling is defined using cron format (e.g. 10 * * * * runs at minute 10 of every hour)
• In version 0.3.0, both pipelines and actions can be either long-running or scheduled
• Actions can be pulled from other applications or git repositories
Actions DSL (Spark)
• Your Scala/Python/SQL Spark code (more languages in the future; R is in the works)
• A few changes:
  • Don't create a new sc/sqlContext; use the ones in scope, or access them via AmaContext.spark, AmaContext.sc and AmaContext.sqlContext
  • AmaContext.getDataFrame is used to access data from previously executed actions
Actions DSL - Spark Scala

Action 1 ("start") - file.scala:

import org.apache.amaterasu.runtime._

val data = Array(1, 2, 3, 4, 5)
val rdd = AmaContext.sc.parallelize(data)
val odd = rdd.filter(n => n % 2 != 0).toDF()

Action 2:

import org.apache.amaterasu.runtime._

val highNoDf = AmaContext.getDataFrame("start", "odd")
  .where("_1 > 3")
highNoDf.write.json("file:///tmp/test1")

maki.yml entry for the exporting action:

- name: start
  runner:
    group: spark
    type: scala
  file: file.scala
  exports:
    odd: parquet
Actions DSL - PySpark

Action 1 ("start") - file.py:

data = range(1, 1000)
rdd = ama_context.sc.parallelize(data)
odd = rdd.filter(lambda n: n % 2 != 0) \
         .map(row) \
         .toDF()

Action 2:

high_no_df = ama_context \
    .get_dataframe("start", "odd") \
    .where("_1 > 100")
high_no_df.write.save("file:///tmp/test1", format="json")

maki.yml entry for the exporting action:

- name: start
  runner:
    group: spark
    type: pyspark
  file: file.py
  exports:
    odd: parquet
Actions DSL - SparkSQL

select * from ama_context.start_odd
where _1 > 100

maki.yml entry:

- name: action2
  runner:
    group: spark
    type: sql
  file: file.sql
  exports:
    high_no: parquet
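
Because action2 exports high_no, any downstream action can pick it up with the same AmaContext API used above. A minimal sketch of such a follow-up Scala action (the action itself is hypothetical; the API calls are the ones shown in this deck):

import org.apache.amaterasu.runtime._

// Read the dataframe exported as "high_no" by the action named "action2"
val highNo = AmaContext.getDataFrame("action2", "high_no")
highNo.write.json("file:///tmp/high_no") // illustrative output path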
Environments
• Configuration is stored per environment
• Stored as YAML files in an environment folder
• Contains:
  • Input/output paths
  • Working directory
  • User-defined key-values
env/production/job.yml:

name: default
master: mesos://prdmsos:5050
inputRootPath: hdfs://prdhdfs:9000/user/amaterasu/input
outputRootPath: hdfs://prdhdfs:9000/user/amaterasu/output
workingDir: alluxio://prdalluxio:19998/
configuration:
  spark.cassandra.connection.host: cassandraprod
  sourceTable: documents
env/dev/job.yml:

name: test
master: local[*]
inputRootPath: file:///tmp/input
outputRootPath: file:///tmp/output
workingDir: file:///tmp/work/
configuration:
  spark.cassandra.connection.host: 127.0.0.1
  sourceTable: documents
Environments in the Actions DSL

import org.apache.amaterasu.runtime._

val highNoDf = AmaContext.getDataFrame("start", "x")
  .where("_1 > 3")
highNoDf.write.json(Env.outputPath)
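
The point of Env is that action code stays environment-agnostic: the same job writes to file:///tmp/output when run against env/dev and to HDFS when run against env/production. A sketch assuming Env also exposes the configured input root (Env.inputPath is an assumption; only Env.outputPath appears in this deck):

import org.apache.amaterasu.runtime._

// Paths resolve from env/<environment>/job.yml at runtime
val docs = AmaContext.spark.read.parquet(Env.inputPath) // Env.inputPath is assumed, not confirmed API
docs.where("_1 > 3").write.json(Env.outputPath)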
Demo time
Version 0.2.0-incubating main features
• YARN support
• Spark SQL and PySpark support
• Extended environments to support:
  • Pure YAML configuration (configuration used to be JSON)
  • Full Spark configuration:
    • spark.yml - supports all Spark configurations
    • spark_exec_env.yml - for configuring Spark executor environments
• SDK preview - for building framework integrations
Future Development
• Long-running pipelines and streaming support
• Better tooling:
  • ama-cli
  • Web console
• Other frameworks: Presto, TensorFlow, Apache Flink, Apache Beam, Hive
• SDK improvements
Getting Started

Website: http://amaterasu.incubator.apache.org
GitHub: https://github.com/apache/incubator-amaterasu
Mailing List: dev@amaterasu.incubator.apache.org
Slack: http://apacheamaterasu.slack.com
Twitter: @ApacheAmaterasu
Thank you!
@YRodenski
yaniv@apache.org
