A CD Framework For Data Pipelines
Yaniv Rodenski
@YRodenski
yaniv@apache.org
Archetypes of Data Pipeline Builders

Data People (Data Scientists/Analysts/BI Devs)
• Exploratory workloads
• Data centric
• Simple deployment

Software Developers
• Code centric
• Heavy on methodologies
• Heavy tooling
• Very complex deployment

"Scientists": data scientists deploying to production
Making Big Data Teams Scale
• Scaling teams is hard
• Scaling Big Data teams is harder
• Different mentalities between data professionals and engineers
• Mixture of technologies
• Data as the integration point
• Often schema-less
• Lack of tools
What Do We Need to Deploy Our Apps?
• A source control system: Git, Hg, etc.
• A CI process to integrate code, run tests and package the app
• A repository to store the packaged app
• A repository to store configuration
• An API/DSL to configure the underlying framework
• A mechanism to monitor the behaviour and performance of the app
How can we apply these techniques to Big Data applications?
Who are we?
Software developers with years of Big Data experience

What do we want?
A simple and robust way to deploy Big Data pipelines

How will we get it?
Write tens of thousands of lines of code in Scala
Amaterasu - Simple, Continuously Deployed Data Apps
• Big Data apps in multiple frameworks
• Multiple languages:
  • Scala
  • Python
  • SQL
• Pipeline deployments are defined as YAML
• Simple to write, easy to deploy
• Reliable execution
• Multiple environments
Amaterasu Repositories
• Jobs are defined in repositories
• Current implementation - git repositories
  • tarball support is planned for a future release
• Repo structure (see the sketch below):
  • maki.yml - the workflow definition
  • src - a folder containing the actions (Spark scripts, etc.) to be executed
  • env - a folder containing configuration per environment
  • deps - dependencies configuration
• Benefits of using git:
  • Tooling
  • Branching
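
Putting that structure together, a minimal job repository might look like the sketch below (the script names are illustrative, borrowed from the examples later in this deck):

amaterasu-job/
├── maki.yml              # workflow definition
├── src/
│   ├── file.scala        # action scripts
│   └── file2.py
├── env/
│   ├── production/
│   │   └── job.yml       # per-environment configuration
│   └── dev/
│       └── job.yml
└── deps/                 # dependencies configuration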
Pipeline DSL - maki.yml (Version 0.2.0)

---
job-name: amaterasu-test
flow:
  - name: start
    runner:
      group: spark
      type: scala
    file: file.scala
    exports:
      odd: parquet
  - name: step2
    runner:
      group: spark
      type: pyspark
    file: file2.py
    error:
      name: handle-error
      runner:
        group: spark
        type: scala
      file: cleanup.scala
…
• Actions are the components of the pipeline
• exports - data structures to be used in downstream actions
• error - error handling actions

Amaterasu is not a workflow engine, it's a deployment tool that understands that Big Data applications are rarely deployed independently of other Big Data applications.

Pipeline != Workflow
Pipeline DSL (Version 0.3.0)

---
job-name: amaterasu-test
type: long-running
def:
  - name: start
    type: long-running
    runner:
      group: spark
      type: scala
    file: file.scala
    exports:
      odd: parquet
  - name: step2
    type: scheduled
    schedule: 10 * * * *
    runner:
      group: spark
      type: pyspark
    artifact:
      groupId: io.shonto
      artifactId: mySparkStreaming
      version: 0.1.0
…
• Scheduling is defined using cron format (e.g. 10 * * * * runs at minute 10 of every hour)
• In version 0.3.0, both pipelines and actions can be either long-running or scheduled
• Actions can be pulled from other applications or git repositories
Actions DSL (Spark)
• Your Scala/Python/SQL Spark code (more languages in the future; R is in the works)
• A few changes:
  • Don't create a new sc/sqlContext; use the ones in scope, or access them via AmaContext.spark, AmaContext.sc and AmaContext.sqlContext
  • AmaContext.getDataFrame is used to access data from previously executed actions
Actions DSL - Spark Scala

Action 1 ("start") - file.scala:

import org.apache.amaterasu.runtime._

val data = Array(1, 2, 3, 4, 5)
val rdd = AmaContext.sc.parallelize(data)
val odd = rdd.filter(n => n % 2 != 0).toDF()

Action 2:

import org.apache.amaterasu.runtime._

val highNoDf = AmaContext.getDataFrame("start", "odd")
  .where("_1 > 3")
highNoDf.write.json("file:///tmp/test1")

maki.yml entry for the exporting action:

- name: start
  runner:
    group: spark
    type: scala
  file: file.scala
  exports:
    odd: parquet
Actions DSL - PySpark

Action 1 ("start") - file.py:

data = range(1, 1000)
rdd = ama_context.sc.parallelize(data)
odd = rdd.filter(lambda n: n % 2 != 0) \
         .map(row) \
         .toDF()

Action 2:

high_no_df = ama_context \
    .get_dataframe("start", "odd") \
    .where("_1 > 100")
high_no_df.write.save("file:///tmp/test1", format="json")

maki.yml entry for the exporting action:

- name: start
  runner:
    group: spark
    type: pyspark
  file: file.py
  exports:
    odd: parquet
Actions DSL - SparkSQL

select * from ama_context.start_odd
where _1 > 100

maki.yml entry:

- name: action2
  runner:
    group: spark
    type: sql
  file: file.sql
  exports:
    high_no: parquet
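
Because action2 exports high_no, any downstream action can pick it up with the same AmaContext API used above. A minimal sketch of such a follow-up Scala action (the action itself is hypothetical; the API calls are the ones shown in this deck):

import org.apache.amaterasu.runtime._

// Read the dataframe exported as "high_no" by the action named "action2"
val highNo = AmaContext.getDataFrame("action2", "high_no")
highNo.write.json("file:///tmp/high_no") // illustrative output path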
Environments
• Configuration is stored per environment
• Stored as YAML files in an environment folder
• Contains:
  • Input/output paths
  • Working directory
  • User-defined key-values
env/production/job.yml:

name: default
master: mesos://prdmsos:5050
inputRootPath: hdfs://prdhdfs:9000/user/amaterasu/input
outputRootPath: hdfs://prdhdfs:9000/user/amaterasu/output
workingDir: alluxio://prdalluxio:19998/
configuration:
  spark.cassandra.connection.host: cassandraprod
  sourceTable: documents
env/dev/job.yml:

name: test
master: local[*]
inputRootPath: file:///tmp/input
outputRootPath: file:///tmp/output
workingDir: file:///tmp/work/
configuration:
  spark.cassandra.connection.host: 127.0.0.1
  sourceTable: documents
Environments in the Actions DSL

import org.apache.amaterasu.runtime._

val highNoDf = AmaContext.getDataFrame("start", "x")
  .where("_1 > 3")
highNoDf.write.json(Env.outputPath)
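
The point of Env is that action code stays environment-agnostic: the same job writes to file:///tmp/output when run against env/dev and to HDFS when run against env/production. A sketch assuming Env also exposes the configured input root (Env.inputPath is an assumption; only Env.outputPath appears in this deck):

import org.apache.amaterasu.runtime._

// Paths resolve from env/<environment>/job.yml at runtime
val docs = AmaContext.spark.read.parquet(Env.inputPath) // Env.inputPath is assumed, not confirmed API
docs.where("_1 > 3").write.json(Env.outputPath)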
Demo time
Version 0.2.0-incubating main features
• YARN support
• Spark SQL and PySpark support
• Extended environments to support:
  • Pure YAML configuration (configuration used to be JSON)
  • Full Spark configuration:
    • spark.yml - supports all Spark configurations
    • spark_exec_env.yml - for configuring Spark executor environments
• SDK preview - for building framework integrations
Future Development
• Long-running pipelines and streaming support
• Better tooling:
  • ama-cli
  • Web console
• Other frameworks: Presto, TensorFlow, Apache Flink, Apache Beam, Hive
• SDK improvements
Getting Started

Website: http://amaterasu.incubator.apache.org
GitHub: https://github.com/apache/incubator-amaterasu
Mailing List: dev@amaterasu.incubator.apache.org
Slack: http://apacheamaterasu.slack.com
Twitter: @ApacheAmaterasu
Thank you!
@YRodenski
yaniv@apache.org
