Open Source 𝜆-Architecture for Deep Learning
Use case
Patrick R Nicolas
Oct. 2020
pnicolasai@yahoo.com
Overview
“… and the wise man said,
thou shall embrace open source”.
21st century proverb
Patrick R. Nicolas - Open Source 𝜆 -Architecture for Deep Learning
Overview
Overview
Layers
Open-source components
Overview
The world of data scientists accustomed to Python scientific libraries has been shaken up by the emergence of ‘big data’ frameworks such as Apache Hadoop, Spark and Kafka.
This presentation introduces a variant of the 𝜆 architecture and describes the seamless integration of various open source components to train, validate and test deep learning models.
Disclaimer
The concept and architecture are versatile enough to accommodate a variety of open source and commercial solutions and services besides the frameworks prescribed in this presentation.
For instance, deep learning frameworks such as Keras or TensorFlow are excellent alternatives to PyTorch.
Requirements
A ‘big data’ framework should be able to:
• Process batch and stream data concurrently
• Enforce data immutability
• Recover gracefully from human errors
• Handle hardware failures
• Minimize latency for real-time requests
• Scale to very large data sets
• Optimize the full life cycle of data sets
• Guarantee quality and integrity of data
Optimizing data life cycle
The need for optimizing the data life cycle: 79% of data scientists' time is spent collecting and organizing data.
Source: Quora
Data quality
Guaranteeing data quality and integrity
Accuracy: Correct models and representative data
Completeness: No missing data
Consistency: Applied to semantics and format
Timeliness: Up-to-date data and notification
Accessibility: Ease of use and high availability
Validity: Compliance with constraints, rules and regulations
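Some of these dimensions can be checked mechanically. The following is a minimal sketch, assuming records are plain dictionaries; the field names ("id", "age", "updated"), the age range and the freshness window are hypothetical and only illustrate completeness, validity and timeliness checks.

```python
from datetime import datetime, timedelta

def check_record(record, required, now, max_age):
    """Return the list of quality dimensions a record violates."""
    issues = []
    # Completeness: every required field must be present and non-null.
    if any(record.get(field) is None for field in required):
        issues.append("completeness")
    # Validity: a hypothetical constraint on the 'age' field.
    age = record.get("age")
    if age is not None and not (0 <= age <= 120):
        issues.append("validity")
    # Timeliness: the record must have been updated recently enough.
    updated = record.get("updated")
    if updated is not None and now - updated > max_age:
        issues.append("timeliness")
    return issues

now = datetime(2020, 10, 1)
good = {"id": 1, "age": 42, "updated": now - timedelta(hours=1)}
bad = {"id": None, "age": -5, "updated": now - timedelta(days=30)}

good_issues = check_record(good, ["id", "age"], now, timedelta(days=1))
bad_issues = check_record(bad, ["id", "age"], now, timedelta(days=1))
```

Here good_issues is empty, while bad_issues lists all three violated dimensions.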
Solution …
𝜆-architecture is a large-scale data processing architecture that balances batch and real-time streamed data.
It is a one-stop shop for various data sources that balances latency, redundancy, ease of access and throughput.
It breaks down into 3 layers:
• Speed (streaming, real-time, …)
• Batch (training, analysis, …)
• Serving (query, visualization, …)
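A minimal sketch of how the three layers cooperate, assuming an append-only event log and a simple count-per-key metric (both hypothetical, not taken from the slides): the batch layer recomputes its view from the immutable log, the speed layer keeps an incremental view of events received since the last batch run, and the serving layer merges the two at query time.

```python
from collections import Counter

class LambdaCounter:
    """Toy three-layer store that counts events per key (hypothetical metric)."""

    def __init__(self):
        self.master_log = []         # immutable, append-only master data set
        self.batch_view = Counter()  # rebuilt from scratch by the batch layer
        self.speed_view = Counter()  # incremental view of events not yet batched

    def ingest(self, key):
        # New data is only appended; existing records are never modified.
        self.master_log.append(key)
        self.speed_view[key] += 1    # speed layer: low-latency incremental update

    def run_batch(self):
        # Batch layer: recompute the view from the full log, reset the speed view.
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, key):
        # Serving layer: merge the batch view with the real-time view.
        return self.batch_view[key] + self.speed_view[key]

store = LambdaCounter()
for event in ["click", "view", "click"]:
    store.ingest(event)
store.run_batch()
store.ingest("click")
clicks = store.query("click")  # two from the batch view plus one from the speed view
```

Because the batch view is always rebuilt from the immutable log, a view corrupted by a faulty job can be discarded and recomputed, which is one way to meet the "recover gracefully from human errors" requirement.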
… using open source
𝜆 architecture using open source components?
The task consists of reviewing and evaluating the trove of available open source libraries to build a robust architecture that supports the rigor of training and tuning deep learning models.
The libraries are woven together through a set of language-agnostic REST APIs to form a coherent pipeline.
… for deep learning
𝜆 architecture for deep learning?
• Python scientific libraries have been the go-to tools for data scientists to analyze data and build models.
• The PyTorch framework builds on these libraries to support the design and execution of deep learning models.
• Apache Spark and Kafka complement these frameworks for very large data sets and real-time processing.
Bird's-eye view
Example open source 𝜆 architecture
Feel overwhelmed? ... Let’s break it down
Layers
Batch layer
Batch layer objective: load batches of data to be distributed and preprocessed to train deep learning models.
Typical use case:
1. Apache Spark loads the training set from Amazon S3
2. Spark master partitions the training data
3. Spark workers preprocess data and notify completion through the Kafka event queue
4. PyTorch updates model parameters from the preprocessed training data
5. PyTorch broadcasts model parameters and quality metrics through Kafka
6. Apache Hive, powered by Spark, stores model-related data and metrics
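The flow of steps 1 to 5 can be sketched with standard-library stand-ins. This is an illustration under stated assumptions, not the Spark, Kafka or PyTorch API: a queue.Queue plays the role of the Kafka event queue, plain functions play the role of Spark workers, and a hand-written stochastic gradient descent loop on a linear model replaces the PyTorch training step. All function names and the toy data set are hypothetical.

```python
import queue

events = queue.Queue()  # stand-in for a Kafka topic (assumption)

def partition(records, num_partitions):
    # Step 2: split the training set into roughly equal partitions.
    return [records[i::num_partitions] for i in range(num_partitions)]

def preprocess(part, scale):
    # Step 3: a worker rescales the feature, then notifies completion.
    cleaned = [(x / scale, y) for x, y in part]
    events.put({"type": "preprocessed", "count": len(cleaned)})
    return cleaned

def train(samples, epochs=500, lr=0.1):
    # Step 4: fit y = w * x + b by stochastic gradient descent
    # (stand-in for a PyTorch training loop).
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    mse = sum(((w * x + b) - y) ** 2 for x, y in samples) / len(samples)
    # Step 5: broadcast model parameters and a quality metric.
    events.put({"type": "model", "w": w, "b": b, "mse": mse})
    return w, b, mse

# Step 1: the slides load the training set from Amazon S3; here it is a
# small in-memory list generated from y = 2x + 1.
raw = [(float(x), 2.0 * x + 1.0) for x in range(10)]
parts = partition(raw, num_partitions=3)
samples = [row for part in parts for row in preprocess(part, scale=10.0)]
w, b, mse = train(samples)
```

Since the feature is divided by 10 during preprocessing, the fitted slope approaches 20 and the intercept approaches 1. Step 6 (persisting parameters and metrics in Hive) is omitted from the sketch.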
Speed layer
Speed layer objective: process queries to predictive
models with very low latency.
Use case:
1. Kafka routes data streams to the Spark master
2. Spark preprocesses requests and forwards them to the deep model micro-service
3. Flask converts each request into a prediction query to the PyTorch model
4. The PyTorch model generates a prediction
5. Run-time metrics are broadcast through Kafka
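Steps 3 to 5 can be sketched as a single request handler. This is a standard-library illustration, not Flask or PyTorch code: in a Flask service a function like handle_request would be the body of a route handler, model_predict stands in for a trained PyTorch model with hypothetical weights, and the metrics list stands in for a Kafka metrics topic.

```python
import json
import time

metrics = []  # stand-in for a Kafka metrics topic (assumption)

def model_predict(features):
    # Stand-in for a trained PyTorch model (hypothetical weights).
    weights = [0.5, -0.25, 1.0]
    return sum(w * x for w, x in zip(weights, features))

def handle_request(body):
    """Convert a JSON request into a prediction query and return a JSON reply."""
    start = time.perf_counter()
    payload = json.loads(body)
    features = [float(v) for v in payload["features"]]
    prediction = model_predict(features)
    # Step 5: record a run-time metric for every request served.
    metrics.append({"latency_s": time.perf_counter() - start})
    return json.dumps({"prediction": prediction})

reply = handle_request('{"features": [2, 4, 1]}')
```

Keeping the handler a pure function of the request body makes it easy to test without starting a web server.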
Serving layer
Serving layer objective: process queries to analyze data and model performance, and execute statistical inference.
Use case:
1. Analyst queries the MySQL relational database for the most recent data and statistics using the FineReport UI (low latency)
2. Analyst asynchronously queries the Hive data warehouse for archived data and statistics (high latency)
3. Hive processes queries through Spark datasets
4. Spark regularly updates the short-term data in MySQL
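The split between a low-latency short-term store and a high-latency archive can be sketched as follows. The example is a stand-in under stated assumptions: an in-memory sqlite3 database plays the role of MySQL, a Python list plays the role of the Hive warehouse, and the 7-day retention window, table layout and function names are hypothetical.

```python
import sqlite3

RETENTION_DAYS = 7  # hypothetical cut-off between short-term and archived data

# sqlite3 stands in for MySQL (short-term store); a list stands in for Hive.
hot = sqlite3.connect(":memory:")
hot.execute("CREATE TABLE metrics (day INTEGER, accuracy REAL)")
archive = [(day, 0.80 + 0.001 * day) for day in range(30)]

def refresh_hot_store(today):
    # Step 4: periodically copy the most recent rows into the short-term store.
    recent = [row for row in archive if today - row[0] < RETENTION_DAYS]
    hot.execute("DELETE FROM metrics")
    hot.executemany("INSERT INTO metrics VALUES (?, ?)", recent)
    hot.commit()

def query_accuracy(day, today):
    # Steps 1 and 2: recent days are answered by the low-latency store,
    # older days by the slower archive.
    if today - day < RETENTION_DAYS:
        row = hot.execute(
            "SELECT accuracy FROM metrics WHERE day = ?", (day,)
        ).fetchone()
        return ("hot", row[0] if row else None)
    for archived_day, accuracy in archive:
        if archived_day == day:
            return ("archive", accuracy)
    return ("archive", None)

refresh_hot_store(today=29)
source_recent, _ = query_accuracy(day=28, today=29)
source_old, _ = query_accuracy(day=3, today=29)
```

With today set to 29, day 28 is served from the short-term store and day 3 from the archive.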
Open-source components
PyTorch
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
It builds on concepts familiar from NumPy and scikit-learn to support the training, evaluation and commercialization of complex machine learning models.
https://pytorch.org/tutorials/
Alternatives:
TensorFlow: https://www.tensorflow.org/
Keras: https://keras.io
MXNet: https://mxnet.apache.org
Apache Spark
Apache Spark is an open source cluster computing framework for fast, large-scale data processing, both batch and streaming.
It supports the Scala, Java, Python and R programming languages and includes streaming, graph and machine learning libraries.
https://www.scala-lang.org
https://spark.apache.org
Python API:
PySpark: https://databricks.com/glossary/pyspark
Streaming
Apache Kafka is an open-source distributed event streaming framework for large-scale, real-time data processing and analytics.
It captures data from various sources in real time as a continuous flow and routes it to the appropriate processor.
https://kafka.apache.org
Alternatives:
Amazon SQS: https://aws.amazon.com/sqs/
RabbitMQ: https://www.rabbitmq.com
Model tuning
Ray Tune is a distributed hyperparameter tuning framework particularly suited to deep learning models.
It significantly reduces the cost of optimizing the configuration of a model. It is a wrapper around other open source libraries.
https://docs.ray.io/en/master/tune/index.html
Alternatives:
Amazon SageMaker: https://aws.amazon.com/sagemaker/
HyperOpt: https://github.com/hyperopt/hyperopt
Optuna: https://optuna.readthedocs.io
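To make the idea of hyperparameter tuning concrete, here is a minimal random search in plain Python. It is not the Ray Tune API: the objective function, the search space and the function names are hypothetical, and a distributed tuner would run trials like these in parallel across a cluster instead of in a loop.

```python
import random

SEARCH_SPACE = [16, 32, 64, 128]  # hypothetical hidden-layer sizes

def validation_loss(lr, hidden):
    # Hypothetical objective standing in for "train a model, return its loss".
    return (lr - 0.01) ** 2 * 1e4 + (hidden - 64) ** 2 / 1e3

def random_search(num_trials, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(num_trials):
        config = {
            "lr": 10 ** rng.uniform(-4, -1),     # log-uniform learning rate
            "hidden": rng.choice(SEARCH_SPACE),
        }
        loss = validation_loss(**config)
        # Keep the configuration with the lowest loss seen so far.
        if best is None or loss < best[0]:
            best = (loss, config)
    return best

best_loss, best_config = random_search(num_trials=50)
```

The learning rate is sampled on a logarithmic scale, a common choice when plausible values span several orders of magnitude.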
Python REST service
Flask is an easy-to-use web framework for exposing Python applications through a RESTful interface.
It works with most web and deployment standards and tools, such as Docker, React.js, Angular, HTML5 and WSGI containers.
https://palletsprojects.com/p/flask/
Alternatives:
Falcon: https://falcon.readthedocs.io
FastAPI: https://fastapi.tiangolo.com
RDBMS
MySQL is an open source relational database that supports partitioning, sharding and replication. It can be extended with real-time analytics (HeatWave) and enterprise clustering (CGE).
https://www.mysql.com
Alternatives:
PostgreSQL: https://www.postgresql.org
HyperSQL: http://www.hsqldb.org
Amazon RDS: http://aws.amazon.com/rds
Data warehouse
Apache Hive is a data warehouse framework that leverages Spark to execute large-scale distributed SQL queries.
It optimizes SQL queries through lazy evaluation of an acyclic execution graph. It is integrated with Spark datasets and HDFS.
https://hive.apache.org
Alternatives:
Vertica: http://www.vertica.com
Amazon Redshift: https://aws.amazon.com/redshift/
Dashboard
FineReport is a business intelligence and dashboard tool that supports real-time analytics, reporting and visualization. It accommodates the needs of business managers and data scientists.
https://www.finereport.com
Alternatives:
Sisense: https://www.sisense.com
Tableau: https://www.tableau.com
Final disclaimer
This presentation is not an endorsement of the various tools, libraries or frameworks it describes or suggests.
Although the tools listed in the slides are known to work in the context of the architecture, there are excellent alternative libraries that may better meet your specific needs.
Thank you!
Q&A
