Engineering Machine Learning Data Pipelines
Tracking Data Lineage
Paige Roberts
Integrate Product Marketing Manager
Common Machine Learning Applications
Engineering Machine Learning Data Pipelines
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer
2
Data Scientist
Engineering Machine Learning Data Pipelines3
Data Engineer to the Rescue
• Expert in statistical analysis, machine learning
techniques, finding answers to business questions
buried in datasets.
• Does NOT want to spend 50 – 90% of their time
tinkering with data, getting it into good shape to
train models – but frequently does, especially if
there’s no data engineer on their team.
• When machine learning model is trained, tested,
and proven it will accomplish the goal, turns it over
to data engineer to productionize. Not skilled at
taking the model from a test sandbox into
production, especially not at large scale.
• Expert in data structures, data manipulation, and
constructing production data pipelines.
• WANTS to spend all of their time working with data,
but usually has more on their plate than they can
keep up with. Anything that will speed up their work
is helpful.
• In most successful companies, is involved from the
beginning. First gathers, cleans and standardizes
data, helps data scientist with feature engineering,
provides top notch data, ready to train models.
• After model is tested, builds robust high scale, data
pipelines to feed the models the data they need in
the correct format in production to provide ongoing
business value.
Data Engineer
Engineering Machine Learning Data Pipelines4
Five Big Challenges of Engineering ML Data Pipelines
1. Scattered and Difficult to Access Datasets
Much of the necessary data is trapped in mainframes or streams in from POS, web clicks, etc. all in
incompatible formats, making it difficult to gather and prepare the data for model training.
2. Data Cleansing at Scale
Data quality cleansing and preparation routines have to be reproduced at scale. Most data quality tools
are not designed to work on that scale of data.
3. Entity Resolution
Distinguishing matches across massive datasets that indicate a single specific entity (person, company,
product, etc.) requires sophisticated multi-field matching algorithms and a lot of compute power.
Essentially everything has to be compared to everything else.
4. Tracking Lineage from the Source
Data changes made to help train models have to be exactly duplicated in production, in order for models
to accurately make predictions on new data, and for required audit trails. Capture of complete lineage,
from source to end point is needed.
5
End-to-End Data Lineage
Data Sources Data Lake
Data
Onboard data, modify
on-the-fly to match
Hadoop storage model,
or store unchanged for
archive and compliance.
Access data from
streaming and batch
sources outside
cluster.
Transform, join,
cleanse, enhance
data in cluster
with MapReduce
or Spark.
6
End-to-End Data Lineage
Data Sources Data Lake
Data
Onboard data, modify
on-the-fly to match
Hadoop storage model,
or store unchanged for
archive and compliance.
Access data from
streaming and batch
sources outside
cluster.
Transform, join,
cleanse, enhance
data in cluster
with MapReduce
or Spark.
7
End-to-End Data Lineage
Data Sources
Pass source-to-cluster
data lineage info to
Navigator or Atlas.
Data Lake
Data
Data Lineage
REST
API
Onboard data, modify
on-the-fly to match
Hadoop storage model,
or store unchanged for
archive and compliance.
Access data from
streaming and batch
sources outside
cluster.
Transform, join,
cleanse, enhance
data in cluster
with MapReduce
or Spark.
8
End-to-End Data Lineage
Data Sources
Navigator or Atlas
gathers any other
changes made to
data on cluster.
Pass source-to-cluster
data lineage info to
Navigator or Atlas.
Data Lake
Data changes made
by MapReduce,
Spark, HiveQL.
Data
Data Lineage
REST
API
Onboard data, modify
on-the-fly to match
Hadoop storage model,
or store unchanged for
archive and compliance.
Access data from
streaming and batch
sources outside
cluster.
Transform, join,
cleanse, enhance
data in cluster
with MapReduce
or Spark.
9
End-to-End Data Lineage
Data Sources
Navigator or Atlas
gathers any other
changes made to
data on cluster.
Pass source-to-cluster
data lineage info to
Navigator or Atlas.
Data Lake
Data changes made
by MapReduce,
Spark, HiveQL.
Data
Data Lineage
REST
API
Onboard data, modify
on-the-fly to match
Hadoop storage model,
or store unchanged for
archive and compliance.
Access data from
streaming and batch
sources outside
cluster.
Transform, join,
cleanse, enhance
data in cluster
with MapReduce
or Spark.
Auditors
get end-to-end
data lineage.
Analytics,
visualizations, and
machine learning
algorithms get ALL
necessary data.
Analytics,
Visualization,
Machine
Learning
Complete
Data
10
Syncsort Published Lineage in Cloudera Navigator
Engineering Machine Learning Data Pipelines11
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage from the Source

Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage from the Source

  • 1.
    Engineering Machine LearningData Pipelines Tracking Data Lineage Paige Roberts Integrate Product Marketing Manager
  • 2.
    Common Machine LearningApplications Engineering Machine Learning Data Pipelines • Anti-money laundering • Fraud detection • Cybersecurity • Targeted marketing • Recommendation engine • Next best action • Customer churn prevention • Know your customer 2
  • 3.
    Data Scientist Engineering MachineLearning Data Pipelines3 Data Engineer to the Rescue • Expert in statistical analysis, machine learning techniques, finding answers to business questions buried in datasets. • Does NOT want to spend 50 – 90% of their time tinkering with data, getting it into good shape to train models – but frequently does, especially if there’s no data engineer on their team. • When machine learning model is trained, tested, and proven it will accomplish the goal, turns it over to data engineer to productionize. Not skilled at taking the model from a test sandbox into production, especially not at large scale. • Expert in data structures, data manipulation, and constructing production data pipelines. • WANTS to spend all of their time working with data, but usually has more on their plate than they can keep up with. Anything that will speed up their work is helpful. • In most successful companies, is involved from the beginning. First gathers, cleans and standardizes data, helps data scientist with feature engineering, provides top notch data, ready to train models. • After model is tested, builds robust high scale, data pipelines to feed the models the data they need in the correct format in production to provide ongoing business value. Data Engineer
  • 4.
    Engineering Machine LearningData Pipelines4 Five Big Challenges of Engineering ML Data Pipelines 1. Scattered and Difficult to Access Datasets Much of the necessary data is trapped in mainframes or streams in from POS, web clicks, etc. all in incompatible formats, making it difficult to gather and prepare the data for model training. 2. Data Cleansing at Scale Data quality cleansing and preparation routines have to be reproduced at scale. Most data quality tools are not designed to work on that scale of data. 3. Entity Resolution Distinguishing matches across massive datasets that indicate a single specific entity (person, company, product, etc.) requires sophisticated multi-field matching algorithms and a lot of compute power. Essentially everything has to be compared to everything else. 4. Tracking Lineage from the Source Data changes made to help train models have to be exactly duplicated in production, in order for models to accurately make predictions on new data, and for required audit trails. Capture of complete lineage, from source to end point is needed.
  • 5.
    5 End-to-End Data Lineage DataSources Data Lake Data Onboard data, modify on-the-fly to match Hadoop storage model, or store unchanged for archive and compliance. Access data from streaming and batch sources outside cluster. Transform, join, cleanse, enhance data in cluster with MapReduce or Spark.
  • 6.
    6 End-to-End Data Lineage DataSources Data Lake Data Onboard data, modify on-the-fly to match Hadoop storage model, or store unchanged for archive and compliance. Access data from streaming and batch sources outside cluster. Transform, join, cleanse, enhance data in cluster with MapReduce or Spark.
  • 7.
    7 End-to-End Data Lineage DataSources Pass source-to-cluster data lineage info to Navigator or Atlas. Data Lake Data Data Lineage REST API Onboard data, modify on-the-fly to match Hadoop storage model, or store unchanged for archive and compliance. Access data from streaming and batch sources outside cluster. Transform, join, cleanse, enhance data in cluster with MapReduce or Spark.
  • 8.
    8 End-to-End Data Lineage DataSources Navigator or Atlas gathers any other changes made to data on cluster. Pass source-to-cluster data lineage info to Navigator or Atlas. Data Lake Data changes made by MapReduce, Spark, HiveQL. Data Data Lineage REST API Onboard data, modify on-the-fly to match Hadoop storage model, or store unchanged for archive and compliance. Access data from streaming and batch sources outside cluster. Transform, join, cleanse, enhance data in cluster with MapReduce or Spark.
  • 9.
    9 End-to-End Data Lineage DataSources Navigator or Atlas gathers any other changes made to data on cluster. Pass source-to-cluster data lineage info to Navigator or Atlas. Data Lake Data changes made by MapReduce, Spark, HiveQL. Data Data Lineage REST API Onboard data, modify on-the-fly to match Hadoop storage model, or store unchanged for archive and compliance. Access data from streaming and batch sources outside cluster. Transform, join, cleanse, enhance data in cluster with MapReduce or Spark. Auditors get end-to-end data lineage. Analytics, visualizations, and machine learning algorithms get ALL necessary data. Analytics, Visualization, Machine Learning Complete Data
  • 10.
    10 Syncsort Published Lineagein Cloudera Navigator
  • 11.