TensorFlow Extended (TFX): An end-to-end machine learning platform for TensorFlow
Robert Crowe - Flink Forward San Francisco 2019
[Diagram: the ML code is only one small box in a production ML system, surrounded by Configuration, Data Collection, Data Verification, Feature Extraction, Process Management Tools, Analysis Tools, Machine Resource Management, Serving Infrastructure, and Monitoring.]
An ML pipeline is part of the solution to this problem.
[Diagram: pipeline components: Data Ingestion; Data Analysis + Validation; Feature Engineering; Trainer; Model Evaluation and Validation; Serving; Logging. Supporting layers: Shared Utilities for Garbage Collection, Data Access Controls; Pipeline Storage; Tuner; Shared Configuration Framework and Job Orchestration; Integrated Frontend for Job Management, Monitoring, Debugging, Data/Model/Evaluation Visualization.]
TensorFlow Extended (TFX) is an end-to-end ML pipeline for TensorFlow.
[Diagram: the same pipeline with TFX libraries filling the boxes: Data Ingestion; TensorFlow Data Validation; TensorFlow Transform; Estimator or Keras Model; TensorFlow Model Analysis; TensorFlow Serving; Logging. Supporting layers: Shared Utilities for Garbage Collection, Data Access Controls; Pipeline Storage; Tuner; Shared Configuration Framework and Job Orchestration; Integrated Frontend for Job Management, Monitoring, Debugging, Data/Model/Evaluation Visualization.]
Apache Beam
● A unified batch and stream distributed processing API
● A set of SDK frontends: Java, Python, Go, Scala, SQL, …
● A set of runners which can execute Beam jobs on various backends: Local, Apache Flink, Apache Spark, Apache Gearpump, Apache Samza, Apache Hadoop, Google Cloud Dataflow, …
Beam provides a comprehensive portability framework for data processing pipelines, which allows you to write your pipeline once in your language of choice and run it with minimal effort on your execution engine of choice.
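To make the portability concrete, here is a minimal sketch of a Beam word-count pipeline; the input path, the output prefix, and the choice of --runner flag (e.g. DirectRunner locally or FlinkRunner) are assumptions for illustration, not part of the talk.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
  # The pipeline is written once; the runner is picked at launch time,
  # e.g. --runner=DirectRunner (local) or --runner=FlinkRunner.
  options = PipelineOptions(argv)
  with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('input.txt')        # assumed input path
     | 'Split' >> beam.FlatMap(lambda line: line.split())
     | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
     | 'SumPerKey' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.Map(lambda kv: '{}: {}'.format(*kv))
     | 'Write' >> beam.io.WriteToText('counts'))          # assumed output prefix


if __name__ == '__main__':
  run()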
[Diagram: the TFX libraries (Data Ingestion, TensorFlow Data Validation, TensorFlow Transform, Estimator Model, TensorFlow Model Analysis, Honoring Validation Outcomes, TensorFlow Serving) realized as pipeline components: ExampleGen, StatisticsGen, SchemaGen, Example Validator, Transform, Trainer, Evaluator, Model Validator, Pusher, Model Server. The data-processing components are powered by Beam.]
Anatomy of a TFX component, using the Model Validator as the example:
● A packaged binary or container
● Well-defined inputs and outputs: the last validated model and the new (candidate) model go in, a validation outcome comes out
● Well-defined configuration
● Context from the Metadata Store
[Diagram: in context, the Trainer produces a new model; the Model Validator compares the new (candidate) model against the last validated model and emits a validation outcome; the Pusher honors that outcome and pushes the model to deployment targets:
TensorFlow Serving
TensorFlow Lite
TensorFlow JS
TensorFlow Hub]
Task-Aware Pipelines: components such as Transform and Trainer are chained by task, passing Input Data → Transformed Data → Trained Models → Deployment.
Task- and Data-Aware Pipelines: the pipeline additionally records its artifacts (training data, transformed data, trained models) in pipeline + metadata storage, so each run knows about the data and artifacts produced by previous runs.
The Metadata Store records:
● Type definitions of artifacts and their properties (e.g., models, data, evaluation metrics)
● Execution records (runs) of components (e.g., runtime configuration, inputs + outputs)
● Lineage tracking across all executions (e.g., to recurse back to all inputs of a specific model artifact that was created)

Use-cases enabled by lineage tracking:
● Compare previous model runs
● Carry over state from previous models
● Re-use previously computed outputs
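As a rough illustration of how that lineage data can be queried, here is a sketch using the ML Metadata (MLMD) client library; the SQLite path 'metadata.db' is an assumption, and artifact type names vary by TFX version.

from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect to a local SQLite-backed metadata store (path is an assumption).
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = 'metadata.db'
store = metadata_store.MetadataStore(config)

# Artifacts (models, data, statistics, ...) and executions (component runs)
# are recorded along with the events that link them, which is what enables
# "recurse back to all inputs of a specific artifact".
artifacts = store.get_artifacts()
executions = store.get_executions()
events = store.get_events_by_artifact_ids([a.id for a in artifacts])

for artifact in artifacts:
  print(artifact.id, artifact.uri)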
[Diagram: a pipeline of components. Each component has a Driver and a Publisher that communicate with the Metadata Store, and an Executor that does the actual work. A TFX Config defines the whole pipeline.]
[Diagram: inside the executors. Transform and the other data-processing components execute on Beam, which can run on Flink, Dataflow, and other runners; the Trainer's executor runs TensorFlow; components such as the Pusher have their own executors. In every case the component's Driver and Publisher record to the Metadata Store.]
def _create_pipeline():
  """Implements the chicago taxi pipeline with TFX."""
  examples = csv_input(os.path.join(data_root, 'simple'))
  example_gen = CsvExampleGen(input_base=examples)
  statistics_gen = StatisticsGen(input_data=...)
  infer_schema = SchemaGen(stats=...)
  validate_stats = ExampleValidator(stats=..., schema=...)

  # Performs transformations and feature engineering in training and serving
  transform = Transform(
      input_data=example_gen.outputs.examples,
      schema=infer_schema.outputs.output,
      module_file=_taxi_module_file)

  trainer = Trainer(...)
  model_analyzer = Evaluator(examples=..., model_exports=...)
  model_validator = ModelValidator(examples=..., model=...)
  pusher = Pusher(model_export=..., model_blessing=..., serving_model_dir=...)

  return [example_gen, statistics_gen, infer_schema, validate_stats, transform,
          trainer, model_analyzer, model_validator, pusher]


pipeline = AirflowDAGRunner(_airflow_config).run(_create_pipeline())
class Executor(base_executor.BaseExecutor):
  """Generic TFX statsgen executor."""
  ...

  def Do(...) -> None:
    """Computes stats for each split of input using tensorflow_data_validation.
    ...
    """
    with beam.Pipeline(argv=self._get_beam_pipeline_args()) as p:
      for split, instance in split_to_instance.items():
        ...
        output_path = os.path.join(output_uri, _DEFAULT_FILE_NAME)
        _ = (
            p
            | 'ReadData.' + split >> beam.io.ReadFromTFRecord(file_pattern=input_uri)
            | 'DecodeData.' + split >> tf_example_decoder.DecodeTFExample()
            | 'GenerateStatistics.' + split >> stats_api.GenerateStatistics(stats_options)
            | 'WriteStatsOutput.' + split >> beam.io.WriteToTFRecord(
                output_path, shard_name_template='',
                coder=beam.coders.ProtoCoder(
                    statistics_pb2.DatasetFeatureStatisticsList)))
        tf.logging.info('Statistics written to {}.'.format(output_uri))
def preprocessing_fn(inputs):
  ...
  return outputs


with beam.Pipeline() as pipeline:
  ...
  raw_data = (
      pipeline
      | 'ReadTrainData' >> beam.io.ReadFromText(train_data_file)
      | 'FixCommasTrainData' >> beam.Map(
          lambda line: line.replace(', ', ','))
      | 'DecodeTrainData' >> MapAndFilterErrors(converter.decode))

  # raw_dataset pairs raw_data with its metadata (elided here)
  transformed_dataset, transform_fn = (
      raw_dataset | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
  ...
[Diagram: the same TFX Config and components (Driver and Publisher, Executor, Metadata Store) can run on different orchestrators: an Airflow Runtime, a Kubeflow Runtime, or your own runtime. Supported orchestrators today: Airflow and Kubeflow Pipelines.]
TFX: Putting it all together.
[Diagram: a TFX Config drives the full pipeline (ExampleGen, StatisticsGen, SchemaGen, Example Validator, Transform, Trainer, Evaluator, Model Validator, Pusher) on an Airflow or Kubeflow runtime, backed by the Metadata Store. Training + eval data flow in; models are pushed to TensorFlow Serving, TensorFlow Hub, TensorFlow Lite, or TensorFlow JS.]
Component: ExampleGen

Configuration:
  examples = csv_input(os.path.join(data_root, 'simple'))
  example_gen = CsvExampleGen(input_base=examples)

Inputs and outputs: raw data (CSV or TF Record) in; split TF Record data (training and eval) out.
Component: StatisticsGen

Configuration:
  statistics_gen = StatisticsGen(input_data=example_gen.outputs.examples)

Inputs and outputs: data from ExampleGen in; statistics out. Results can be visualized.

Analyzing Data with TensorFlow Data Validation
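The same analysis can also be run directly with the TensorFlow Data Validation library, for example in a notebook; the CSV paths below are assumptions for illustration.

import tensorflow_data_validation as tfdv

# Compute statistics, infer a schema, and check new data against it.
train_stats = tfdv.generate_statistics_from_csv(data_location='data/train.csv')
schema = tfdv.infer_schema(statistics=train_stats)

eval_stats = tfdv.generate_statistics_from_csv(data_location='data/eval.csv')
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

tfdv.visualize_statistics(train_stats)   # notebook visualization
tfdv.display_anomalies(anomalies)        # features that violate the schema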
Component: SchemaGen

Configuration:
  infer_schema = SchemaGen(stats=statistics_gen.outputs.output)

Inputs and outputs: statistics from StatisticsGen in; schema out. Results can be visualized.
Component: ExampleValidator
Example
Validator
Statistics Schema
StatisticsGen SchemaGen
Inputs and Outputs
Anomalies
Report
validate_stats = ExampleValidator(
stats=statistics_gen.outputs.output,
schema=infer_schema.outputs.output)
Configuration
Visualization
Component: Transform

Configuration:
  transform = Transform(
      input_data=example_gen.outputs.examples,
      schema=infer_schema.outputs.output,
      module_file=taxi_module_file)

Code (in the module file):
  for key in _DENSE_FLOAT_FEATURE_KEYS:
    outputs[_transformed_name(key)] = transform.scale_to_z_score(
        _fill_in_missing(inputs[key]))
  # ...
  outputs[_transformed_name(_LABEL_KEY)] = tf.where(
      tf.is_nan(taxi_fare),
      tf.cast(tf.zeros_like(taxi_fare), tf.int64),
      # Test if the tip was > 20% of the fare.
      tf.cast(
          tf.greater(tips, tf.multiply(taxi_fare, tf.constant(0.2))), tf.int64))
  # ...

Inputs and outputs: data from ExampleGen, schema from SchemaGen, and user code in; a transform graph and transformed data out, consumed by the Trainer.

Using TensorFlow Transform for Feature Engineering: the same transform graph is applied at both training and serving time.
Component: Trainer

Configuration:
  trainer = Trainer(
      module_file=taxi_module_file,
      transformed_examples=transform.outputs.transformed_examples,
      schema=infer_schema.outputs.output,
      transform_output=transform.outputs.transform_output,
      train_steps=10000,
      eval_steps=5000,
      warm_starting=True)

Code: just TensorFlow :)

Inputs and outputs: transformed data and transform graph from Transform, schema from SchemaGen, and user code in; model(s) out, consumed by the Evaluator, Model Validator, and Pusher.
Component: Evaluator

Configuration:
  model_analyzer = Evaluator(
      examples=example_gen.outputs.output,
      eval_spec=taxi_eval_spec,
      model_exports=trainer.outputs.output)

Inputs and outputs: data from ExampleGen and model from Trainer in; evaluation metrics out. Results can be visualized.
Component: ModelValidator

Configuration:
  model_validator = ModelValidator(
      examples=example_gen.outputs.output,
      model=trainer.outputs.output,
      eval_spec=taxi_mv_spec)

● Configuration options
  ○ Validate using current eval data
  ○ “Next-day eval”, validate using unseen data

Inputs and outputs: data from ExampleGen and two models (the new candidate and the last validated) in; validation outcome out.
Component: Pusher

Configuration:
  pusher = Pusher(
      model_export=trainer.outputs.output,
      model_blessing=model_validator.outputs.blessing,
      serving_model_dir=serving_model_dir)

● Block push on validation outcome
● Push destinations supported today
  ○ Filesystem (TensorFlow Lite, TensorFlow JS)
  ○ TensorFlow Serving

Inputs and outputs: the validation outcome from the Model Validator (plus the exported model) in; pushes the model to the configured deployment options.
Apache Beam and Apache Flink

Apache Beam: "Sum per key" expressed in each SDK:
  Python: input | Sum.PerKey()
  Java:   input.apply(Sum.integersPerKey())
  Go:     stats.Sum(s, input)
  SQL:    SELECT key, SUM(value) FROM input GROUP BY key

Runners: Cloud Dataflow, Apache Spark, Apache Flink, Apache Apex, Gearpump, Apache Samza, Apache Nemo (incubating), IBM Streams.
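For reference, a complete, runnable Python version of the "Sum per key" example above might look like the following sketch; in the Python SDK the idiomatic spelling is beam.CombinePerKey(sum).

import apache_beam as beam

with beam.Pipeline() as p:   # DirectRunner by default
  (p
   | beam.Create([('a', 1), ('a', 2), ('b', 5)])
   | 'SumPerKey' >> beam.CombinePerKey(sum)
   | beam.Map(print))        # prints ('a', 3) and ('b', 5)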
PTransforms
● More transforms available in Java than Python
● Python can invoke Java transforms (coming soon):
    with self.create_pipeline() as p:
      res = (
          p | GenerateSequence(start=1, stop=10,
                               expansion_service=expansion_address))
  GenerateSequence is written in Java.
I/O
● More I/O available in Java than Python
● Python can invoke Java I/O (coming soon)

Beam Java I/O connectors:
● File systems: Apache HDFS, Amazon S3, Google Cloud Storage, and local filesystems
● File-based: FileIO (general-purpose reading, writing, and matching of files), AvroIO, TextIO, TFRecordIO, XmlIO, TikaIO, ParquetIO
● Messaging: RabbitMqIO, SqsIO, Amazon Kinesis, AMQP, Apache Kafka, Google Cloud Pub/Sub, JMS, MQTT
● Database: Apache Cassandra, Apache Hadoop Input/Output Format, Apache HBase, Apache Hive (HCatalog), Apache Kudu, Apache Solr, Elasticsearch (v2.x, v5.x, v6.x), Google BigQuery, Google Cloud Bigtable, Google Cloud Datastore, Google Cloud Spanner, JDBC, MongoDB, Redis
Per element: ParDo (Map, etc.)
● Every item processed independently
● Stateless implementation

Per key: Combine (Reduce, etc.)
● Items grouped by some key and combined
● Stateful streaming implementation
● But your code doesn't work with state, just an associative & commutative function
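A small sketch of the two styles side by side; the record format 'user_id,amount' is made up for illustration.

import apache_beam as beam


class ParseRecordFn(beam.DoFn):
  """Per-element (ParDo): each record is parsed independently, with no state."""

  def process(self, record):
    user_id, amount = record.split(',')
    yield user_id, float(amount)


with beam.Pipeline() as p:
  (p
   | beam.Create(['u1,3.5', 'u2,1.0', 'u1,2.5'])
   | 'PerElement' >> beam.ParDo(ParseRecordFn())   # Map-like, stateless
   | 'PerKey' >> beam.CombinePerKey(sum)           # Reduce-like, grouped by key
   | beam.Map(print))                              # ('u1', 6.0), ('u2', 1.0)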
Event Time Windowing
[Diagram: elements grouped into windows based on their event time (e.g., the 8:00 window), not on the time they arrive for processing.]
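A minimal sketch of event-time windowing in the Python SDK; the timestamps and the 60-second window size are assumptions for illustration.

import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
  (p
   | beam.Create([(0, 'click'), (30, 'click'), (65, 'click')])
   # Attach event-time timestamps (seconds since epoch here).
   | 'Timestamp' >> beam.Map(lambda te: window.TimestampedValue(te[1], te[0]))
   # Group elements by the event-time window they fall into, not arrival time.
   | 'Window' >> beam.WindowInto(window.FixedWindows(60))
   | 'PairWithOne' >> beam.Map(lambda v: (v, 1))
   | 'CountPerWindow' >> beam.CombinePerKey(sum)
   | beam.Map(print))   # one count for the [0, 60) window, one for [60, 120)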
Classic parallel I/O
[Diagram: workers-vs-time charts for non-parallel execution, "embarrassingly parallel" (idealized), and "embarrassingly parallel" (actual, most systems).]
Beam's dynamic work rebalancing
[Diagram: workers-vs-time charts without and with dynamic work rebalancing.]
Beam's APIs make this the default approach.
Beam's dynamic work rebalancing in practice:
A classic MapReduce job (read from Google Cloud Storage, GroupByKey, write to Google Cloud Storage), 400 workers. Dynamic Work Rebalancing disabled to demonstrate stragglers. X axis: time (total ~20 min.); Y axis: workers.
Same job, Dynamic Work Rebalancing enabled by Beam's Splittable DoFn. X axis: time (total ~15 min.); Y axis: workers. Savings!
Dataflow's Liquid Sharding
● Monitors worker progress and identifies stragglers
● Asks stragglers to give away part of their unprocessed work (e.g., a sub-range of a file or a key range)
● Schedules new work items onto idle workers
● Repeats for the next stragglers
The amount of work to give away is chosen so that the worker is expected to complete soon enough to stop being a straggler. Non-trivial to implement.
How does Beam map to Flink?

Beam's Flink Runner: Beam ParDo
Element-wise transformation parameterized by a chunk of user code. Elements are processed in bundles, with initialization and termination hooks. Bundle size is chosen by the runner and cannot be controlled by user code. ParDo processes a main input PCollection one element at a time, but provides side input access to additional PCollections.
Flink Python Runner: Yes, fully supported. ParDo itself, as a per-element transformation with UDFs, is fully supported by Flink for both batch and streaming.
Beam's Flink Runner: Beam GroupByKey
Grouping of key-value pairs per key, window, and pane.
Flink Python Runner: Yes, fully supported. Uses Flink's keyBy for key grouping. When grouping by window in streaming (creating the panes), the Flink runner uses the Beam code. This guarantees support for all windowing and triggering mechanisms.
Beam's Flink Runner: Beam Stateful Processing
Allows fine-grained access to per-key, per-window persistent state and timers. Timers are integral to stateful processing. Necessary for certain use cases (e.g., high-volume windows which store large amounts of data but typically only access small portions of it; complex state machines) that are not easily or efficiently addressed via Combine or GroupByKey+ParDo.
Flink Python Runner: Partially supported (non-merging windows). State is supported for non-merging windows. MapState is fully supported.
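As an illustration of the per-key state API (not from the talk), here is a sketch of a stateful DoFn that buffers integers per key and emits them in batches of three; the batch size and the data are arbitrary assumptions.

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import BagStateSpec


class BatchPerKeyFn(beam.DoFn):
  # Per-key, per-window bag state holding the values seen so far.
  BUFFER = BagStateSpec('buffer', VarIntCoder())

  def process(self, kv, buffer=beam.DoFn.StateParam(BUFFER)):
    key, value = kv
    buffer.add(value)
    current = list(buffer.read())
    if len(current) >= 3:      # arbitrary batch size for the sketch
      buffer.clear()
      yield key, current


with beam.Pipeline() as p:
  (p
   | beam.Create([('k', i) for i in range(7)])
   | beam.ParDo(BatchPerKeyFn())
   | beam.Map(print))   # emits ('k', [...]) for each full batch of 3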
Beam's Flink Runner: Beam Splittable DoFn (SDF)
Allows users to develop DoFns that process a single element in portions ("restrictions"), executed in parallel or sequentially. This supersedes the unbounded and bounded `Source` APIs by supporting all of their features on a per-element basis. See http://s.apache.org/splittable-do-fn. Design is in progress on achieving parity with the Source API regarding progress signals.
Flink Python Runner: Not supported.
github.com/tensorflow/tfx
tensorflow.org/tfx