1 7 J U N E 2 0 2 1
S C A L I N G A I I N P R O D U C T I O N
U S I N G P Y T O R C H
G E E T A C H A U H A N
PyTorch Partner Engineering, Facebook AI
@ C H A U H A N G
MLOps World 2021
A G E N D A

0 1   C H A L L E N G E S   W I T H   M L   I N   P R O D U C T I O N

0 2   T O R C H S E R V E   O V E R V I E W

0 3   B E S T   P R A C T I C E S   F O R   P R O D U C T I O N   D E P L O Y M E N T
MLOps World 2021
P Y T O R C H C O M M U N I T Y G R O W T H
Source: https://paperswithcode.com/trends
MLOps World 2021
[Diagram: a production ML pipeline running in the cloud or on-prem, with preprocessing, application logic, and postprocessing around the model]

Key challenges: Performance · Ease of use · Cost efficiency · Deployment at scale
C H A L L E N G E S W I T H M L I N D E P L O Y M E N T
MLOps World 2021
INFERENCE AT SCALE

Deploying and managing models in production is difficult. Some of the pain points include:

• Loading and managing multiple models, on multiple servers or end devices

• Running pre-processing and post-processing code on prediction requests

• How to log, monitor and secure predictions

• What happens when you hit scale?
MLOps World 2021
TORCHSERVE
Easily deploy PyTorch models in production at scale


D E F A U L T   H A N D L E R S   F O R   C O M M O N   T A S K S

L O W   L A T E N C Y   M O D E L   S E R V I N G

W O R K S   W I T H   A N Y   M L   E N V I R O N M E N T
MLOps World 2021
• Default handlers for common use
cases (e.g., image segmentation,
text classification) along with
custom handlers support for other
use cases and a Model Zoo


• Multi-model serving, Model
versioning and ability to roll back
to an earlier version


• Automatic batching of individual
inferences across HTTP requests
• Logging including common
metrics, and the ability to
incorporate custom metrics


• Robust HTTP APIs: Management and Inference
[Architecture diagram: torch-model-archiver packages model .pth files into .mar archives (model1.mar … model5.mar) stored under <path>/model_store; torchserve --start launches the server, which loads and serves multiple models concurrently and exposes the Inference API (http://localhost:8080/ …), the Management API (http://localhost:8081/ …), and the Metrics API, along with logging and metrics]
TORCHSERVE
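A minimal sketch of talking to these endpoints from Python, assuming TorchServe is running locally on the default ports and that a model named my_model (a placeholder) has been registered:

# Minimal sketch; ports are the TorchServe defaults, model name is a placeholder.
import requests

# Inference API (default port 8080): health check
print(requests.get("http://localhost:8080/ping").json())

# Management API (default port 8081): list registered models, then describe one
print(requests.get("http://localhost:8081/models").json())
print(requests.get("http://localhost:8081/models/my_model").json())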
T O R C H S E R V E D E T A I L :


M O D E L H A N D L E R S
TorchServe has default model handlers that
perform boilerplate data transforms for
common cases:


• Image Classification


• Image Segmentation


• Object Detection


• Text Classification


You can also create custom model handlers
for any model and inference task.
import torch


class MyModelHandler(object):

    def __init__(self):
        self.initialized = False
        self.model = None
        self.device = None

    def initialize(self, context):
        # get GPU status & device handle
        # load model & supporting files (vocabularies etc.)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # self.model = torch.jit.load(..., map_location=self.device)
        self.initialized = True

    def preprocess(self, data):
        # put incoming data into a tensor
        # transform as needed for your model
        return data

    def inference(self, data):
        # do predictions
        return data

    def postprocess(self, output):
        # process inference output, e.g. extracting top K
        # package output for web delivery
        return output


_service = MyModelHandler()


def handle(data, context):
    # module-level entry point called by TorchServe
    if not _service.initialized:
        _service.initialize(context)
    if data is None:
        return None
    data = _service.preprocess(data)
    data = _service.inference(data)
    data = _service.postprocess(data)
    return data
M O D E L A R C H I V E
torch-model-archiver: a CLI tool for packaging all
model artifacts into a single deployment unit

• model checkpoints, or a model definition file
with state_dict

• TorchScript and eager mode support

• Extra files like vocab, config, index_to_name
mapping


torch-model-archiver \
  --model-name BERTSeqClassification_Torchscript \
  --version 1.0 \
  --serialized-file Transformer_model/traced_model.pt \
  --handler ./Transformer_handler_generalized.py \
  --extra-files "./setup_config.json,./Seq_classification_artifacts/index_to_name.json"


setup_config.json

{
  "model_name": "bert-base-uncased",
  "mode": "sequence_classification",
  "do_lower_case": "True",
  "num_labels": "2",
  "save_mode": "torchscript",
  "max_length": "150"
}


torchserve --start \
  --model-store model_store \
  --models <path-to model-file/s3-url/azure-blob-url>
https://github.com/pytorch/serve/tree/master/model-archiver#creating-a-model-archive
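A hedged sketch of calling the served model once the archive above is registered; it assumes the generalized Transformer handler accepts raw text in the request body:

# Sketch: send an inference request to the model registered above; the input
# text is an example and the handler's expected payload is an assumption.
import requests

url = "http://localhost:8080/predictions/BERTSeqClassification_Torchscript"
resp = requests.post(url, data="Bloomberg has decided to publish a new report on global markets.")
print(resp.status_code, resp.text)   # e.g. the predicted sequence label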
D Y N A M I C B A T C H I N G
Via Custom Handlers


• Model Configuration based


• batch_size: the maximum batch size

• max_batch_delay: the maximum time (ms) TorchServe waits
to receive batch_size requests before running inference

• (Coming soon) Batching support in default
handlers


curl localhost:8081/models/resnet-152

{
  "modelName": "resnet-152",
  "modelUrl": "https://s3.amazonaws.com/model-server/model_archive_1.0/examples/resnet-152-batching/resnet-152.mar",
  "runtime": "python",
  "minWorkers": 1,
  "maxWorkers": 1,
  "batchSize": 8,
  "maxBatchDelay": 10,
  "workers": [
    {
      "id": "9008",
      "startTime": "2019-02-19T23:56:33.907Z",
      "status": "READY",
      "gpu": false,
      "memoryUsage": 607715328
    }
  ]
}


https://github.com/pytorch/serve/blob/master/docs/batch_inference_with_ts.md
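A hedged sketch of how batch_size and max_batch_delay are typically supplied when registering a model through the Management API; the .mar URL is the one from the response above, and the worker count and delay values are examples:

# Sketch: register a model with batching parameters via the Management API.
import requests

params = {
    "url": "https://s3.amazonaws.com/model-server/model_archive_1.0/examples/resnet-152-batching/resnet-152.mar",
    "batch_size": 8,          # max requests aggregated into a single batch
    "max_batch_delay": 10,    # ms to wait for a full batch before running inference
    "initial_workers": 1,
}
print(requests.post("http://localhost:8081/models", params=params).json())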
M E T R I C S
Out-of-the-box metrics with the ability to extend


• CPU, Disk, Memory utilization


• Request type counts


• ts.metrics class for extension


• Types supported - Size, percentage, counter,
general metric


• Prometheus metrics support available


# Access context metrics as follows
metrics = context.metrics

# Create Dimension objects
from ts.metrics.dimension import Dimension

# Dimensions are name-value pairs
dim1 = Dimension(name, value)
# ...
dimN = Dimension(name_n, value_n)

# Add Distance as a generic metric
# dimensions = [dim1, dim2, dim3, ..., dimN]
metrics.add_metric('DistanceInKM', distance, 'km', dimensions=dimensions)

# Add Image size as a size metric
metrics.add_size('SizeOfImage', img_size, None, 'MB', dimensions)

# Add MemoryUtilization as a percentage metric
metrics.add_percent('MemoryUtilization', utilization_percent, None, dimensions)

# Create a counter with name 'LoopCount' and dimensions
metrics.add_counter('LoopCount', 1, None, dimensions)

# Log custom metrics
for metric in metrics.store:
    logger.info("[METRICS]%s", str(metric))


https://github.com/pytorch/serve/blob/master/docs/metrics.md
MLOps World 2021
RECENT FEATURES
+ Ensemble Model support, Captum Model Interpretability


+ Kubeflow Pipelines /KFServing Integration with Auto-scaling and Canary rollout on any cloud/on-prem


+ GCP Vertex AI Serverless pipelines


+ MLflow Integration




+ Prometheus Integration with Grafana


+ Multiple nodes on EC2, Autoscaling on SageMaker/EKS, AWS Inferentia support


+ MMF, NMT, DeepLabV3 new examples




Deployment models: Standalone · Primary/backup · Orchestration · Cloud vs. on-premises

Optimizations: Performance vs. latency · TorchScript profiling · Offline vs. real-time · Cost

Resilience: Robust endpoint · Auto-scaling · Canary deployments · A/B testing

Measurement: Metrics · Model performance · Interpretability · Feedback loop

Responsible AI: Fairness · Human-centered design
B E S T P R A C T I C E S F O R P R O D U C T I O N D E P L O Y M E N T S
MLOps World 2021
Fairness by design


• Measure skewness of data, model bias, data bias; identify relevant metrics


• Transparency, Explainable AI, inclusive design


Human-centered design


• Consider AI-driven decisions and their impact on people at the time of model design


• Provide the ability for human recourse vs. full automation – for example, avoid a mortgage-application
AI rejecting people of a certain category or race


• For computer vision models, measure results across demographics; for example, include support for different
skin tones and age groups
R E S P O N S I B L E A I
MLOps World 2021
• Build with performance vs. latency goals in mind


• Reduce size of the model: Quantization, pruning, mixed precision training


• Reduce latency: TorchScript model; use SnakeViz profiler


• Evaluate GPU vs. CPU for low latency


• Evaluate REST vs. gRPC for your prediction service
O P T I M I Z A T I O N S
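A small sketch of the first two bullets, assuming a placeholder Linear stack in place of a real Linear/LSTM-heavy model and an arbitrary output file name:

# Sketch: shrink a model with post-training dynamic quantization, then TorchScript it for serving.
import torch
import torch.nn as nn

# Placeholder model standing in for a Linear/LSTM-heavy network (e.g. a Transformer encoder)
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2)).eval()

# Dynamic quantization: nn.Linear weights stored as int8, activations quantized on the fly
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# TorchScript the quantized model for lower-latency serving; the saved .pt file
# can be passed to torch-model-archiver via --serialized-file
example = torch.randn(1, 128)
traced = torch.jit.trace(quantized, example)
traced.save("model_quantized_traced.pt")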
MLOps World 2021
Quantization results (fp32 accuracy → int8 accuracy, change, technique, CPU inference speed-up):

• ResNet50 (Top-1, ImageNet): 76.1 → 75.9 (-0.2), post-training quantization, 2x speed-up (214 ms → 102 ms, Intel Skylake-DE)

• MobileNetV2 (Top-1, ImageNet): 71.9 → 71.6 (-0.3), quantization-aware training, 4x speed-up (75 ms → 18 ms, OnePlus 5, Snapdragon 835)

• Translate / FairSeq (BLEU, IWSLT 2014 de-en): 32.78 → 32.78 (0.0), dynamic quantization (weights only), 4x speed-up for the encoder (Intel Skylake-SE)
These models and more available on TorchHub - https://pytorch.org/hub/
QUANTIZATION
MLOps World 2021
B E R T


M O D E L


P R O F I L I N G


Eager Mode
MLOps World 2021
B E R T


M O D E L


P R O F I L I N G


Torchscript Mode


4x speedup
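A hedged sketch of this kind of profiling: comparing eager vs. TorchScript-traced BERT latency and writing a cProfile dump that SnakeViz can visualize; it assumes the HuggingFace transformers package, and the prompt, iteration count, and output file name are placeholders.

# Sketch: compare eager vs. TorchScript latency and dump a profile for SnakeViz.
import cProfile
import time
import torch
from transformers import BertModel, BertTokenizer   # assumed dependency

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True).eval()
inputs = tokenizer("TorchServe makes deployment easy", return_tensors="pt")

# Trace the model with example inputs (input_ids, attention_mask)
traced = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]))

def run(m):
    with torch.no_grad():
        for _ in range(20):
            m(inputs["input_ids"], inputs["attention_mask"])

for name, m in [("eager", model), ("torchscript", traced)]:
    start = time.time()
    run(m)
    print(name, f"{(time.time() - start) / 20 * 1000:.1f} ms / iteration")

cProfile.run("run(traced)", "bert.prof")   # visualize with: snakeviz bert.prof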
MLOps World 2021
Offline vs. real-time predictions


• Offline: Dynamic batching


• Online: Async processing – push/poll


• Pre-computed predictions for certain elements


Cost optimizations


• Spot Instances for offline


• Autoscaling based on metrics, on-demand cluster


• Evaluate supported AI accelerators, such as AWS Inferentia, for a lower cost point


O P T I M I Z A T I O N S ( C O N T D . )
MLOps World 2021
[Matrix slide: deployment options across on-prem, cloud, and managed environments, by stage (develop/test, staging/experiments, production, large-scale production, hybrid cloud) – install from source, standalone Docker, Minikube, self-managed Docker, AWS CloudFormation, cloud VMs/containers, microservices behind an API gateway, AWS SageMaker endpoints (BYOC), EKS/AKS/GKE, AWS SageMaker / GCP AI Platform, serverless functions, GCP Vertex AI with canary rollouts, Kubernetes with Kubeflow/KFServing, MLflow/Kubeflow, Databricks Managed MLflow, primary/backup, ML microservices, autoscaling and canary rollouts]
D E P L O Y I N G M O D E L S I N P R O D U C T I O N
MLOps World 2021
Create a robust endpoint for serving, for example a SageMaker endpoint

Auto-scaling with orchestrated deployments, multi-node on EC2, and other scenarios

Canary deployments: test a new version of a model on a small subset of traffic before making it
the default

Shadow inference: deploy a new version of the model in parallel

A/B testing of different versions of the model
R E S I L I E N C E
MLOps World 2021
Define model performance metrics, such as accuracy, while designing the AI service;
these are use-case specific


Add custom metrics as appropriate


Use CloudWatch or Prometheus dashboards for monitoring model performance


Model interpretability analysis via Captum


Deploy with a feedback loop; if model accuracy drops over time or with a new version,
analyze issues such as concept drift, stale data, etc.
M E A S U R E M E N T
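A minimal sketch of feeding such a dashboard or feedback loop, assuming TorchServe's Prometheus-format metrics endpoint on the default port 8082:

# Sketch: scrape TorchServe's Prometheus-format metrics so they can feed a
# Grafana/CloudWatch dashboard or an accuracy/drift feedback loop.
import requests

metrics_text = requests.get("http://localhost:8082/metrics").text
for line in metrics_text.splitlines():
    # e.g. ts_inference_requests_total, ts_inference_latency_microseconds, ...
    if line.startswith("ts_"):
        print(line)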
MLOps World 2021
Understand: How might the product’s goals, its policy, and its implementation affect users from
different subgroups? Identify contextual definitions of fairness

Align: Stakeholder conversations to find consensus and outline measurement and mitigation plans

Measure: Analyze model performance, label bias, outcomes, and other relevant signals

Mitigate: Address observed issues in dataset, models, policies, etc.

Monitor: Monitor the effect of mitigations on subgroups, and ensure the fairness analysis holds as
the product adapts
FAIRNESS BY DESIGN
CAPTUM
[Example visualization: multimodal attribution – Text Contributions: 7.54, Image Contributions: 11.19, Total Contributions: 18.73]
S U P P O R T   F O R   A T T R I B U T I O N   A L G O R I T H M S
T O   I N T E R P R E T :


• Output predictions with respect to inputs


• Output predictions with respect to layers


• Neurons with respect to inputs


• Currently provides gradient & perturbation based
approaches (e.g. Integrated Gradients)
Model interpretability library for PyTorch
https://captum.ai/
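A minimal Integrated Gradients sketch, assuming a toy classifier in place of a real model:

# Sketch: attribute a toy classifier's prediction to its input features with Integrated Gradients.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 3)).eval()
inputs = torch.randn(1, 10, requires_grad=True)

ig = IntegratedGradients(model)
attributions, delta = ig.attribute(inputs, target=1, return_convergence_delta=True)
print(attributions)   # per-feature contribution to the class-1 score
print(delta)          # convergence error of the integral approximation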
MLOps World 2021
DYNABOARD & FLORES 101 WMT COMPETITION
http://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html
https://github.com/facebookresearch/dynalab
https://dynabench.org/tasks/3#overall
MLOps World 2021
COMMUNITY PROJECTS
https://github.com/cceyda/torchserve-dashboard
https://github.com/Unity-Technologies/SynthDet
https://medium.com/pytorch/how-wadhwani-ai-uses-pytorch-to-empower-cotton-farmers-14397f4c9f2b
MLOps World 2021
FUTURE RELEASES
+ Improved memory and resource usage for better scalability


+ C++ Backend for lower latency


+ Enhanced profiling tools
• TorchServe: https://github.com/pytorch/serve


• Management API: https://github.com/pytorch/serve/blob/master/docs/management_api.md


• Inference API: https://github.com/pytorch/serve/blob/master/docs/inference_api.md


• Language Translation Ensemble example: https://github.com/pytorch/serve/tree/master/examples/Workflows/nmt_tranformers_pipeline


• BERT Model example: https://github.com/pytorch/serve/tree/master/examples/Huggingface_Transformers


• Model Zoo: https://github.com/pytorch/serve/blob/master/docs/model_zoo.md


• SnakeViz visualizations: https://github.com/pytorch/serve/tree/master/benchmarks#visualize-snakeviz-results


• Logging: https://github.com/pytorch/serve/blob/master/docs/logging.md


• Metrics: https://github.com/pytorch/serve/blob/master/docs/metrics.md


• Prometheus Metrics: https://github.com/pytorch/serve/blob/master/docs/metrics_api.md


• Batch Inference: https://github.com/pytorch/serve/blob/master/docs/batch_inference_with_ts.md


• Kubeflow Pipelines: https://github.com/kubeflow/pipelines/tree/master/components/PyTorch/pytorch-kfp-components


• Kubernetes support: https://github.com/pytorch/serve/blob/master/kubernetes/README.md


• TorchServe Dashboard (Community): https://cceyda.github.io/blog/torchserve/streamlit/dashboard/2020/10/15/torchserve.html


• Custom Handler community blog: https://towardsdatascience.com/deploy-models-and-create-custom-handlers-in-torchserve-fc2d048fbe91


• Captum Interpretability for BERT models: https://github.com/pytorch/serve/blob/master/captum/Captum_visualization_for_bert.ipynb


• Operationalize, Scale and Infuse Trust in AI using KFServing: https://blog.kubeflow.org/release/official/2021/03/08/kfserving-0.5.html


REFERENCES
QUESTIONS?


Contact:


Email: gchauhan@fb.com


Linkedin: https://www.linkedin.com/in/geetachauhan/
