The document outlines lessons learned from managing data science projects, emphasizing the importance of understanding business problems, being data-driven, and establishing clear performance metrics. It advocates for an end-to-end solution, the involvement of human expertise, and the necessity of a standardized process for reproducibility in data science. Key tips include embedding data science teams within the business, promoting continuous learning, and using cloud resources effectively.
"Our strategy is to build best-in-class platforms and productivity services for an intelligent cloud and an intelligent edge infused with artificial intelligence ("AI")."
(Microsoft Form 10-K, 2016)
What is the business problem that needs to be solved, independent of the technology solution?
What decision or action must be taken that can be informed by data?
Understanding the Decision Process
Key Decision: Should I service this piece of equipment?
Data Science Question: What is the probability this equipment will fail within the next X days?
Framing the Data Science Question Based on the Scenario

Business Scenario: Energy Forecasting
  Key Decision: Should I buy or sell energy contracts?
  Data Science Question: What will be the long-/short-term demand for energy in a region?

Business Scenario: Customer Churn
  Key Decision: Which customers should I prioritize to reduce churn?
  Data Science Question: What is the probability of churn within X days for each customer?

Business Scenario: Personalized Marketing
  Key Decision: Which product should I offer first?
  Data Science Question: What is the probability that the customer will purchase each product?

Business Scenario: Product Feedback
  Key Decision: Which service/product needs attention?
  Data Science Question: What is the social media sentiment for each service/product?
Using Performance Metrics
1. Establish a qualitative objective.
2. Translate it into a quantifiable metric.
3. Quantify the metric-value improvement that would be useful (e.g., 10% fewer failures = savings of $1MM/year).
4. Establish a baseline (e.g., current failure rate = 10% per year).
5. Establish how to measure the improvement in the metric with the data science solution (e.g., 80% of the equipment maintained based on the predictive model).
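The arithmetic behind such a metric target can be made explicit up front. A minimal sketch, using hypothetical numbers consistent with the examples above (a 10% baseline failure rate, a 10% relative reduction; fleet size and cost per failure are assumed for illustration):

```python
# Sketch: quantify the value of a metric improvement (hypothetical numbers).
fleet_size = 10_000           # pieces of equipment (assumed)
baseline_failure_rate = 0.10  # baseline: 10% fail per year
cost_per_failure = 1_000.0    # dollars per failure (assumed)

# Target from the example: 10% fewer failures.
relative_reduction = 0.10
improved_rate = baseline_failure_rate * (1 - relative_reduction)

baseline_cost = fleet_size * baseline_failure_rate * cost_per_failure
improved_cost = fleet_size * improved_rate * cost_per_failure
annual_savings = baseline_cost - improved_cost

print(f"Failures avoided per year: {fleet_size * (baseline_failure_rate - improved_rate):.0f}")
print(f"Annual savings: ${annual_savings:,.0f}")
```

Writing the calculation down makes the baseline and the improvement measurable before any model is built.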
Tips:
1. Embed the data science team within the business.
2. Allow exploring multiple problem formulations to get to the end metric goal.
3. Pursue the goal within a set time period.
4. Ensure reproducibility.
1. Set up the end-to-end solution and the metrics.
2. Launch with a baseline/simple model.
3. Act on the recommendations of the solution.
4. Measure and iterate.
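Step 2 ("launch with a baseline/simple model") can be as simple as a majority-class predictor that every later model must beat. A minimal sketch over hypothetical churn labels (the data and names are illustrative, not from the deck):

```python
from collections import Counter

# Hypothetical historical churn labels: 1 = churned, 0 = stayed.
history = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# Baseline model: always predict the majority class.
majority_class = Counter(history).most_common(1)[0][0]

def baseline_predict(_customer):
    return majority_class

# Metric for the baseline: accuracy on the history (illustrative only).
accuracy = sum(baseline_predict(None) == y for y in history) / len(history)
print(f"Baseline predicts {majority_class}, accuracy {accuracy:.0%}")
# Any candidate model must beat this number before replacing the baseline.
```

Launching with such a baseline gets the end-to-end pipeline and metrics working first; iteration then improves the model inside a solution that already runs.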
• Empower ALL to perform like the BEST
• Automate repetitive human tasks
• Embed expert knowledge into the solution
Users don't trust black-box models
• How to interpret the model?
• Importance of features
• Bias in the model
• Interpreting predictions per instance
• What-if analysis
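Feature importance, one of the bullets above, can be estimated even for a black-box scorer by permutation: shuffle one feature's values and see how much the model's accuracy degrades. A minimal hand-rolled sketch (the model and data are hypothetical):

```python
import random

random.seed(0)

# Hypothetical black-box model: only feature 0 actually matters.
def model_score(row):
    return 1 if row[0] > 0.5 else 0

# Hypothetical labeled data.
data = [[random.random(), random.random()] for _ in range(200)]
labels = [1 if row[0] > 0.5 else 0 for row in data]

def accuracy(rows):
    return sum(model_score(r) == y for r, y in zip(rows, labels)) / len(labels)

base_acc = accuracy(data)

# Permutation importance: shuffle one column, measure the accuracy drop.
importances = []
for col in range(2):
    shuffled_col = [row[col] for row in data]
    random.shuffle(shuffled_col)
    permuted = [row[:] for row in data]
    for row, v in zip(permuted, shuffled_col):
        row[col] = v
    importances.append(base_acc - accuracy(permuted))

print(importances)  # feature 0 should matter far more than feature 1
```

The same idea powers per-instance what-if analysis: perturb one input of a single prediction and observe how the output moves.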
1. Learn from experiments
  • Why?
  • Both successes and failures
2. Share the learnings
3. Promote successful experiments to production
4. Move on to the next hypothesis to experiment with
• Failure is a valid outcome of an experiment
• Learn and refine the next experiment
A process specifies a detailed sequence of activities necessary to perform specific business tasks. It is used to standardize procedures and establish best practices.
Microsoft's Team Data Science Process
https://aka.ms/tdsp
• Standard project lifecycle
• Standardized document templates, project structure
• Shared, distributed resources
• Productivity tools, shared utilities
• Data science virtual machines (DSVMs) as the fundamental development platform on the cloud
• Use Visual Studio Team Services (VSTS)
  • Work item tracking and scrum planning
  • Git repositories
• Shared data science utilities in a Git repository
• Use cloud-based Azure resources as needed
• Terminology:
  • Feature: a project
  • Story: a stage in the E2E process of a DS project
  • Task: specific coding, documentation, or other activities needed to complete a story
  • Iteration: usually a 2-week sprint
[Diagram: ML lifecycle management (processes, templates, permissions). Data scientist side: IDE and training environment; training + test code runs against a data lake; models are trained and tested with CNTK/TF/scikit-learn/Keras/…, with continuous retraining driven by app telemetry; trained models are published to model storage. App developer side: IDE and app code under source control; CI/CD pipelines build and test the model + app; the model is embedded in the app and deployed to cloud services, apps, and edge devices, with A/B testing in production. Example model output: [{"cat": 0.99218, "feline": 0.81242, "puma": 0.45456}]]
Model Source Control
• Processes and procedures to make models reproducible (from source control to data retention policies)
• Make it easy to work on multiple models (consistent process)
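Reproducibility, as described above, means being able to tie a trained model back to the exact code, data, and parameters that produced it. A minimal sketch of one way to do this with content hashes (the file names and layout are assumptions for illustration, not TDSP prescriptions):

```python
import hashlib
import json

def fingerprint(training_code: str, data_rows: list, params: dict) -> str:
    """Hash everything that determines the trained model."""
    h = hashlib.sha256()
    h.update(training_code.encode())
    h.update(json.dumps(data_rows, sort_keys=True).encode())
    h.update(json.dumps(params, sort_keys=True).encode())
    return h.hexdigest()[:12]

code = "def train(data): ..."
data = [[1.0, 0], [2.0, 1]]
params = {"lr": 0.01, "epochs": 10}

tag = fingerprint(code, data, params)
print(f"model-{tag}")  # store the model artifact under this tag
# Re-running with identical inputs yields the same tag -> reproducible;
# any change to code, data, or params yields a new tag.
```

A consistent tagging scheme like this also makes it easy to work on multiple models side by side, since every artifact is unambiguously named.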
Model Validation
• Unit testing, functional testing, and performance testing
• Validation needs to be performed both in isolation and when embedded in an application
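A unit-level model test can be as small as asserting output range and expected behavior on known cases, run in isolation before app integration. A minimal sketch using a hypothetical failure-probability function (all names and logic are illustrative):

```python
# Hypothetical model under test: returns a failure probability.
def predict_failure_probability(sensor_readings):
    # Stand-in scoring logic for the sketch.
    return min(1.0, max(0.0, 0.1 + 0.05 * sum(sensor_readings)))

def validate_model(predict):
    """Unit-level checks run in isolation, before app integration."""
    # Range check: output must be a valid probability.
    for readings in ([0, 0], [10, 10], [-5, 0]):
        p = predict(readings)
        assert 0.0 <= p <= 1.0, f"probability out of range: {p}"
    # Sanity check on a known case: worse readings -> higher risk.
    assert predict([10, 10]) >= predict([0, 0])
    return True

print(validate_model(predict_failure_probability))  # True when all checks pass
```

The same checks would then be rerun with the model embedded in the application, where serialization and interface mismatches can introduce new failures.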
Model Versioning & Storage
• Provide a consistent way to store & share models, plus a way to track where models are embedded / running
• Provide a consistent model format
• Provide traceability on where a model came from (which data, which experiment, where's the code / notebook)
• Control who has access to which models
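The bullets above amount to a model registry: one record per model version carrying lineage and access metadata. A minimal in-memory sketch (the field names are assumptions for illustration):

```python
from datetime import datetime, timezone

# In-memory model registry: maps model name -> list of version records.
registry: dict[str, list[dict]] = {}

def register_model(name, model_bytes, data_ref, experiment_id, code_ref, owners):
    """Store a model version with the lineage needed for traceability."""
    versions = registry.setdefault(name, [])
    record = {
        "version": len(versions) + 1,
        "format": "pickle",             # one consistent model format
        "artifact": model_bytes,
        "data": data_ref,               # which data produced it
        "experiment": experiment_id,    # which experiment
        "code": code_ref,               # where the code / notebook lives
        "owners": owners,               # who may access it
        "deployed_to": [],              # track where the model is running
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    versions.append(record)
    return record

rec = register_model("churn-model", b"<bytes>", "sales_2016Q4.csv",
                     "exp-42", "notebooks/churn.ipynb", ["ds-team"])
rec["deployed_to"].append("scoring-service-west")
print(rec["version"], rec["deployed_to"])
```

A production registry would back this with durable storage and access control, but the record shape captures the traceability the slide calls for.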
Model Deployment
• Provide an efficient process to get a model built into an application or service and leveraged to light up an end-user scenario
• Simplify the process of interacting with the model (through code generation, API specifications / interfaces, or other methods)
• Support a variety of inferencing targets (cloud / app / edge), including FPGAs and dedicated frameworks like CoreML & WinML
• Provide secrets / service-endpoint management to remove friction from configuring the release process
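The "simplify interaction" bullet can be sketched as a thin scoring interface that hides the model behind a stable JSON contract, whatever the inferencing target. A hypothetical sketch (not an actual deployment tool; the model is a stub):

```python
import json

# Hypothetical trained model: tag probabilities for an input, in the
# spirit of the [{"cat": ..., "feline": ...}] output in the lifecycle diagram.
def model(features):
    return {"cat": 0.99, "feline": 0.81}

def score_endpoint(request_body: str) -> str:
    """Stable JSON contract between the app and the model.

    The app never touches the model directly; the same handler can sit
    behind a cloud service, an in-app call, or an edge runtime.
    """
    request = json.loads(request_body)
    predictions = model(request["features"])
    return json.dumps({"predictions": predictions})

response = score_endpoint('{"features": [0.1, 0.2, 0.3]}')
print(response)
```

Because only the JSON contract is exposed, the model behind it can be retrained or swapped without changing application code, which is what makes the release process low-friction.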
Building an Org's Toolbox
• Data exploration
• RFM (recency, frequency, monetary) user behavior modeling
• Hyperparameter tuning
• Auto-featurization
Note: domain expertise is still helpful
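RFM, listed above, scores each user on the Recency, Frequency, and Monetary value of their transactions. A minimal sketch over hypothetical purchase records (customer names, dates, and amounts are invented for illustration):

```python
from datetime import date

# Hypothetical purchase log: (customer, purchase_date, amount).
purchases = [
    ("alice", date(2016, 6, 1), 120.0),
    ("alice", date(2016, 6, 20), 80.0),
    ("bob",   date(2016, 1, 5), 300.0),
]
today = date(2016, 7, 1)

def rfm(customer):
    rows = [(d, amt) for c, d, amt in purchases if c == customer]
    recency = min((today - d).days for d, _ in rows)  # days since last purchase
    frequency = len(rows)                             # number of purchases
    monetary = sum(amt for _, amt in rows)            # total spend
    return recency, frequency, monetary

print(rfm("alice"))  # (11, 2, 200.0)
print(rfm("bob"))    # (178, 1, 300.0)
```

These three numbers are typically binned into scores and fed into segmentation or churn models, which is why RFM earns a place in a reusable toolbox.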