The document outlines lessons learned from managing data science projects, emphasizing the importance of understanding business problems, being data-driven, and establishing clear performance metrics. It advocates for an end-to-end solution, the involvement of human expertise, and the necessity of a standardized process for reproducibility in data science. Key tips include embedding data science teams within the business, promoting continuous learning, and using cloud resources effectively.
"Our strategy is to build best-in-class platforms and productivity services for an intelligent cloud and an intelligent edge infused with artificial intelligence ("AI")."
(Microsoft Form 10-K, 2016)
What is the business problem that needs to be solved, independent of the technology solution?
What decision or action must be taken that can be informed by data?
Understanding the Decision Process
Key Decision: Should I service this piece of equipment?
Data Science Question: What is the probability this equipment will fail within the next X days?
Framing the Data Science Question Based on the Scenario

Business Scenario: Energy Forecasting
  Key Decision: Should I buy or sell energy contracts?
  Data Science Question: What will be the long-/short-term demand for energy in a region?

Business Scenario: Customer Churn
  Key Decision: Which customers should I prioritize to reduce churn?
  Data Science Question: What is the probability of churn within X days for each customer?

Business Scenario: Personalized Marketing
  Key Decision: Which product should I offer first?
  Data Science Question: What is the probability that the customer will purchase each product?

Business Scenario: Product Feedback
  Key Decision: Which service/product needs attention?
  Data Science Question: What is the social media sentiment for each service/product?
Using Performance Metrics
1. Establish a qualitative objective.
2. Translate it into a quantifiable metric.
3. Quantify the metric-value improvement that would be useful (e.g., 10% fewer failures = savings of $1MM/year).
4. Establish a baseline (e.g., current failure rate = 10% per year).
5. Establish how to measure the improvement in the metric with the data science solution (e.g., 80% of the equipment maintained based on the predictive model).
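The arithmetic behind such a metric target can be made explicit up front. A minimal sketch, using hypothetical numbers consistent with the examples above (a 10% baseline failure rate, a 10% relative reduction; fleet size and cost per failure are assumed for illustration):

```python
# Sketch: quantify the value of a metric improvement (hypothetical numbers).
fleet_size = 10_000           # pieces of equipment (assumed)
baseline_failure_rate = 0.10  # baseline: 10% fail per year
cost_per_failure = 1_000.0    # dollars per failure (assumed)

# Target from the example: 10% fewer failures.
relative_reduction = 0.10
improved_rate = baseline_failure_rate * (1 - relative_reduction)

baseline_cost = fleet_size * baseline_failure_rate * cost_per_failure
improved_cost = fleet_size * improved_rate * cost_per_failure
annual_savings = baseline_cost - improved_cost

print(f"Failures avoided per year: {fleet_size * (baseline_failure_rate - improved_rate):.0f}")
print(f"Annual savings: ${annual_savings:,.0f}")
```

Writing the calculation down makes the baseline and the improvement measurable before any model is built.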
Tips:
1. Embed the data science team within the business.
2. Allow exploring multiple problem formulations to get to the end metric goal.
3. Pursue the goal within a set time period.
4. Ensure reproducibility.
1. Set up the end-to-end solution and the metrics.
2. Launch with a baseline/simple model.
3. Act on the recommendations of the solution.
4. Measure and iterate.
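Step 2 ("launch with a baseline/simple model") can be as simple as a majority-class predictor that every later model must beat. A minimal sketch over hypothetical churn labels (the data and names are illustrative, not from the deck):

```python
from collections import Counter

# Hypothetical historical churn labels: 1 = churned, 0 = stayed.
history = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# Baseline model: always predict the majority class.
majority_class = Counter(history).most_common(1)[0][0]

def baseline_predict(_customer):
    return majority_class

# Metric for the baseline: accuracy on the history (illustrative only).
accuracy = sum(baseline_predict(None) == y for y in history) / len(history)
print(f"Baseline predicts {majority_class}, accuracy {accuracy:.0%}")
# Any candidate model must beat this number before replacing the baseline.
```

Launching with such a baseline gets the end-to-end pipeline and metrics working first; iteration then improves the model inside a solution that already runs.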
• Empower ALL to perform like the BEST
• Automate repetitive human tasks
• Embed expert knowledge into the solution
Users don't trust black-box models
• How to interpret the model?
• Importance of features
• Bias in the model
• Interpreting predictions per instance
• What-if analysis
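Feature importance, one of the bullets above, can be estimated even for a black-box scorer by permutation: shuffle one feature's values and see how much the model's accuracy degrades. A minimal hand-rolled sketch (the model and data are hypothetical):

```python
import random

random.seed(0)

# Hypothetical black-box model: only feature 0 actually matters.
def model_score(row):
    return 1 if row[0] > 0.5 else 0

# Hypothetical labeled data.
data = [[random.random(), random.random()] for _ in range(200)]
labels = [1 if row[0] > 0.5 else 0 for row in data]

def accuracy(rows):
    return sum(model_score(r) == y for r, y in zip(rows, labels)) / len(labels)

base_acc = accuracy(data)

# Permutation importance: shuffle one column, measure the accuracy drop.
importances = []
for col in range(2):
    shuffled_col = [row[col] for row in data]
    random.shuffle(shuffled_col)
    permuted = [row[:] for row in data]
    for row, v in zip(permuted, shuffled_col):
        row[col] = v
    importances.append(base_acc - accuracy(permuted))

print(importances)  # feature 0 should matter far more than feature 1
```

The same idea powers per-instance what-if analysis: perturb one input of a single prediction and observe how the output moves.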
1. Learn from experiments
  • Why?
  • Both successes and failures
2. Share the learnings
3. Promote successful experiments to production
4. Move on to the next hypothesis to experiment with
• Failure is a valid outcome of an experiment
• Learn and refine the next experiment
A process specifies a detailed sequence of activities necessary to perform specific business tasks. It is used to standardize procedures and establish best practices.
Microsoft's Team Data Science Process
https://aka.ms/tdsp
• Standard project lifecycle
• Standardized document templates, project structure
• Shared, distributed resources
• Productivity tools, shared utilities
• Data science virtual machines (DSVMs) as the fundamental development platform on the cloud
• Use Visual Studio Team Services (VSTS)
  • Work item tracking and scrum planning
  • Git repositories
• Shared data science utilities in a Git repository
• Use cloud-based Azure resources as needed
• Terminology:
  • Feature: a project
  • Story: a stage in the E2E process of a DS project
  • Task: specific coding, documentation, or other activities needed to complete a story
  • Iteration: usually a 2-week sprint
[Diagram: ML lifecycle management (processes, templates, permissions). Data scientist side: IDE and training environment; training + test code runs against a data lake; models are trained and tested with CNTK/TF/scikit-learn/Keras/…, with continuous retraining driven by app telemetry; trained models are published to model storage. App developer side: IDE and app code under source control; CI/CD pipelines build and test the model + app; the model is embedded in the app and deployed to cloud services, apps, and edge devices, with A/B testing in production. Example model output: [{"cat": 0.99218, "feline": 0.81242, "puma": 0.45456}]]
Model Source Control
• Processes and procedures to make models reproducible (from source control to data retention policies)
• Make it easy to work on multiple models (consistent process)
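Reproducibility, as described above, means being able to tie a trained model back to the exact code, data, and parameters that produced it. A minimal sketch of one way to do this with content hashes (the file names and layout are assumptions for illustration, not TDSP prescriptions):

```python
import hashlib
import json

def fingerprint(training_code: str, data_rows: list, params: dict) -> str:
    """Hash everything that determines the trained model."""
    h = hashlib.sha256()
    h.update(training_code.encode())
    h.update(json.dumps(data_rows, sort_keys=True).encode())
    h.update(json.dumps(params, sort_keys=True).encode())
    return h.hexdigest()[:12]

code = "def train(data): ..."
data = [[1.0, 0], [2.0, 1]]
params = {"lr": 0.01, "epochs": 10}

tag = fingerprint(code, data, params)
print(f"model-{tag}")  # store the model artifact under this tag
# Re-running with identical inputs yields the same tag -> reproducible;
# any change to code, data, or params yields a new tag.
```

A consistent tagging scheme like this also makes it easy to work on multiple models side by side, since every artifact is unambiguously named.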
Model Validation
• Unit testing, functional testing, and performance testing
• Validation needs to be performed both in isolation and when embedded in an application
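A unit-level model test can be as small as asserting output range and expected behavior on known cases, run in isolation before app integration. A minimal sketch using a hypothetical failure-probability function (all names and logic are illustrative):

```python
# Hypothetical model under test: returns a failure probability.
def predict_failure_probability(sensor_readings):
    # Stand-in scoring logic for the sketch.
    return min(1.0, max(0.0, 0.1 + 0.05 * sum(sensor_readings)))

def validate_model(predict):
    """Unit-level checks run in isolation, before app integration."""
    # Range check: output must be a valid probability.
    for readings in ([0, 0], [10, 10], [-5, 0]):
        p = predict(readings)
        assert 0.0 <= p <= 1.0, f"probability out of range: {p}"
    # Sanity check on a known case: worse readings -> higher risk.
    assert predict([10, 10]) >= predict([0, 0])
    return True

print(validate_model(predict_failure_probability))  # True when all checks pass
```

The same checks would then be rerun with the model embedded in the application, where serialization and interface mismatches can introduce new failures.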
Model Versioning & Storage
• Provide a consistent way to store & share models, plus a way to track where models are embedded / running
• Provide a consistent model format
• Provide traceability on where a model came from (which data, which experiment, where's the code / notebook)
• Control who has access to which models
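The bullets above amount to a model registry: one record per model version carrying lineage and access metadata. A minimal in-memory sketch (the field names are assumptions for illustration):

```python
from datetime import datetime, timezone

# In-memory model registry: maps model name -> list of version records.
registry: dict[str, list[dict]] = {}

def register_model(name, model_bytes, data_ref, experiment_id, code_ref, owners):
    """Store a model version with the lineage needed for traceability."""
    versions = registry.setdefault(name, [])
    record = {
        "version": len(versions) + 1,
        "format": "pickle",             # one consistent model format
        "artifact": model_bytes,
        "data": data_ref,               # which data produced it
        "experiment": experiment_id,    # which experiment
        "code": code_ref,               # where the code / notebook lives
        "owners": owners,               # who may access it
        "deployed_to": [],              # track where the model is running
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    versions.append(record)
    return record

rec = register_model("churn-model", b"<bytes>", "sales_2016Q4.csv",
                     "exp-42", "notebooks/churn.ipynb", ["ds-team"])
rec["deployed_to"].append("scoring-service-west")
print(rec["version"], rec["deployed_to"])
```

A production registry would back this with durable storage and access control, but the record shape captures the traceability the slide calls for.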
Model Deployment
• Provide an efficient process to get a model built into an application or service and leveraged to light up an end-user scenario
• Simplify the process of interacting with the model (through code generation, API specifications / interfaces, or other methods)
• Support a variety of inferencing targets (cloud / app / edge), including FPGAs and dedicated frameworks like CoreML & WinML
• Provide secrets / service-endpoint management to remove friction from configuring the release process
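The "simplify interaction" bullet can be sketched as a thin scoring interface that hides the model behind a stable JSON contract, whatever the inferencing target. A hypothetical sketch (not an actual deployment tool; the model is a stub):

```python
import json

# Hypothetical trained model: tag probabilities for an input, in the
# spirit of the [{"cat": ..., "feline": ...}] output in the lifecycle diagram.
def model(features):
    return {"cat": 0.99, "feline": 0.81}

def score_endpoint(request_body: str) -> str:
    """Stable JSON contract between the app and the model.

    The app never touches the model directly; the same handler can sit
    behind a cloud service, an in-app call, or an edge runtime.
    """
    request = json.loads(request_body)
    predictions = model(request["features"])
    return json.dumps({"predictions": predictions})

response = score_endpoint('{"features": [0.1, 0.2, 0.3]}')
print(response)
```

Because only the JSON contract is exposed, the model behind it can be retrained or swapped without changing application code, which is what makes the release process low-friction.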
Building an Org's Toolbox
• Data exploration
• RFM (recency, frequency, monetary) user behavior modeling
• Hyperparameter tuning
• Auto-featurization
Note: domain expertise is still helpful
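RFM, listed above, scores each user on the Recency, Frequency, and Monetary value of their transactions. A minimal sketch over hypothetical purchase records (customer names, dates, and amounts are invented for illustration):

```python
from datetime import date

# Hypothetical purchase log: (customer, purchase_date, amount).
purchases = [
    ("alice", date(2016, 6, 1), 120.0),
    ("alice", date(2016, 6, 20), 80.0),
    ("bob",   date(2016, 1, 5), 300.0),
]
today = date(2016, 7, 1)

def rfm(customer):
    rows = [(d, amt) for c, d, amt in purchases if c == customer]
    recency = min((today - d).days for d, _ in rows)  # days since last purchase
    frequency = len(rows)                             # number of purchases
    monetary = sum(amt for _, amt in rows)            # total spend
    return recency, frequency, monetary

print(rfm("alice"))  # (11, 2, 200.0)
print(rfm("bob"))    # (178, 1, 300.0)
```

These three numbers are typically binned into scores and fed into segmentation or churn models, which is why RFM earns a place in a reusable toolbox.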