Data Science Project Lifecycle
Jason Geng @Data Application Lab
Miya Du @Data Science Association
Business Requirement
Data Acquisition
Data Preparation
Hypothesis & Modeling
Evaluation & Interpretation
Deployment
Operations
Optimization
Business Requirements
• Data scientists need to work with business people and domain experts to understand both the data and the business
• Specify the business requirements
• For instance, with healthcare data:
Understand the data:
e.g. ‘DISCWT’: ‘This is the discharge-level weight on the HCUP nationwide data to produce national estimates’
Understand the business:
Goal: Predict Readmission Rate
Database: Healthcare Readmissions Database
Modeling
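
To make the DISCWT description concrete, here is a minimal sketch of how a discharge-level weight could turn a sample into a national-style estimate. Only the field name DISCWT comes from the slide; the `readmitted` flag, the function name, and all values are made up for illustration, and real HCUP analyses may need additional steps.

```python
# Hypothetical discharge records. Only "DISCWT" is a field named on the slide;
# "readmitted" and every value below are invented for illustration.
discharges = [
    {"DISCWT": 3.0, "readmitted": 1},
    {"DISCWT": 1.0, "readmitted": 0},
    {"DISCWT": 1.0, "readmitted": 0},
    {"DISCWT": 1.0, "readmitted": 0},
]


def weighted_rate(rows, weight_key="DISCWT", flag_key="readmitted"):
    """Weighted share of flagged records: sum(w * flag) / sum(w)."""
    total_weight = sum(r[weight_key] for r in rows)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    weighted_hits = sum(r[weight_key] * r[flag_key] for r in rows)
    return weighted_hits / total_weight


# Sample rate ignores the weights; the weighted rate counts each record
# as many times as its weight says it represents.
unweighted = sum(r["readmitted"] for r in discharges) / len(discharges)
weighted = weighted_rate(discharges)
print(unweighted, weighted)
```

The two numbers differ because the single readmitted record carries three times the weight of the others, which is why understanding such a field matters before modeling.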
Data Collection
• Data from product line
• Purchase third party data
• Social media (Facebook, LinkedIn)
• Web crawling
• Open source data (Opendata, U.S. Census Data)
Challenge: Data Storage, Data Management
Legacy data, OLTP, Web Log, Web Crawler, Open Source, Third Party Data, Social Media Data
XML, CSV, LOG, SQL, …
Product Line, Business Intelligence, Data Science App
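
Since collected data arrives in several formats (XML, CSV, LOG), one possible approach is to parse each format into a common record shape before storing it. The sketch below assumes tiny inline samples; the field names `user_id` and `amount`, the log layout, and the function names are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented samples standing in for three differently formatted sources.
csv_text = "user_id,amount\n1,9.99\n2,24.50\n"
xml_text = "<orders><order user_id='3' amount='5.00'/></orders>"
log_text = "2017-02-16 GET /checkout user_id=4 amount=12.00\n"


def record(user_id, amount, source):
    """Common record shape shared by every parser."""
    return {"user_id": int(user_id), "amount": float(amount), "source": source}


def from_csv(text):
    reader = csv.DictReader(io.StringIO(text))
    return [record(row["user_id"], row["amount"], "csv") for row in reader]


def from_xml(text):
    root = ET.fromstring(text)
    return [record(o.get("user_id"), o.get("amount"), "xml")
            for o in root.iter("order")]


def from_log(text):
    records = []
    for line in text.splitlines():
        # Keep only key=value tokens; skip lines missing either field.
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        if "user_id" in fields and "amount" in fields:
            records.append(record(fields["user_id"], fields["amount"], "log"))
    return records


records = from_csv(csv_text) + from_xml(xml_text) + from_log(log_text)
for r in records:
    print(r)
```

Keeping the `source` field on each record makes it easier to trace a value back to where it was collected.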
Data Preparation (Data Wrangling)
• Cleaning data (semantic errors, missing entries, or inconsistent formatting)
• Challenge: data integration
• Takes about 80% of the time in a project workflow
Data Source A + Data Source B + Data Source B → ETL → Data Warehouse
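
As a rough illustration of cleaning plus integration, the sketch below runs a small extract-transform-load pass over two sources into an in-memory SQLite table that stands in for the warehouse. The sources, column names, the `customers` table, and the merge rule (a later source fills in a missing state) are assumptions chosen for the example, not something the slides specify.

```python
import sqlite3

# Two invented sources with inconsistent keys, casing, whitespace,
# and a missing entry.
source_a = [
    {"id": "1", "name": " Alice ", "state": "ca"},
    {"id": "2", "name": "Bob", "state": None},
]
source_b = [
    {"ID": 2, "NAME": "bob", "STATE": "NY"},
    {"ID": 3, "NAME": "Carol", "STATE": "tx "},
]


def transform(row):
    """Normalize key casing, types, whitespace, and letter case."""
    lowered = {key.lower(): value for key, value in row.items()}
    state = lowered.get("state")
    return (
        int(lowered["id"]),
        str(lowered["name"]).strip().title(),
        state.strip().upper() if state else None,
    )


# Integrate: one row per id; a missing state keeps the earlier value.
merged = {}
for row in source_a + source_b:
    cid, name, state = transform(row)
    previous = merged.get(cid)
    if previous is not None and state is None:
        state = previous[2]
    merged[cid] = (cid, name, state)

# Load into the stand-in warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, state TEXT)"
)
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)", list(merged.values())
)
warehouse.commit()

rows = warehouse.execute(
    "SELECT id, name, state FROM customers ORDER BY id"
).fetchall()
print(rows)
```

Most of the code is normalization and merge logic rather than loading, which is consistent with the point that preparation dominates project time.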
Feature Engineering
Select or create features
Research feature relevance
Experiment and validation
Change the feature set
Go back to feature selection step
Modeling
Reference Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/
Deploy to Product Line
Thank you!
https://www.DataAppLab.com
Feb 2017
PPT: Xiaolu Zhao @ Feb 16, 2017
