Engineering	the	Data	Science
Data is like gunpowder!
You can make a marvelous firework
OR
a dangerous weapon from it
Core	Team	Building
Lead Data Scientist
Understands the needs of stakeholders; identifies the KPIs, data sources, and capabilities
Data Engineer
Cleanses, provides, and performs ETL on the data
Data Scientist
Makes the data actionable: exploring the data, modeling, leveraging the models, and developing patterns
Application Developer
Controls enterprise open source libraries and versions; develops and leverages current APIs
Data Visualization Specialist
Coordinates with data scientists to design and implement KPIs
&
Summarizing	the	data	and	increasing	ROIs
Data Science Delivery Models
[Flowchart: Y/N decision questions routing to delivery models and time frames]
Questions: Data in Mart? User Developed Application? Requirements Uncertain? Existing Sandbox? Data for Operational? Data being captured? No Statistical Modeling needed?
Delivery models and outcomes: Information Delivery / Data Visualization; Self Service Analytics & Reporting; Self Service Discovery; Prototype / Discovery; Data Discovery and Advanced Analytics; Operational Reporting & Integration; Business Performance Reporting; Model Integration; New Subject Area / Source System Acquisition; Operational System Enhancements; Foundational Data for Analytics
Time frames: Hours/Days, Days/Weeks, Weeks/Months, Months
ROI / Competitive Advantage
Clarity deploys innovative technology and advanced analytics to deliver full life-cycle analytic solutions that deliver ROI and competitive advantage.
Competitive advantage requires robust scenario planning and prudent risk management
Regardless of shape or structure, Clarity can turn data into insight… enabling a deep understanding of your business trends
Clarity can embed systemic, forward-looking analytics into strategic decision and risk management processes to improve strategy and reduce risk
Clarity can operationalize predictive insights to drive proactive decisions and guide customer interactions in real time
Descriptive Analytics
Understand business performance
Discover patterns, explore trends, and establish relationships to explain past performance
Predictive Analytics
Identify risk and opportunity, guide decision making
Exploit patterns, trends, and relationships to predict the likely outcome of a future event or situation
Prescriptive Analytics
Prescribe action, understand options, manage risk
Synthesize models with business rules and constraints to understand the implications of available courses of action
From Hopeful to Thoughtful
Competitive advantage requires acting upon trends and insights
From Reactive to Proactive
Competitive advantage in today’s business environment requires a deep understanding of why, not just how much
From Data to Insight
Forward-thinking, data-enabled organizations outperform peers…
Key	Elements	of	Implementation	
Readable	and	Reusable	Models/Scripts
Easy	to	Migrate
How	to	Implement	
IT Strategy
Hypothesis-driven approach
Departmental integration and collaboration
Version	Control
Open	Source
Proprietary
Code	Check:	BAD	Code
dat_dropship<- read.csv("~/Projects/STB_dropships_Apr2015_csv.csv", colClasses = "character")
dat_dropship_march <- read.csv("~/Projects/STB_dropships_Mar2015_cvs.csv", colClasses =
"character")
dat_dropship_feb <- read.csv("~/Projects/STB_dropships_Feb2015_cvs.csv", colClasses =
"character")
dat_dropship_jan<- read.csv("~/Projects/STB_dropships_Jan2015_csv.csv", colClasses =
"character")
dat_pass <- read.csv("~/Projects/AprilRepairSTBlatest_circuit.csv", colClasses = "character")
dat_all_Apr_June <- read.csv("~/Projects/Apr-Jun2015STBRepair_allBItables.csv", colClasses =
"character")
dat_pass_Apr_June<-dat_all_Apr_June
# dat_pass_Apr_June<-subset(dat_all_Apr_June,dat_all_Apr_June$FIRST_TEST_RESULT_IND=="FAIL")
##All STBs Repair return type
dat_Repair_Apr_June<-subset(dat_all_Apr_June,dat_all_Apr_June$RETURN_TYPE=="Repair")
dat_Repair_Apr_June_US<-dat_Repair_Apr_June[grep("^US",
dat_Repair_Apr_June$RETURN_LOCATION),]
dat_Repair_Apr_June_US<-
subset(dat_Repair_Apr_June_US,dat_Repair_Apr_June_US$RETURN_AGENT!="V000000")
dat_Repair_Apr_June_US<-
subset(dat_Repair_Apr_June_US,dat_Repair_Apr_June_US$RETURN_AGENT!="AUTOSYNC")
dat_Repair_Apr_June_US<-
subset(dat_Repair_Apr_June_US,dat_Repair_Apr_June_US$RETURN_AGENT!="UNDOC")
Repair_Apr_Jun_Loc_all<-split(dat_Repair_Apr_June,dat_Repair_Apr_June$RETURN_LOCATION)
Locs_Repair_Apr_Jun<-sapply(Repair_Apr_Jun_Loc_all, nrow)
Repair_Apr_Jun_Loc<-split(dat_Repair_Apr_June_US,
paste(dat_Repair_Apr_June_US$RETURN_LOCATION,dat_Repair_Apr_June_US$RETURN_AGENT))
Locs_Agent_Repair_Apr_Jun<-sapply(Repair_Apr_Jun_Loc, nrow)
Code	Check:	Problem
- Nearly impossible to debug
- Endless copy and paste
- Hard to keep under version control
- Impossible to work on with a peer
Code	Check:	What	to	Do
- Version control on a shared platform
- Library Version Control
- Functional and Modular Coding
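
As a rough illustration of functional and modular coding, the copy-and-paste steps in the BAD Code slide could be folded into a few small functions. This is only a sketch under assumptions: the helper names (read_character_csv, filter_us_repairs, count_rows_by) and the demo data are hypothetical and not part of the original script; only the column names and excluded agent codes are taken from it.

```r
# Hypothetical helper: read one CSV with every column as character,
# mirroring the repeated read.csv(..., colClasses = "character") calls.
read_character_csv <- function(path) {
  read.csv(path, colClasses = "character", stringsAsFactors = FALSE)
}

# Hypothetical helper: keep "Repair" returns from US locations and drop
# the excluded agents in one place instead of three separate subset() calls.
filter_us_repairs <- function(dat,
                              excluded_agents = c("V000000", "AUTOSYNC", "UNDOC")) {
  is_repair <- dat$RETURN_TYPE == "Repair"
  is_us <- grepl("^US", dat$RETURN_LOCATION)
  is_allowed_agent <- !is.na(dat$RETURN_AGENT) &
    !(dat$RETURN_AGENT %in% excluded_agents)
  # which() treats NA as FALSE, the same way subset() does.
  dat[which(is_repair & is_us & is_allowed_agent), , drop = FALSE]
}

# Hypothetical helper: count rows per group, replacing the repeated
# split() + sapply(nrow) pairs. `columns` names the grouping columns.
count_rows_by <- function(dat, columns) {
  key <- do.call(paste, unname(as.list(dat[columns])))
  sapply(split(dat, key), nrow)
}

# Made-up demo data (not from the original files), just to exercise the helpers.
demo <- data.frame(
  RETURN_TYPE     = c("Repair", "Repair", "Exchange", "Repair"),
  RETURN_LOCATION = c("US-TX", "US-TX", "US-CA", "MX-01"),
  RETURN_AGENT    = c("A1", "UNDOC", "A2", "A3"),
  stringsAsFactors = FALSE
)

us_repairs <- filter_us_repairs(demo)
counts <- count_rows_by(us_repairs, c("RETURN_LOCATION", "RETURN_AGENT"))
print(counts)
```

With the logic in functions, each step can be tested on a small data frame, reviewed by a peer, and tracked in version control as a change to one function rather than to a wall of near-identical lines.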

Software Engineering for Data Scientists