Module 1: Introduction to Data Analytics and Life Cycle
CSC601.1: Comprehend basics of data analytics and
visualization.
10 Marks weightage
CONTENTS
● Data Analytics Lifecycle overview:
○ Key Roles for a Successful Analytics Project
○ Background and Overview of Data Analytics Lifecycle
● Phase 1: Discovery: Learning the Business Domain, Resources, Framing the
Problem, Identifying Key Stakeholders, Interviewing the Analytics Sponsor,
Developing Initial Hypotheses, Identifying Potential Data Sources
● Phase 2: Data Preparation: Preparing the Analytic Sandbox, Performing
ETLT, Learning About the Data, Data Conditioning, Survey and Visualize,
Common Tools for the Data Preparation Phase
CONTENTS
● Phase 3: Model Planning: Data Exploration and Variable Selection, Model
Selection, Common Tools for the Model Planning Phase
● Phase 4: Model Building: Common Tools for the Model Building Phase
● Phase 5: Communicate Results
● Phase 6: Operationalize
Current Analytical Architecture
1. For data sources to be loaded into the data warehouse, data needs to be well
understood, structured, and normalized with the appropriate data type definitions.
2. As a result of this level of control on the EDW, additional local systems may
emerge in the form of departmental warehouses and local data marts that
business users create to accommodate their need for flexible analysis.
3. Once in the data warehouse, data is read by additional applications across the
enterprise for BI and reporting purposes. These are high-priority operational
processes getting critical data feeds from the data warehouses and repositories.
4. At the end of this workflow, analysts get data provisioned for their downstream
analytics. Because users generally are not allowed to run custom or intensive
analytics on production databases, analysts create data extracts from the EDW to
analyze data offline in R or other local analytical tools.
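As a minimal, hedged sketch of that last point, the Python snippet below copies a read-only extract out of the warehouse for offline analysis; the SQLite file, table, and column names are illustrative stand-ins, not part of the original material.

import sqlite3
import pandas as pd

# SQLite stands in for the enterprise data warehouse; table and column names are made up.
edw = sqlite3.connect("warehouse.db")
edw.execute("CREATE TABLE IF NOT EXISTS orders (customer_id, order_date, order_total)")

query = "SELECT customer_id, order_date, order_total FROM orders WHERE order_date >= '2023-01-01'"
extract = pd.read_sql(query, edw)                   # data is copied out of the warehouse ...
extract.to_csv("orders_extract.csv", index=False)   # ... and analyzed offline, never on production
print(extract.head())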
Data evolution and the rise of Big Data sources
The data now comes from multiple sources, such as these:
● Medical information, such as genomic sequencing and diagnostic imaging
● Photos and video footage uploaded to the World Wide Web
● Video surveillance, such as the thousands of video cameras spread across a city
● Mobile devices, which provide geospatial location data of the users, as well as
metadata about text messages, phone calls, and application usage on
smartphones
● Smart devices, which provide sensor-based collection of information from smart
electric grids, smart buildings, and many other public and industry infrastructures
● Nontraditional IT devices, including the use of radio-frequency identification
(RFID) readers, GPS navigation systems, and seismic processing
Key roles for a successful analytics project
Business User: Someone who understands the domain area and usually benefits
from the results. This person can consult and advise the project team on the
context of the project, the value of the results, and how the outputs will be
operationalized.
Project Sponsor: Responsible for the genesis of the project. Provides the
impetus and requirements for the project and defines the core business problem.
Generally provides the funding and gauges the degree of value from the final
outputs of the working team.
Key roles for a successful analytics project
Project Manager: Ensures that key milestones and objectives are met on time
and at the expected quality.
Business Intelligence Analyst : Provides business domain expertise based on a
deep understanding of the data, key performance indicators (KPIs), key metrics,
and business intelligence from a reporting perspective. Business Intelligence
Analysts generally create dashboards and reports and have knowledge of the data
feeds and sources.
Key roles for a successful analytics project
Database Administrator (DBA): Provisions and configures the database
environment to support the analytics needs of the working team. These
responsibilities may include providing access to key databases or tables and
ensuring the appropriate security levels are in place related to the data
repositories.
Data Engineer: Leverages deep technical skills to assist with tuning SQL queries
for data management and data extraction, and provides support for data ingestion
into the analytic sandbox.
Analytics Process Best Practices
Scientific method: in use for centuries, still provides a solid framework for thinking about and
deconstructing problems into their principal parts. One of the most valuable ideas of the scientific method
relates to forming hypotheses and finding ways to test ideas.
CRISP-DM: provides useful input on ways to frame analytics problems and is a popular approach for data
mining.
Tom Davenport's DELTA framework : The DELTA framework offers an approach for data analytics
projects, including the context of the organization's skills, datasets, and leadership engagement.
Doug Hubbard's Applied Information Economics (AIE) approach: AIE provides a framework for
measuring intangibles and provides guidance on developing decision models, calibrating expert estimates,
and deriving the expected value of information.
MAD Skills by Cohen et al.: offers input for several of the techniques mentioned in Phases 2-4 that focus
on model planning, execution, and key findings.
Importance of Data Analytics Lifecycle
Data Analytics Lifecycle defines the roadmap of how data is generated, collected, processed, used, and analyzed to
achieve business goals. It offers a systematic way to manage data for converting it into information that can be used to fulfill
organizational and project goals. The process provides the direction and methods to extract information from the data and
proceed in the right direction to accomplish business goals.
Data professionals use the lifecycle's circular form to proceed with data analytics in either a forward or backward direction.
Based on the newly received insights, they can decide whether to proceed with their existing research or scrap it and redo
the complete analysis. The Data Analytics lifecycle guides them throughout this process.
Data Analytics Lifecycle Overview
Phase 1- Discovery: In Phase 1, the team learns the business domain, including
relevant history such as whether the organization or business unit has attempted
similar projects in the past from which they can learn. The team assesses the
resources available to support the project in terms of people, technology, time,
and data. Important activities in this phase include framing the business problem
as an analytics challenge that can be addressed in subsequent phases and
formulating initial hypotheses (IHs) to test and begin learning the data.
Data Analytics Lifecycle Overview
Phase 2- Data preparation: Phase 2 requires the presence of an analytic
sandbox, in which the team can work with data and perform analytics for the
duration of the project. The team needs to execute extract, load, and transform
(ELT) or extract, transform and load (ETL) to get data into the sandbox. The ELT
and ETL are sometimes abbreviated as ETLT. Data should be transformed in the
ETLT process so the team can work with it and analyze it. In this phase, the team
also needs to familiarize itself with the data thoroughly and take steps to condition
the data
Data Analytics Lifecycle Overview
Phase 3-Model planning: Phase 3 is model planning, where the team determines
the methods, techniques, and workflow it intends to follow for the subsequent
model building phase. The team explores the data to learn about the relationships
between variables and subsequently selects key variables and the most suitable
models.
Data Analytics Lifecycle Overview
Phase 4-Model building: In Phase 4, the team develops datasets for testing,
training, and production purposes. In addition, in this phase the team builds and
executes models based on the work done in the model planning phase. The team
also considers whether its existing tools will suffice for running the models, or if it
will need a more robust environment for executing models and work flows (for
example, fast hardware and parallel processing, if applicable).
Data Analytics Lifecycle Overview
Phase 5-Communicate results: In Phase 5, the team, in collaboration with major
stakeholders, determines if the results of the project are a success or a failure
based on the criteria developed in Phase 1. The team should identify key findings,
quantify the business value, and develop a narrative to summarize and convey
findings to stakeholders.
Phase 6- operationalize: In Phase 6, the team delivers final reports, briefings,
code, and technical documents. In addition, the team may run a pilot project to
implement the models in a production environment.
Phase 1: Discovery
1. Learning the Business Domain
Understanding the domain area of the problem is essential. In many cases, data scientists will have deep
computational and quantitative knowledge that can be broadly applied across many disciplines. An example of
this role would be someone with an advanced degree in applied mathematics or statistics.
2. Resources
As part of the discovery phase, the team needs to assess the resources available to support the project. In
this context, resources include technology, tools, systems, data, and people. During this scoping, consider the
available tools and technology the team will be using and the types of systems needed for later phases to
operationalize the models.
3. Framing the Problem
Framing is the process of stating the analytics problem to be solved. At this point, it is a best practice to write
down the problem statement and share it with the key stakeholders.
Phase 1: Discovery
4. Identifying Key Stakeholders
Another important step is to identify the key stakeholders and their interests in the
project. During these discussions, the team can identify the success criteria, key risks,
and stakeholders, which should include anyone who will benefit from the project or will
be significantly impacted by the project.
5. Interviewing the Analytics Sponsor
The team should plan to collaborate with the stakeholders to clarify and frame the
analytics problem. At the outset, project sponsors may have a predetermined solution
that may not necessarily realize the desired outcome. In these cases, the team must
use its knowledge and expertise to identify the true underlying problem and
appropriate solution.
Phase 1: Discovery
When interviewing the main stakeholders, the team needs to take time to
thoroughly interview the project sponsor, who tends to be the one funding the
project or providing the high-level requirements.
Phase 1: Discovery
Here are some tips for interviewing project sponsors:
• Prepare for the interview; draft questions, and review with colleagues.
• Use open-ended questions; avoid asking leading questions.
• Probe for details and pose follow-up questions.
• Avoid filling every silence in the conversation; give the other person time to think.
• Let the sponsors express their ideas and ask clarifying questions, such as "Why? Is that correct? Is this idea on target? Is there
anything else?"
• Use active listening techniques; repeat back what was heard to make sure the team heard it correctly, or reframe what was said.
• Try to avoid expressing the team's opinions, which can introduce bias; instead, focus on listening.
• Be mindful of the body language of the interviewers and stakeholders; use eye contact where appropriate, and be attentive.
• Minimize distractions.
• Document what the team heard, and review it with the sponsors.
Phase 1: Discovery
6. Developing Initial Hypotheses
Developing a set of IHs is a key facet of the discovery phase. This step involves
forming ideas that the team can test with data.
7. Identifying Potential Data Sources
As part of the discovery phase, identify the kinds of data the team will need to
solve the problem. Consider the volume, type, and time span of the data needed
to test the hypotheses.
Phase 1: Discovery
The team should perform five main activities during this step of the discovery
phase:
● Identify data sources
● Capture aggregate data sources
● Review the raw data
● Evaluate the data structures and tools needed
● Scope the sort of data infrastructure needed for this type of problem
Phase 2: Data Preparation
1. Preparing the Analytic Sandbox
The first subphase of data preparation requires the team to obtain an analytic sandbox (also
commonly referred to as a workspace), in which the team can explore the data without interfering
with live production databases.
2. Performing ETLT
In ETL, data is transformed before being loaded into the sandbox; in ELT, raw data is loaded
first and transformed inside the sandbox, which preserves the raw records and their outliers.
For instance, consider an analysis for fraud detection on credit card usage. Many times, outliers in
this data population can represent higher-risk transactions that may be indicative of fraudulent
credit card activity. An application programming interface (API) is an increasingly popular way to
access a data source. Many websites and social network applications now provide APIs that offer
access to data to support a project or supplement the datasets with which a team is working. For
example, connecting to the Twitter API can enable a team to download millions of tweets to
perform a project for sentiment analysis on a product, a company, or an idea.
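A minimal sketch of the ETL-versus-ELT distinction described above, using pandas and an in-memory SQLite database as a stand-in for the analytic sandbox; the table names, amounts, and the 1000-unit business rule are assumptions made purely for illustration.

import sqlite3
import pandas as pd

# A tiny in-memory stand-in for the raw credit card extract (values are made up).
raw = pd.DataFrame({
    "txn_id": [1, 2, 3, 4],
    "amount": [25.0, 40.0, 9500.0, 18.0],   # one outlier that could signal fraud
})
sandbox = sqlite3.connect(":memory:")        # stand-in for the analytic sandbox

# ETL: transform first (a business rule drops extreme amounts), then load.
raw[raw["amount"] < 1000].to_sql("transactions_etl", sandbox, index=False)

# ELT: load the raw data untouched, then transform inside the sandbox.
# Keeping the raw rows preserves the outliers that matter for fraud detection.
raw.to_sql("transactions_raw", sandbox, index=False)
flagged = pd.read_sql("SELECT *, amount > 1000 AS high_risk FROM transactions_raw", sandbox)
print(flagged)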
Phase 2: Data Preparation
3. Learning About the Data
Doing this activity accomplishes several goals:
● Clarifies the data that the data science team has access to at the start of the
project.
● Highlights gaps by identifying datasets within an organization that the team
may find useful but may not be accessible to the team today.
● Identifies datasets outside the organization that may be useful to obtain,
through open APIs, data sharing, or purchasing data to supplement already
existing datasets
Phase 2: Data Preparation
4. Data Conditioning
Data conditioning refers to the process of cleaning data, normalizing datasets, and
performing transformations on the data. A critical step within the Data Analytics
Lifecycle, data conditioning can involve many complex steps to join or merge data
sets or otherwise get datasets into a state that enables analysis in further phases.
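The sketch below illustrates the three data conditioning activities just mentioned (cleaning, normalizing, and merging) on two made-up tables; all column names and values are hypothetical.

import pandas as pd

# Made-up inputs standing in for two datasets that need conditioning before analysis.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, None, 4],
    "region": ["N", "S", "S", "E", "W"],
})
incomes = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "income": ["52000", "61000", "not reported"],   # mixed types where numbers are expected
})

# Clean: drop duplicate rows and rows missing the join key.
customers = customers.drop_duplicates().dropna(subset=["customer_id"])
customers["customer_id"] = customers["customer_id"].astype(int)   # consistent key dtype

# Normalize: coerce income to numeric and scale it to [0, 1].
incomes["income"] = pd.to_numeric(incomes["income"], errors="coerce")
rng = incomes["income"].max() - incomes["income"].min()
incomes["income_scaled"] = (incomes["income"] - incomes["income"].min()) / rng

# Merge: join the conditioned datasets into one analysis-ready table.
conditioned = customers.merge(incomes, on="customer_id", how="inner")
print(conditioned)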
Phase 2: Data Preparation
Additional questions and considerations for the data conditioning step include these.
● What are the data sources? What are the target fields (for example, columns of the tables)?
● How clean is the data?
● How consistent are the contents and files? Determine to what degree the data contains
missing or inconsistent values and if the data contains values deviating from normal.
● Assess the consistency of the data types. For instance, if the team expects certain data to be
numeric, confirm it is numeric or if it is a mixture of alphanumeric strings and text.
● Review the content of data columns or other inputs, and check to ensure they make sense.
For instance, if the project involves analyzing income levels, preview the data to confirm that
the income values are positive or if it is acceptable to have zeros or negative values.
● Look for any evidence of systematic error.
Phase 2: Data Preparation
5. Survey and Visualize
After the team has collected and obtained at least some of the datasets needed
for the subsequent analysis, a useful step is to leverage data visualization tools to
gain an overview of the data. Seeing high-level patterns in the data enables one to
understand characteristics about the data very quickly.
One example is using data visualization to examine data quality, such as whether
the data contains many unexpected values or other indicators of dirty data.
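A small sketch of this survey-and-visualize step using pandas and matplotlib on synthetic data; the columns and distributions are invented purely to show the kind of quick profiling described above.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for a dataset pulled into the sandbox.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "order_total": rng.lognormal(3, 1, 500),
    "items": rng.integers(1, 10, 500),
})

print(df.describe())          # quick profile: ranges, quartiles, counts
print(df.isna().mean())       # share of missing values per column

df.hist(figsize=(8, 4))       # high-level view of each distribution; dirty data often
plt.tight_layout()            # shows up as impossible values or unexpected spikes
plt.savefig("data_overview.png")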
Phase 2: Data Preparation
When pursuing this approach with a data visualization tool or statistical package, the following guidelines
and considerations are recommended.
A. Review data to ensure that calculations remained consistent within columns or across tables for a
given data field. For instance, did customer lifetime value change at some point in the middle of data
collection? Or if working with financials, did the interest calculation change from simple to compound
at the end of the year?
B. Does the data distribution stay consistent over all the data? If not, what kinds of actions should be
taken to address this problem?
C. Assess the granularity of the data, the range of values, and the level of aggregation of the data.
Phase 2: Data Preparation
D. Does the data represent the population of interest? For marketing data, if the project is focused on targeting
customers of child-rearing age, does the data represent that, or is it full of senior citizens and teenagers?
E. For time-related variables, are the measurements daily, weekly, monthly? Is that good enough? Is
time measured in seconds everywhere? Or is it in milliseconds in some places? Determine the level of
granularity of the data needed for the analysis, and assess whether the current level of timestamps
on the data meets that need.
F. Is the data standardized/normalized? Are the scales consistent? If not, how consistent or irregular is
the data?
G. For geospatial datasets, are state or country abbreviations consistent across the data? Are personal
names normalized? English units? Metric units?
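The following sketch, on synthetic data, illustrates checks for points B and E above: whether the distribution stays consistent over time and what granularity the timestamps actually carry. The column names, dates, and values are assumptions for illustration.

import numpy as np
import pandas as pd

# Synthetic daily order data standing in for the team's time-stamped extract.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "order_date": pd.date_range("2023-01-01", periods=180, freq="D"),
    "order_total": rng.normal(100, 15, 180),
})

# Point B: does the distribution stay consistent over all the data?
monthly = df.groupby(df["order_date"].dt.to_period("M"))["order_total"].agg(["count", "mean", "std"])
print(monthly)                # a drifting mean or std suggests something changed mid-collection

# Point E: what granularity do the timestamps actually carry?
daily_only = df["order_date"].dt.normalize().eq(df["order_date"]).all()
print("timestamps are date-only:", daily_only)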
Phase 2: Data Preparation
Several tools are commonly used for this phase:
1. Hadoop
2. Alpine Miner
3. OpenRefine
4. Data Wrangler
Phase 3: Model Planning
● After mapping out your business goals and collecting a glut of data (structured, unstructured,
or semi-structured), it is time to build a model that utilizes the data to achieve the goal. This
phase of the data analytics process is known as model planning.
There are several techniques available to load data into the system:
● ETL (Extract, Transform, and Load) transforms the data first using a set of business rules,
before loading it into a sandbox.
● ELT (Extract, Load, and Transform) first loads raw data into the sandbox and then transforms
it.
● ETLT (Extract, Transform, Load, Transform) is a mixture; it has two transformation levels.
Phase 3: Model Planning
This step also includes working as a team to determine the methods, techniques, and
workflow to be used for building the model in the subsequent phase. Model planning starts
with identifying the relationships between data points in order to select the key variables
and, eventually, the most suitable model.
The team develops datasets for testing, training, and production. In the later phases, the
team builds and executes the models that were planned in this stage.
Phase 3: Model Planning
1. Data Exploration and Variable Selection
In Phase 3, the objective of the data exploration is to understand the relationships
among the variables to inform selection of the variables and methods and to
understand the problem domain.
2. Model Selection
In the model selection subphase, the team's main goal is to choose an analytical
technique, or a short list of candidate techniques, based on the end goal of the
project.
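A deliberately naive sketch of data exploration and variable selection using a correlation matrix; the synthetic table, the "churned" outcome, and the 0.1 cutoff are illustrative assumptions, not a prescribed method.

import numpy as np
import pandas as pd

# Synthetic modeling table; "churned" is an assumed binary outcome, not from the slides.
rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, n),
    "monthly_spend": rng.normal(80, 20, n),
    "support_calls": rng.poisson(2, n),
})
df["churned"] = (df["support_calls"] > 3).astype(int)   # fabricated relationship for illustration

# Explore pairwise relationships among the variables.
corr = df.corr()
print(corr["churned"].sort_values(ascending=False))

# Naive shortlist of key variables: predictors most correlated with the outcome.
key_vars = corr["churned"].drop("churned").abs().sort_values(ascending=False)
print("Candidate key variables:", key_vars[key_vars > 0.1].index.tolist())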
Phase 3: Model Planning
Common Tools for the Model Planning Phase
1. R has a complete set of modeling capabilities and provides a good
environment for building interpretive models with high-quality code.
2. SQL Analysis Services can perform in-database analytics of common data
mining functions, involved aggregations, and basic predictive models.
3. SAS/ACCESS provides integration between SAS and the analytics sandbox
via multiple data connectors such as ODBC, JDBC, and OLE DB.
Phase 4: Model Building
This step of the data analytics lifecycle comprises developing datasets for testing, training, and
production purposes. The data analytics experts meticulously build and operate the model that they
designed in the previous step, relying on techniques such as decision trees, regression (for example,
logistic regression), and neural networks to build and execute it. The experts also perform a trial run
of the model to observe how well it fits the datasets.
This trial run helps them determine whether their current tools are sufficient to execute the model or
whether a more robust environment is needed for it to work properly.
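A minimal trial run of the kind described above, using scikit-learn's logistic regression on synthetic data; the sample size and train/test split are arbitrary choices made only for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the team's training and testing datasets.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Trial run of one planned technique (logistic regression) to see how well it fits the data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))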
Questions to consider include these:
• Does the model appear valid and accurate on the test data?
• Does the model output/behavior make sense to the domain experts? That is, does it appear as if the
model is giving answers that make sense in this context?
Phase 4: Model Building
● Do the parameter values of the fitted model make sense in the context of the
domain?
● Is the model sufficiently accurate to meet the goal?
● Does the model avoid intolerable mistakes? Depending on the context, false positives
may be more serious or less serious than false negatives (a confusion-matrix check is
sketched after this list).
● Are more data or more inputs needed? Do any of the inputs need to be
transformed or eliminated?
● Will the kind of model chosen support the runtime requirements?
● Is a different form of the model required to address the business problem? If so, go
back to the model planning phase and revise the modeling approach.
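To make the false-positive versus false-negative point concrete, the sketch below computes a confusion matrix on hypothetical labels and predictions; all values are invented.

from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions from a test run; values are illustrative only.
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false positives:", fp, "false negatives:", fn)
# Whether fp or fn is the "intolerable" mistake depends on the business context,
# e.g. missed fraud (a false negative) versus a blocked legitimate transaction (a false positive).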
Phase 4: Model Building
Common Tools for the Model Building Phase
Commercial Tools:
● SAS Enterprise Miner
● SPSS Modeler (provided by IBM and now called IBM SPSS
Modeler)
● Matlab
● Alpine Miner
● STATISTICA
● Mathematica
Phase 4: Model Building
Free or Open Source tools:
● R and PL/R; PL/R is a procedural language for PostgreSQL with R.
Using this approach means that R commands can be executed
in-database.
● Octave
● WEKA
● Python
● SQL in-database implementations, such as MADlib
Phase 5: Communicate Results
The communication step starts with a collaboration with major stakeholders to
determine if the project results are a success or failure. The project team is
required to identify the key findings of the analysis, measure the business value
associated with the result, and produce a narrative to summarize and convey the
results to the stakeholders.
Phase 5: Communicate Results
● After executing the model, the team needs to compare the outcomes of
the modeling to the criteria established for success and failure.
● In Phase 5, the team considers how best to articulate the findings and
outcomes to the various team members and stakeholders, taking into
account caveats, assumptions, and any limitations of the results.
● Because the presentation is often circulated within an organization, it is
critical to articulate the results properly and position the findings in a way
that is appropriate for the audience
Phase 6: Operationalize
● In the final phase, the team communicates the benefits of the project more broadly and
sets up a pilot project to deploy the work in a controlled way before broadening the work to
a full enterprise or ecosystem of users.
● In Phase 4, the team scored the model in the analytics sandbox.
● Phase 6 represents the first time that most analytics teams approach deploying the new
analytical methods or models in a production environment.
EXAMPLE
Consider an example of a retail store chain that wants to optimize its products’ prices to boost its revenue.
The store chain has thousands of products over hundreds of outlets, making it a highly complex scenario.
Once you identify the store chain’s objective, you find the data you need, prepare it, and go through the
Data Analytics lifecycle process.
You observe different types of customers, such as ordinary retail customers and contractors who buy
in bulk. You suspect that treating these customer types differently could improve the solution, but you
do not have enough information to confirm this and need to discuss it with the client team.
In this case, you need to define the customer segments, find the relevant data, and conduct hypothesis
testing to check whether the different customer types affect the model results. Once you are satisfied
with the model results, you can deploy the model, integrate it into the business, and roll out the prices
you consider optimal across the chain's outlets.
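One hedged way to run the hypothesis test mentioned above is a two-sample t-test between customer segments; the weekly spend figures below are invented purely for illustration.

from scipy import stats

# Hypothetical weekly spend samples for two segments in the retail pricing example.
ordinary = [120, 95, 130, 110, 105, 125, 118, 98]
contractors = [400, 380, 420, 390, 410, 405, 398, 415]

# Welch's two-sample t-test: is the difference large enough to justify separate treatment?
t_stat, p_value = stats.ttest_ind(ordinary, contractors, equal_var=False)
print("p-value:", p_value)   # a small p-value supports modeling the segments separately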
CASE STUDY
EMC's Global Innovation Network and Analytics (GINA) team is a group of senior technologists
located in centers of excellence (COEs) around the world. This team's charter is to engage
employees across global COEs to drive innovation, research, and university partnerships. In 2012,
a newly hired director wanted to improve these activities and provide a mechanism to track and
analyze the related information. In addition, this team wanted to create more robust mechanisms
for capturing the results of its informal conversations with other thought leaders within EMC, in
academia, or in other organizations, which would later be mined for insights.
Global Innovation Network and Analytics (GINA)
⚫ The GINA case study provides an example of how
a team applied the Data Analytics Lifecycle to
analyze innovation data at EMC.
⚫ Innovation is typically a difficult concept to measure,
and this team wanted to look for ways to use advanced
analytical methods to identify key innovators within the
company.
⚫ GINA is a group of senior technologists located
in centers of excellence (COEs) around the
world.
Global Innovation Network and Analytics (GINA)
⚫ The GINA team thought its approach would
provide a means to share ideas globally and
increase knowledge sharing among GINA
members who may be separated geographically
⚫ It planned to create a data repository containing
both structured and unstructured data to
accomplish three main goals.
1. Store formal and informal data.
2. Track research from global technologists.
3. Mine the data for patterns and insights to improve the team's operations and processes.
Global Innovation Network and Analytics (GINA)
⚫ In the GINA project’s discovery phase, the
team began identifying data sources.
⚫ The following people were involved in this phase:
1. Business user, project sponsor, project
manager – Vice President from Office of
CTO
2. BI analyst – person from IT
3. Data engineer and DBA – people from IT
4. Data scientist – distinguished engineer
Global Innovation Network and Analytics (GINA)
⚫ The data for the project fell into two main categories:
1. Innovation Roadmap data
2. Minutes and notes representing innovation and research activity from around the world
⚫ Hypotheses
1. Descriptive analytics of what is currently happening to
spark further creativity, collaboration, and asset generation
2. Predictive analytics to advise executive management of
where it should be investing in the future.
Global Innovation Network and Analytics (GINA)
⚫ The IT department set up a new analytics sandbox to store
and experiment on the data.
⚫ The data scientists and data engineers began to notice
that certain data needed conditioning and normalization.
⚫ As the team explored the data, it quickly realized that if it
did not have data of sufficient quality or could not get
good quality data, it would not be able to perform the
subsequent steps in the lifecycle process.
⚫ It was important to determine what level of data quality and
cleanliness was sufficient for the project being undertaken.
Global Innovation Network and Analytics (GINA)
⚫ The team made a decision to initiate a longitudinal study to begin
tracking data points over time regarding people developing new
intellectual property.
⚫ The parameters related to the scope of the study included
the following considerations:
1. Identify the right milestones to achieve this goal.
2. Trace how people move ideas from each milestone toward the goal.
3. Once this is done, trace ideas that die, and trace others that reach the goal.
Compare the journeys of ideas that make it and those that do not.
4. Compare the times and the outcomes using a few different methods
(depending on how the data is collected and assembled). These could be as
simple as t-tests or perhaps involve different types of classification algorithms.
Global Innovation Network and Analytics (GINA)
⚫ The GINA team employed several analytical methods. This
included work by the data scientist using Natural Language
Processing (NLP) techniques on the textual descriptions of
the Innovation Roadmap ideas.
⚫ Social network analysis using R and RStudio
Global Innovation Network and Analytics (GINA)
⚫ The figure shows social graphs that portray the relationships
between idea submitters within GINA.
⚫ Each colour represents an innovator from a
different country.
⚫ The large dots with red circles around them represent hubs.
A hub represents a person with high connectivity and
a high “betweenness” score (a small betweenness computation is sketched below).
⚫ The team used Tableau software for data visualization and
exploration and used the Pivotal Greenplum database as
the main data repository and analytics engine.
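The slides state that the team performed social network analysis in R and RStudio; as an illustration only, the sketch below computes betweenness centrality on a toy network with Python's networkx to show how hubs could be identified. The names and collaboration edges are made up.

import networkx as nx

# Toy innovation network: an edge links two people who collaborated on an idea (made up).
G = nx.Graph([
    ("ana", "bo"), ("bo", "chen"), ("chen", "dee"),
    ("bo", "dee"), ("dee", "eli"), ("eli", "fay"),
])

betweenness = nx.betweenness_centrality(G)
hubs = [person for person, score in betweenness.items() if score > 0.3]
print("Likely hubs (high betweenness):", hubs)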
Global Innovation Network and Analytics (GINA)
⚫ This project was considered successful in
identifying boundary spanners and hidden
innovators.
⚫ The GINA project promoted knowledge sharing related to
innovation and researchers spanning multiple areas within
the company and outside of it. GINA also enabled EMC to
cultivate additional intellectual property that led to
additional research topics and provided opportunities to
forge relationships with universities for joint academic
research in the fields of Data Science and Big Data.
Global Innovation Network and Analytics (GINA)
⚫ The study was successful in identifying hidden
innovators
◦ Found high density of innovators in Cork, Ireland
⚫ The CTO office launched longitudinal studies
Global Innovation Network and Analytics (GINA)
⚫ Deployment was not really discussed
⚫ Key findings
◦ Need more data in future
◦ Some data were sensitive
◦ A parallel initiative needs to be created to
improve basic BI activities
◦ A mechanism is needed to continually
reevaluate the model after deployment
Advantages of Data Analytics Life Cycle
● Identification of Potential Risks
Businesses operate in high-risk environments and therefore need efficient risk management strategies
to deal with problems. Big data plays a large part in creating those risk management procedures and
strategies, and the data analytics lifecycle and its tools help minimize risk by informing complex
decisions about unforeseen events and potential threats.
● Reducing Cost
● Increasing Efficiency
IMPORTANT QUESTIONS
1. In which phase would the team expect to invest most of the project time? Why?
Where would the team expect to spend the least time?
2. What are the benefits of doing a pilot program before a full-scale rollout of a new
analytical methodology? Discuss this in the context of the mini case study.
3. What kinds of tools would be used in the following phases, and for which kinds of use
scenarios?
a. Phase 2: Data preparation
b. Phase 4: Model building
