WHAT IS DATA SCIENCE ?
BY
SHILPA KRISHNA
RESEARCH SCHOLAR
DATA SCIENCE PROCESS
1. DISCOVERY
2. DATA PREPARATION
3. MODEL PLANNING
4. MODEL BUILDING
5. OPERATION
6. COMMUNICATE RESULTS
DISCOVERY
 Discovery involves acquiring data from all the
identified internal and external sources that help
you answer the business question.
 The data can be :
1. Logs from web servers
2. Data gathered from social media
3. Census datasets
4. Data streamed from online sources using APIs
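As a rough illustration of the first source in the list, the sketch below turns raw web server log lines into records that can be analyzed. It assumes the logs follow the Common Log Format; the sample lines, the `LOG_PATTERN` name and the `parse_log` helper are hypothetical and not taken from the slides.

```python
import re
from collections import Counter

# Assumed format: Common Log Format, as written by many web servers.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log(lines):
    """Turn raw log lines into dictionaries, skipping lines that do not match."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            records.append(match.groupdict())
    return records

# Hypothetical sample lines (documentation-range IP addresses).
sample = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2023:13:56:01 +0000] "GET /missing HTTP/1.1" 404 153',
    'not a log line',
]
records = parse_log(sample)
status_counts = Counter(r["status"] for r in records)
```

Lines that do not match the assumed format are skipped rather than raising an error, which is one reasonable choice for noisy logs; a real project might prefer to count or report them.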
DATA PREPARATION
 Data can have many inconsistencies, such as
missing values, blank columns and incorrect data
formats, which need to be cleaned.
 You need to process, explore and condition
data before modeling.
 The cleaner your data, the better your
predictions.
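The three kinds of inconsistency named above can be sketched with a small cleaning routine. This is a minimal example under stated assumptions: the `RAW` sample, the column names and the choice to fill missing ages with the mean are all hypothetical, and other strategies (dropping rows, using the median) may suit a given dataset better.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw data: a missing value, an entirely blank column,
# and dates written in two different formats.
RAW = """id,age,notes,signup
1,34,,2021-03-05
2,,,05/04/2021
3,30,,2021-07-19
"""

def parse_date(text):
    """Normalize a date string to ISO format; return None if unrecognized."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean(raw_text):
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    # Drop columns that are blank in every row.
    keep = [c for c in rows[0] if any(r[c].strip() for r in rows)]
    # Fill missing ages with the mean of the known ages (one simple choice).
    ages = [int(r["age"]) for r in rows if r["age"].strip()]
    mean_age = round(sum(ages) / len(ages))
    cleaned = []
    for r in rows:
        row = {c: r[c].strip() for c in keep}
        row["age"] = int(row["age"]) if row["age"] else mean_age
        row["signup"] = parse_date(row["signup"])
        cleaned.append(row)
    return cleaned

cleaned_rows = clean(RAW)
```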
MODEL PLANNING
 In this stage, you need to determine the
methods and techniques used to draw out the
relationships between input variables.
 Model planning is performed using
statistical formulas and tools such as
SQL Analysis Services, R and SAS/ACCESS.
MODEL BUILDING
 The data scientist splits the dataset into
training and testing sets.
 Techniques like association, classification
and clustering are applied to the training
dataset.
 The model, once prepared, is tested
against the “testing” dataset.
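The split-then-test idea can be sketched in plain Python. The example below is only an illustration: the data is synthetic, the helper names are hypothetical, and a nearest-centroid classifier is used simply because it is short, not because the slides prescribe it.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_centroids(train):
    """Nearest-centroid classifier: store the mean feature vector per label."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(features, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

# Hypothetical, synthetic data: two well-separated groups of 2-D points.
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(40)]
data += [((rng.gauss(10, 1), rng.gauss(10, 1)), "b") for _ in range(40)]

train, test = train_test_split(data)
model = fit_centroids(train)
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
```

The model only ever sees `train` while fitting; `test` is held back so that the accuracy reflects data the model has not seen.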
OPERATIONALIZE
 You deliver the final baselined model with
reports, code and technical documents.
 The model is deployed into a real-time
production environment after thorough
testing.
COMMUNICATE RESULTS
 The key findings are communicated to all
stakeholders.
 This helps you decide whether the results of
the project are a success or a failure
based on the inputs from the model.
THE MOST PROMINENT DATA SCIENCE JOB TITLES ARE :
1) Data scientist
2) Data engineer
3) Data analyst
4) Statistician
5) Data administrator
6) Business analyst
Data Scientist
ROLE
 A professional who manages enormous
amounts of data to come up with compelling
business visions by using various tools,
techniques, methodologies, algorithms, etc.
LANGUAGES
 R
 SAS
 PYTHON
 SQL
 HIVE
 MATLAB
 PIG
 SPARK
Data Engineer
ROLE
 Works with large amounts of data, and
develops, constructs, tests and maintains
architectures such as large-scale
processing systems and databases.
LANGUAGES
 SQL
 HIVE
 R
 SAS
 MATLAB
 PYTHON
 JAVA
 RUBY
 C++
 PERL
Data Analyst
ROLE
 Responsible for mining vast amounts of
data and looking for relationships,
patterns and trends in the data.
 Later delivers compelling reporting and
visualization for analyzing the data, to
support the most viable business decisions.
LANGUAGES
 R
 PYTHON
 HTML
 JS
 C
 C++
 SQL
Statistician
ROLE
 Collects, analyzes and interprets
qualitative and quantitative data by
using statistical theories and methods.
LANGUAGES
 SQL
 R
 MATLAB
 TABLEAU
 PYTHON
 PERL
 SPARK
 HIVE
Data Administrator
ROLE
 Ensures that the database is accessible
to all relevant users, that it is
performing correctly and that it is
kept safe from hacking.
LANGUAGES
 RUBY on Rails
 SQL
 JAVA
 C#
 PYTHON
Business Analyst
ROLE
 This professional improves business
processes and acts as an intermediary
between the business executive team
and the IT department.
LANGUAGES
 SQL
 TABLEAU
 POWER BI
 PYTHON
DEFINE THE GOAL
 Define a measurable and quantifiable goal
 Goal should be specific and precise
 The goal is to come up with candidate
hypotheses. These hypotheses can then be
turned into concrete questions or goals for
a full-scale modeling project.
COLLECT AND MANAGE DATA
 Time-consuming step
 Conduct initial exploration and
visualization of the data
 Clean data: repair data errors and
transform variables as needed
BUILD THE MODEL
Most common data science modeling tasks are
 Classification
 Scoring
 Ranking
 Clustering
 Finding relations
 Characterization
EVALUATE AND CRITIQUE MODEL
Once you have a model, you need to
determine if it meets your goals :
 Is it accurate enough for your needs ?
 Does it perform better than the obvious
guess ?
 Do the results of the model make sense in
the context of the problem domain ?
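The second question, whether the model beats the obvious guess, can be checked by comparing it with a majority-class baseline. The sketch below assumes a classification task; the labels and predictions are hypothetical and exist only to show the comparison.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def majority_baseline(train_labels, n):
    """The 'obvious guess': always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n

# Hypothetical labels and model predictions, for illustration only.
train_labels = ["no"] * 8 + ["yes"] * 2
actual       = ["no", "no", "yes", "no", "yes", "no", "no", "no", "yes", "no"]
model_preds  = ["no", "no", "yes", "no", "no",  "no", "no", "no", "yes", "no"]

baseline_acc = accuracy(majority_baseline(train_labels, len(actual)), actual)
model_acc = accuracy(model_preds, actual)
beats_baseline = model_acc > baseline_acc
```

On imbalanced data the baseline can already score well, so a model's accuracy is only meaningful relative to it.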
PRESENT RESULTS AND DOCUMENT
 Present results to your project sponsor
and other stakeholders.
 Document the model for those in the
organization who are responsible for
using, running and maintaining the model
once it has been deployed.
DEPLOY MODEL
 Make sure that the model can be updated
as its environment changes.
 The model should initially be deployed in a
small pilot program.
Several ways of gathering data for
analysis are :
 CSV FILE
 FLAT FILE (tab, space or any other separator)
 TEXT FILE (a single file, read all at once or line by line)
 ZIP FILE
 APIs (JSON)
 MULTIPLE TEXT FILES (data is split over multiple text files)
 DOWNLOADED FILE FROM THE INTERNET (file hosted on a server)
 WEBPAGE (scraping)
 RDBMS (SQL tables)
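Several of these sources can be read with the Python standard library alone. The sketch below uses small in-memory strings in place of real files and API responses so that it runs anywhere; the helper names and sample values are hypothetical.

```python
import csv
import io
import json
import zipfile

def read_delimited(text, delimiter=","):
    """Parse CSV or any other flat file; only the separator changes."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

def read_lines(stream):
    """Read a text source line by line instead of all at once."""
    return [line.rstrip("\n") for line in stream]

def read_zip_member(zip_bytes, name):
    """Extract one text file from a ZIP archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read(name).decode("utf-8")

# Hypothetical in-memory samples standing in for real files and API responses.
csv_rows = read_delimited("name,value\nalpha,1\n")
tsv_rows = read_delimited("name\tvalue\nbeta,2\n".replace(",", "\t"), delimiter="\t")
api_rows = json.loads('[{"name": "gamma", "value": 3}]')  # JSON as returned by an API
text_lines = read_lines(io.StringIO("first\nsecond\n"))

# Build a ZIP archive in memory, then read a CSV file back out of it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.csv", "name,value\ndelta,4\n")
zip_rows = read_delimited(read_zip_member(buffer.getvalue(), "data.csv"))
```

With real data, `io.StringIO(...)` would be replaced by an opened file, and data split over multiple text files could be handled by calling `read_delimited` once per file and concatenating the results.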
 A relational database stores data in tables;
each row of a table is called a record
 Connections among records are established
by using primary keys and foreign keys
 Allows users to establish defined
relationships between tables
 In an RDBMS, we use SQL statements to
retrieve and analyze data
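A minimal sketch of primary and foreign keys, using the `sqlite3` module from the Python standard library with an in-memory database. The `customers` and `orders` tables and their contents are hypothetical, chosen only to show how a key relationship links records across tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical schema: each order refers to one customer through a foreign key.
conn.executescript("""
CREATE TABLE customers (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount      REAL NOT NULL
);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 15.0), (3, 2, 40.0);
""")

# Join the two tables on the key relationship and aggregate per customer.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

# An order pointing at a non-existent customer violates the foreign key.
try:
    conn.execute("INSERT INTO orders VALUES (4, 99, 5.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

The join retrieves each customer's total by following `orders.customer_id` back to `customers.id`, and the rejected insert shows the database itself guarding the relationship.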
SOME COMMONLY USED PLOTS FOR EDA ARE :
 Histogram
 Scatter plots
 Maps
 Feature correlation plot (Heatmap)
 Time series plots
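Two of these plots are drawn from simple summaries of the data: a histogram from counts per bin, and a correlation heatmap from a matrix of pairwise correlations. The sketch below computes those numbers without a plotting library; the feature columns and helper names are hypothetical, and the drawing itself is left to whichever plotting tool is in use.

```python
from collections import Counter
from math import sqrt

def histogram(values, bin_width):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Pairwise correlations: the numbers a correlation heatmap would color."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical feature columns, for illustration only.
features = {
    "hours":  [1, 2, 3, 4, 5],
    "score":  [2, 4, 6, 8, 10],   # rises with hours
    "errors": [5, 4, 3, 2, 1],    # falls as hours rise
}
bins = histogram([1, 2, 2, 7, 8, 12], bin_width=5)
matrix = correlation_matrix(features)
```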
Data management platforms enable
organizations and enterprises to use data
analytics in beneficial ways, such as :
 Personalizing the customer experience
 Adding value to customer interactions
 Improving customer engagement
 Increasing customer loyalty
 Reaping the revenues associated with
data-driven marketing
 Identifying the root causes of marketing failures
and business issues in real time