Data in science

Sreejith Aravindakshan,
Consultant, CIMMYT and Wageningen University, Netherlands
A date with DATA:
Getting to know more about data analysis and models

“Data is the new oil”
• Data is a collection of facts, such as numbers, words, measurements, observations or
even just descriptions of things
• Data is all around us. But what exactly is it?
Data is a value assigned to a thing, such as its color, shape, number,
condition or size.
QUALITATIVE DATA: everything that refers to the
quality of something. A description of the colours, texture and
feel of an object, a description of experiences, and
interviews are all qualitative data.
QUANTITATIVE DATA: data that refers to a number,
e.g. the number of golf balls, their size, their price, or a score
on a test.

• Categorical data is qualitative in nature
• Numerical (quantitative) data, whether discrete or continuous, can also be interval or ratio data
• Interval data has ordered values with equal differences between them but lacks a true zero, e.g. temperature (in °C), pH.
• Ratio data also has ordered values with equal differences between them but has a true zero, e.g. height, weight.

Categorical Data: puts the item you are describing
into a category. For example, the condition of an item
is categorical, with categories such as
“new”, “used”, “broken” etc.
Discrete Data: numerical data that has gaps in
it, e.g. the count of golf balls. There can only be
whole numbers of golf balls (there is no such thing
as 0.3 golf balls).
Continuous Data: numerical data with a
continuous range, e.g. the size of the golf balls can be
any value (e.g. 10.55 mm or 10.61 mm but also
10.536 mm). In continuous data, all values are
possible with no gaps in between.
Primary Data
Secondary Data

What are the steps?
• Theory
• Hypothesis
• Research Design
• Sampling
• Data Collection
• Data Entry
• Data Cleaning
• Data storage

Sampling
Probability (Random)
Non-probability (purposive)
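
The difference between the two sampling approaches can be sketched in a few lines of Python. This is only an illustration: the household list, the sample size of 50 and the selection rule for the purposive sample are all assumptions made up for the example.

```python
import random

# Hypothetical sampling frame: household IDs 1..1000 (illustrative only).
households = list(range(1, 1001))

# Probability (random) sampling: every household has the same chance
# of being drawn. A fixed seed makes the draw reproducible.
rng = random.Random(42)
random_sample = rng.sample(households, k=50)  # without replacement

# Non-probability (purposive) sampling: the researcher chooses units by
# a criterion of their own, here (arbitrarily) the first 50 on the list.
purposive_sample = households[:50]

print(len(random_sample), len(purposive_sample))
```

Only the random sample supports the usual probability-based statements about precision; the purposive sample reflects the researcher's choice.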

Determining the ideal sample size
• From researchers’ experience
  * Can result in a wide confidence interval or measurement error
• Using some formula
  For instance, Cochran’s formula for sample size calculation:
  $$n_0 = \frac{Z^2 \, p \, q}{e^2}$$
  Where:
  * Z is the Z value for the chosen confidence level (e.g. 1.96 for 95%),
  * e is the desired level of precision (i.e. the margin of error),
  * p is the (estimated) proportion of the population which has the attribute in question,
  * q is 1 – p.

Example
• Suppose we are doing a study on the inhabitants of a large town or village, and want to
find out how many households serve breakfast in the mornings. We don’t have much
information on the subject to begin with, so we assume that half of the
households serve breakfast: this gives us maximum variability. So p = 0.5 and q = 0.5. Now let’s say we
want a 95% confidence level and a precision of ±5% (e = 0.05). A 95%
confidence level gives a Z value of 1.96 from the standard normal table, so we get
• n0 = ((1.96)^2 × 0.5 × 0.5) / (0.05)^2 = 384.16, which rounds up to 385.
• So a random sample of 385 households in our target population should be enough to
give us the confidence level we need.
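
The same calculation can be sketched in Python. The function name `cochran_sample_size` is made up for this illustration, and rounding up to the next whole household is an assumption that matches the example above.

```python
import math

def cochran_sample_size(z: float, p: float, e: float) -> int:
    """Cochran's n0 = Z^2 * p * q / e^2, rounded up to a whole unit."""
    q = 1.0 - p
    n0 = (z ** 2) * p * q / (e ** 2)
    return math.ceil(n0)

# Values from the breakfast example: 95% confidence, p = 0.5, e = 0.05.
n = cochran_sample_size(z=1.96, p=0.5, e=0.05)
print(n)  # 385
```

Trying other values of p shows why p = 0.5 is the cautious choice: it gives the largest p × q and therefore the largest sample size.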

Data accuracy vs. precision
[Figure: four panels labelled "Both accurate and precise", "Accurate but not precise", "Not accurate but precise" and "Neither accurate nor precise"]
• Accuracy refers to how close measurements are to the "true" value
• Precision refers to how close measurements are to each other
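
One way to put numbers on the two ideas is to measure accuracy as the distance of the mean from the true value and precision as the spread of repeated measurements. The sketch below assumes a known true value and uses invented readings from two hypothetical instruments.

```python
import statistics

true_value = 10.0  # hypothetical "true" value of the measured quantity

# Hypothetical repeated measurements from two instruments.
instrument_a = [10.1, 9.9, 10.0, 10.2, 9.8]    # centred on the true value
instrument_b = [12.0, 12.1, 11.9, 12.0, 12.0]  # tightly grouped but off target

def describe(measurements):
    # Accuracy: how far the average measurement is from the true value.
    bias = statistics.mean(measurements) - true_value
    # Precision: how close the measurements are to each other.
    spread = statistics.stdev(measurements)
    return bias, spread

bias_a, spread_a = describe(instrument_a)
bias_b, spread_b = describe(instrument_b)
print(f"A: bias={bias_a:.2f}, spread={spread_a:.2f}")
print(f"B: bias={bias_b:.2f}, spread={spread_b:.2f}")
```

Here instrument A is accurate but less precise, while instrument B is precise but not accurate.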

Independent Variable: The variable under
consideration in the study; the presumed cause
of the outcome (x).
Dependent Variable: The variable being
affected by the independent variable; the
effect measured in the study (y).
y = f(x)
Which is which here?

Principles of Data Collection
• Understand what types of data are required
• Collect only relevant data
• Determine methods of data collection
  * Survey/questionnaire
  * Observation, participatory
  * Focus groups
  * Standard instruments
  * Content analysis
  * Experiments/observations
  * Personal interviews
  * Literature search – meta-analysis

Principles of Data Collection cont.
• Where, who, how, and when to collect
* Research design
* Sampling procedure
* Prepare field work schedule/data plan
* Conduct preliminary (surveys) investigation
• Assess situation and prepare further strategies

• Enter the data in MS-Excel.
• Put the variable labels in the top row, one label per cell.
• Save the entered data as a .csv file from MS-Excel.
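
A file laid out this way is easy to read back in code. The sketch below uses Python's standard `csv` module; the column names and values are invented, and the text is held in a string only so the example is self-contained (a real file would be opened with `open(path, newline="")` instead).

```python
import csv
import io

# Hypothetical contents of a file saved from MS-Excel as .csv:
# the top row holds the variable labels, each later row is one observation.
csv_text = """farmer_id,farm_size_ha,yield_t_ha
1,0.8,3.2
2,1.5,4.1
3,0.4,2.7
"""

# DictReader takes the top row as keys, so each record is addressed by label.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Values arrive as text; numeric columns must be converted explicitly.
yields = [float(row["yield_t_ha"]) for row in rows]

print(rows[0]["farmer_id"], yields)
```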

Data analysis has been around for a while…
R.A. Fisher
Howard Dresner
Peter Luhn
W.E. Deming
Robert Gentleman
Ross Ihaka

Knowing your data
Descriptive/summary statistics: Mean, median, mode, standard deviation, frequencies, standard error
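
As a rough sketch, most of these summaries are available in Python's standard library. The data values are invented for illustration, and the standard error is computed here as the sample standard deviation divided by the square root of n.

```python
import math
import statistics
from collections import Counter

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only

mean = statistics.mean(data)
sd = statistics.stdev(data)        # sample standard deviation (divides by n - 1)
se = sd / math.sqrt(len(data))     # standard error of the mean
frequencies = Counter(data)        # how often each value occurs

print(f"mean={mean}, sd={sd:.3f}, se={se:.3f}")
print(frequencies.most_common(2))
```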

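Median
• Consider the set
• 1, 1, 2, 2, 3, 6, 7, 11, 11, 13, 14, 16, 19
• In this case there are 13 values (n is odd), so the median is the middle
value, found at position (n+1) / 2
• (13+1) / 2 = 7, so the median is the 7th value, which is 7
• 1, 1, 2, 2, 3, 6, 7, 11, 11, 13, 14, 16
• In the second case there are 12 values (n is even), so the median is the mean of the
two middle values on either side of position (n+1) / 2
• (12+1) / 2 = 6.5, so the median is the mean of the 6th and 7th values: (6+7) / 2 = 6.5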
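
The position rule can be written out as a small function. The helper name `median_by_position` is made up for this sketch; Python's built-in `statistics.median` gives the same answers and is the practical choice.

```python
import statistics

def median_by_position(values):
    """Median via the (n + 1) / 2 position rule on the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2               # 1-based position of the median
    if position.is_integer():            # odd n: a single middle value
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]   # value just below the position
    upper = ordered[int(position)]       # value just above the position
    return (lower + upper) / 2           # even n: mean of the two middle values

odd_set = [1, 1, 2, 2, 3, 6, 7, 11, 11, 13, 14, 16, 19]
even_set = [1, 1, 2, 2, 3, 6, 7, 11, 11, 13, 14, 16]

print(median_by_position(odd_set))   # 7
print(median_by_position(even_set))  # 6.5
```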

Mode
The most frequent value in a data set
• 1, 1, 1, 1, 2, 2, 3, 6, 11, 11, 11, 13, 14, 16, 19
• In this case the mode is 1 because it is the most common value (it occurs 4 times).
• This is a case of a unimodal distribution
• There may be cases with more than one mode, as in this data set:
• 1, 1, 1, 1, 2, 2, 3, 6, 11, 11, 11, 11, 13, 14, 16, 19
• In this case there are two modes (bimodal): 1 and 11, because both
occur 4 times in the data set.
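
Both cases can be checked with `statistics.multimode` from Python's standard library (available from Python 3.8), which returns every value tied for the highest count. This is a small sketch using the two data sets from this slide.

```python
import statistics

unimodal = [1, 1, 1, 1, 2, 2, 3, 6, 11, 11, 11, 13, 14, 16, 19]
bimodal = [1, 1, 1, 1, 2, 2, 3, 6, 11, 11, 11, 11, 13, 14, 16, 19]

# multimode lists all most-frequent values, in order of first appearance.
print(statistics.multimode(unimodal))  # [1]
print(statistics.multimode(bimodal))   # [1, 11]
```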

R is just super cool for data analytics

Visualizing my scientific career using data in R
R package “ggplot2” is amazing!!

Basic regression models
y = Dependent variable (Response variable)
x = Independent variable (Explanatory or predictor variable)
$\varepsilon$ = random error component
$\beta_0$ = intercept
$\beta_1$ = slope, i.e. the coefficient of $x$ in the simple linear model and of $x_1$ in the multiple regression model
Simple linear regression: $y = \beta_0 + \beta_1 x + \varepsilon$
Multiple regression: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
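
For the simple linear model, the intercept and slope can be estimated by ordinary least squares with the textbook closed-form formulas. The sketch below is only an illustration: the data are invented, and the function name `fit_simple_ols` is hypothetical.

```python
import statistics

# Hypothetical data lying close to the line y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

def fit_simple_ols(x, y):
    """Least-squares estimates of beta0 (intercept) and beta1 (slope)."""
    x_bar = statistics.mean(x)
    y_bar = statistics.mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx                # slope
    beta0 = y_bar - beta1 * x_bar    # intercept
    return beta0, beta1

beta0, beta1 = fit_simple_ols(x, y)

# Residuals estimate the random error component for each observation.
residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]

print(f"beta0={beta0:.2f}, beta1={beta1:.2f}")  # beta0=1.09, beta1=1.97
```

With an intercept in the model, the residuals sum to zero (up to rounding), which is a quick sanity check on the fit.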

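[Figure: diagram with points O, F, Y, X and I, comparing OLS regression, SFA (stochastic frontier analysis) and DEA (data envelopment analysis)]
Output Efficiency of F: FO/YO
Input Efficiency of F: XI/XF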

Statistical significance and p-value
Symbol   Meaning      Level of significance
ns       P > 0.05     Not applicable
*        P ≤ 0.05     At 5% level
**       P ≤ 0.01     At 1% level
***      P ≤ 0.001    At 0.1% level
****     P ≤ 0.0001   At 0.01% level
"p-value offers a first defense line against being fooled by randomness,
separating signal from noise"
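
The symbol convention can be expressed as a small lookup, checking the strictest threshold first. The function name `significance_symbol` is made up for this sketch, and the cut-offs are the P thresholds from the Meaning column of the table.

```python
def significance_symbol(p: float) -> str:
    """Map a p-value to the star notation used in the table."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p <= 0.0001:
        return "****"
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return "ns"

for p in (0.2, 0.03, 0.004, 0.0005, 0.00001):
    print(p, significance_symbol(p))
```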

What is bias?
Errors affecting validity. A systematic error (caused by the investigator or the subjects) that
causes an incorrect (over- or under-) estimate of an association.
• Chance (Random Error; Sampling Error)
• Bias (Systematic Errors [inaccuracies])
  * Selection bias
  * Loss to follow-up bias
  * Information bias
    - Nondifferential (e.g. simple misclassification)
    - Differential (e.g. recall bias, interviewer bias)
• Confounding (Imbalance in Other Factors): a situation in which the effects of two processes
are not separated.

A word of caution: “Interpretation can, however, be subjective”

I don’t have a strong opinion about SPSS, since I am not an avid user of it.

R or others – The fight is on
Many more documents found in Google Scholar still use
SPSS than R, while the reverse holds in Scopus.

What Is R?
• a programming “environment”
• object-oriented
• similar to S-Plus
• freeware
• provides calculations on matrices
• excellent graphics capabilities
• supported by a large user network

What is R Not?
• a statistics software package
• menu-driven
• quick to learn
• a program with a complex graphical interface

Installing R
• www.r-project.org/
• download from CRAN
• select a download site
• download the base package at a minimum
• download contributed packages as needed

Tutorials cont.
• Textbooks
  * The Art of R Programming by Norman Matloff
  * Hands-On Programming with R by Garrett Grolemund

Disclaimer: Many of the image files used in this presentation have been downloaded from the internet. Any copyright holders who are not
duly acknowledged here may contact me for proper citation.
Contact : sreejiagriman@gmail.com
