Descriptive Statistics
& Sampling Methods
Dr. Swati Patel (M.Sc,PhD)
Descriptive Statistics
• Descriptive statistics is a method of understanding and presenting
collected data according to its type, i.e. presenting the data in
tables and charts and summarizing it with measures of central
tendency and dispersion.
• Qualitative data can be presented as percentages or proportions,
whereas quantitative data can be presented or summarized using
measures of central tendency and dispersion.
Qualitative Data
• Qualitative data is based on narrative
information, not numerically ‘measurable’
information (e.g., “What does age 47 feel
like?”)
Example - Details of malaria cases: the study included a total of 578
positive cases of malaria in Surat in 2019. The record for each case
includes the patient's name, residence, month of case, age, sex,
co-morbidities (present or not), type of malaria, ...
• In this study the qualitative variables are Residence, Sex, Month,
Co-morbidity and Type of Malaria, whereas Age is a quantitative variable.
• Sex and Residence are binary (dichotomous) variables, whereas Month
and Type of Malaria are categorical.
• Qualitative variables can be represented as percentages or proportions.
Output (presentation of data) for the malaria data:
Table 1. Month-wise details of malaria cases
Month     No. of cases   %
June      134            23%
July      111            19%
August    123            21%
Sept      53             9%
Oct       39             7%
Nov       101            17%
Dec       17             3%
Table 2. Gender-wise distribution of malaria cases
Sex       No. of cases   %
Male      334            58%
Female    244            42%
* Tables like these are known as frequency tables.
Month is a qualitative (categorical) variable; graphically it can be
represented by either a simple bar chart or a pie chart.
Sex is a qualitative (binary) variable; graphically it can also be
represented by either a simple bar chart or a pie chart.
Similarly, we can prepare frequency tables for the other qualitative variables.
Table 3. Details of type of malaria and sex
Sex       Type of malaria
          Falciparum   Vivax   Mixed   Total
Male      122          114     98      334
Female    104          89      51      244
Total     226          203     149     578
• This table is known as a cross table. You can make one when you have two
qualitative variables, whether ordinal, nominal or binary.
* When we have two qualitative variables (here, type of malaria and sex),
they can be represented graphically by a multiple bar graph or a
subdivided bar graph.
[Figures: Distribution of malaria cases month-wise (number-wise presentation and %-wise presentation)]
Details of Type of Malaria and Sex
When we have two qualitative variables (i.e. type of malaria and sex),
we can use either a multiple bar diagram or a subdivided bar diagram.
[Figure: Multiple bar diagram]
[Figure: Subdivided bar diagram]
The graphs of the previous slide can also be prepared in this way:
[Figure]
Quantitative Data
Presentation
OR
Summarization
There are two measures to summarize
the quantitative data
•Measure of Central Tendency (Mean , Median , Mode)
•Measure of Dispersion( Range, Inter Quartile Range, Standard Deviation)
MEASURES OF CENTRAL TENDENCY
• MEAN: the average of the data
• MEDIAN: the middle value of the data
• MODE: the most commonly occurring value
Objectives:
•Students should know which of the measures of
central tendency(mode, median, mean) are
appropriate for the different types of variables.
•Students should be able to determine the mode,
median, or mean from data presented in a table or
graph.
Relation in between Measure of central tendency
MEAN
Mean = Sum of observations / Total no. of observations
UNGROUPED DATA: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
GROUPED DATA: $\bar{x} = \frac{\sum_i f_i x_i}{n}$, where $n = \sum_i f_i$
Ungrouped Data:-
Raw data, i.e. the individual recorded values of a quantitative variable
(weight, height, BMI, age, ...), is ungrouped data.
Birth weights of newborns:
3.3kg, 3.4kg, 3.3kg, 3.8kg, 2.7kg, 4.1kg, 3.4kg, 3.9kg, 5.1kg, 3.3kg, ...
Grouped Data:-
After collection, the raw data are categorized into groups and organized
in a frequency distribution (i.e. each value or class of the variable
with its frequency).
Discrete grouped data
Day of confinement   No. of patients
6                    5
7                    4
8                    4
9                    3
10                   2
Continuous grouped data
Age of persons at death   No. died (f)
0-5                       100
5-10                      40
10-15                     20
15-20                     25
20-25                     20
25-30                     23
30-35                     12
35-40                     15
Example of Ungrouped Data
Duration of sickness (in days) in 10 patients is given:-
9, 7, 8, 10, 73, 5, 6, 8, 9, 8
Mean = 143/10 = 14.3 days
Median = 8 days
The single extreme value (73) pulls the mean well above the median.
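The effect of the extreme value can be checked with a short sketch. It uses Python's standard `statistics` module; the variable name `days` is only illustrative.

```python
from statistics import mean, median

# Duration of sickness (in days) in 10 patients, as listed on the slide
days = [9, 7, 8, 10, 73, 5, 6, 8, 9, 8]

mean_days = mean(days)      # sum of observations / number of observations
median_days = median(days)  # middle value after sorting

print(mean_days, median_days)
```

Dropping the value 73 from the list brings the mean close to the median, which is why the median is preferred when outliers are present.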
Calculation of Mean for Ungrouped and Grouped Data
•Birth weight of newborns:
3.3kg, 6.1kg, 5.8kg, 3.8kg, 2.7kg, 4.1kg, 3.4kg, 3.9kg, 5.1kg, 3kg.
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
= 41.2/10
= 4.12 kg
•The tuberculin test reactions of 10 boys, measured in millimeters, are
arranged in ascending order. Find the mean size of reaction.
3, 5, 7, 7, 8, 8, 9, 10, 11, 12
Mean = (3+5+7+7+8+8+9+10+11+12)/10
= 80/10
= 8 mm
Example of Ungrouped Data
For Grouped data (Discrete)
Find the mean days of confinement after delivery in the following series:-
Day of confinement   No. of patients
6                    5
7                    4
8                    4
9                    3
10                   2
Solution:-
Day of confinement (x)   No. of patients (f)   f × x
6                        5                     30
7                        4                     28
8                        4                     32
9                        3                     27
10                       2                     20
Total                    18                    137
$\bar{x} = \frac{\sum_i f_i x_i}{n}$
= 137/18
= 7.61 days
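The grouped-mean calculation above can be sketched in a few lines of Python. The function name `grouped_mean` is a hypothetical helper, not part of any library.

```python
def grouped_mean(values, freqs):
    """Mean of discrete grouped data: sum(f * x) / sum(f)."""
    n = sum(freqs)  # total number of observations
    return sum(f * x for x, f in zip(values, freqs)) / n

days = [6, 7, 8, 9, 10]      # day of confinement (x)
patients = [5, 4, 4, 3, 2]   # no. of patients (f)

mean_days = grouped_mean(days, patients)
print(round(mean_days, 2))
```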
* Find the mean age at death (in years) in the following series:-
Age of persons at death   No. died (f)   Mid-point (x)
0-5                       100
5-10                      40
10-15                     20
15-20                     25
20-25                     20
25-30                     23
30-35                     12
35-40                     15
The mid-point (x) of each class has to be filled in first:
Mid-point = (Upper limit + Lower limit) / 2
For the class 0-5: lower limit = 0, upper limit = 5, mid-point = (5 + 0)/2 = 2.5
* With continuous grouped data, the very first step is to check whether the
class intervals are continuous. Discontinuous class intervals (e.g. 5-9, 10-14)
should be converted to continuous ones by subtracting 0.5 from each lower
limit and adding 0.5 to each upper limit (4.5-9.5, 9.5-14.5).
Age of persons at death   No. died (f)   Mid-point (x)   f × x
0-5                       100            2.5             250
5-10                      40             7.5             300
10-15                     20             12.5            250
15-20                     25             17.5            437.5
20-25                     20             22.5            450
25-30                     23             27.5            632.5
30-35                     12             32.5            390
35-40                     15             37.5            562.5
Sum                       255                            3272.5
$\bar{x} = \frac{\sum_i f_i x_i}{n}$, where $n = \sum_i f_i = 255$
= 3272.5/255
= 12.83 years
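For continuous grouped data the same idea applies once each class is replaced by its mid-point. The following is a minimal sketch using the age-at-death table; the variable names are illustrative.

```python
# (lower limit, upper limit) of each age class and the number who died (f)
classes = [(0, 5), (5, 10), (10, 15), (15, 20),
           (20, 25), (25, 30), (30, 35), (35, 40)]
freqs = [100, 40, 20, 25, 20, 23, 12, 15]

# Mid-point of each class = (upper limit + lower limit) / 2
midpoints = [(lo + hi) / 2 for lo, hi in classes]

n = sum(freqs)                                      # total deaths
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
mean_age = sum_fx / n

print(n, sum_fx, round(mean_age, 2))
```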
Median
The middle value in a set of data. Exactly half of the values lie below it,
and half lie above it.
The median can be used as a measure of central tendency for
ordinal or quantitative variables.
Perceptions of Importance of Flu Vaccination for Pregnant Women Visiting Prenatal Clinic
Flu vaccination        No. (%)
Not at all important   40 (10%)
Somewhat important     80 (20%)
Very important         280 (70%)
Total                  400 (100%)
There are 400 observations: the median lies between the 200th and 201st
observations, which fall in the “Very important” category.
For ungrouped data:-
Step-1 Arrange the data in ascending or descending order.
Step-2 If the total no. of observations ‘n’ is even, median = arithmetic mean of
the two middle observations, i.e. the (n/2)th and (n/2 + 1)th observations.
Step-3 If the total no. of observations ‘n’ is odd,
median = $\left(\frac{n+1}{2}\right)$th observation.
The number of patients that visited a doctor for consultation on 10 consecutive days
is arranged in increasing order below. Find the median number of patients that
visited the doctor per day.
8, 10, 12, 14, 16, 18, 19, 20, 22, 25.
n = 10 (even)
n/2 = 10/2 = 5th observation
n/2 + 1 = 6th observation
Median = (5th observation + 6th observation)/2
= (16 + 18)/2
= 17
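The even and odd cases of the median rule can be sketched as one small function. `median_of` is a hypothetical helper name; Python's built-in `statistics.median` follows the same rule.

```python
def median_of(values):
    """Median of ungrouped data using the even/odd rule."""
    ordered = sorted(values)            # Step 1: arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # odd n: the ((n + 1) / 2)th observation
        return ordered[mid]
    # even n: mean of the (n/2)th and (n/2 + 1)th observations
    return (ordered[mid - 1] + ordered[mid]) / 2

visits = [8, 10, 12, 14, 16, 18, 19, 20, 22, 25]   # n = 10, even
odd_series = [2, 3, 5, 1, 4, 5, 8]                 # n = 7, odd

print(median_of(visits), median_of(odd_series))
```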
Calculate the median for the following series :-
2, 3, 5, 1, 4, 5, 8
Arranged in ascending order: 1, 2, 3, 4, 5, 5, 8 (n = 7, odd)
Median = $\left(\frac{n+1}{2}\right)$th observation = 4th observation = 4
Median:-
Formula for continuous grouped data
$\text{Median} = l + \left(\frac{\frac{n}{2} - C}{f}\right) \times h$
l = lower limit of the class interval where the median occurs
f = frequency of the class where the median occurs
h = width of the median class
C = cumulative frequency of the class preceding the median class
For grouped Data:-
Calculate the median for the following data series:-
Class interval   Frequency
5-9              2
10-14            11
15-19            26
20-24            17
25-29            8
30-34            6
35-39            3
40-44            2
45-49            1
In the given example the class intervals are discontinuous, so first
convert them to continuous class intervals.
Solution:-
Class interval   Continuous class   Frequency (f)   Cumulative frequency (cf)
5-9              4.5-9.5            2               2
10-14            9.5-14.5           11              13
15-19            14.5-19.5          26              39   (median class)
20-24            19.5-24.5          17              56
25-29            24.5-29.5          8               64
30-34            29.5-34.5          6               70
35-39            34.5-39.5          3               73
40-44            39.5-44.5          2               75
45-49            44.5-49.5          1               76
n = 76
n/2 = 38
l = 14.5
h = 5
f = 26
C = 13
$\text{Median} = l + \left(\frac{\frac{n}{2} - C}{f}\right) \times h = 14.5 + \left(\frac{38 - 13}{26}\right) \times 5 = 19.31$
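The grouped-median formula can be sketched as a loop over the cumulative frequencies. This is a minimal sketch assuming equal class widths and continuous class boundaries; `grouped_median` is a hypothetical helper name.

```python
def grouped_median(lower_bounds, width, freqs):
    """Median = l + ((n/2 - C) / f) * h for continuous grouped data."""
    n = sum(freqs)
    cum = 0                       # C: cumulative frequency before the current class
    for l, f in zip(lower_bounds, freqs):
        if cum + f >= n / 2:      # first class whose cf reaches n/2 is the median class
            return l + (n / 2 - cum) / f * width
        cum += f

# Continuous lower boundaries 4.5, 9.5, ..., 44.5 with class width h = 5
lower_bounds = [4.5, 9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5, 44.5]
freqs = [2, 11, 26, 17, 8, 6, 3, 2, 1]

med = grouped_median(lower_bounds, 5, freqs)
print(round(med, 2))
```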
The mode can be obtained for any type of
variable, whether nominal, ordinal, or quantitative
(continuous or discrete).
However, the mode is the only measure of central
tendency that can be used for nominal data.
Mode
Mode for ungrouped data:-
2,2,3,4,6,7,4,4,4,4,8,9,0 mode is 4
10,10,3,3,4,2,1,6,7 mode is 10 and 3
10,34,23,12,11,3,4 no mode
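The three cases above (one mode, two modes, no mode) can be reproduced with a frequency count. This is a sketch; `modes` is a hypothetical helper that treats a series in which every value occurs once as having no mode, which is the convention used on this slide.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                  # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 2, 3, 4, 6, 7, 4, 4, 4, 4, 8, 9, 0]))
print(modes([10, 10, 3, 3, 4, 2, 1, 6, 7]))
print(modes([10, 34, 23, 12, 11, 3, 4]))
```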
- In some cases, for extraneous reasons, the interest is in identifying the value
which is most common.
- Consider the incubation period of measles. To be able to prevent disease in the
highest number of cases, it is desirable that the representative value is the
most common period of incubation. In this case the mode is the appropriate
measure of central tendency.
Example: the mode for a nominal variable
Chief complaints of a sample of patients presenting to the Emergency Department (n=614) in
November, 2008
Chief Complaint    Frequency   Percent
Chest Pain         183         29.8%
Trauma/Accident    137         22.3%
Belly Pain         98          16.0%
Childbirth         44          7.2%
Dyspnea            41          6.7%
Fever              39          6.3%
Other              72          11.7%
Total              614         100.0%
The mode is “Chest Pain”.
Mode:-
Formula for continuous grouped data
$\text{Mode} = l + \left(\frac{f_m - f_1}{2 f_m - f_1 - f_2}\right) \times c$
l = lower limit of the class where the mode occurs (modal class)
fm = maximum frequency, i.e. the frequency of the modal class
f1 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class
c = width of the class interval
For grouped data
Calculate the mode for the following frequency distribution:-
IQ Range   Frequency
90-100     11
100-110    27
110-120    36
120-130    38   (f1)
130-140    43   (fm)
140-150    28   (f2)
150-160    16
160-170    1
Modal class by inspection is 130-140
fm = 43
f1 = 38
f2 = 28
c = 10
l = 130
$\text{Mode} = l + \left(\frac{f_m - f_1}{2 f_m - f_1 - f_2}\right) \times c = 130 + \left(\frac{43 - 38}{86 - 38 - 28}\right) \times 10 = 130 + \frac{5}{20} \times 10$
= 132.5
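As a cross-check, the grouped-mode formula can be applied programmatically. This is a sketch assuming equal class widths; `grouped_mode` is a hypothetical helper, and a missing neighbouring class is treated as having frequency 0. With the listed values (fm = 43, f1 = 38, f2 = 28, c = 10, l = 130) the formula evaluates to 132.5.

```python
def grouped_mode(lower_bounds, width, freqs):
    """Mode = l + ((fm - f1) / (2*fm - f1 - f2)) * c for continuous grouped data."""
    m = max(range(len(freqs)), key=freqs.__getitem__)   # index of the modal class
    fm = freqs[m]
    f1 = freqs[m - 1] if m > 0 else 0                   # preceding class
    f2 = freqs[m + 1] if m < len(freqs) - 1 else 0      # succeeding class
    return lower_bounds[m] + (fm - f1) / (2 * fm - f1 - f2) * width

iq_lower = [90, 100, 110, 120, 130, 140, 150, 160]
iq_freqs = [11, 27, 36, 38, 43, 28, 16, 1]

mode_iq = grouped_mode(iq_lower, 10, iq_freqs)
print(mode_iq)
```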
The most important factor to consider is:
the level of measurement or type of variable
Nominal variables: the mode is the only appropriate measure of
central tendency.
Ordinal variables: both the mode and median are used.
Quantitative variables: mode, median, and mean can all be used.
The mean is usually the measure of choice because it is a unique
value, it uses all values in the data set, and it can be used in
subsequent analyses. However, the mean is sensitive to extreme
(high or low) values (outliers). If the data set contains extreme
values, use the median instead.
Measures of Dispersion (Variation)
MEASURES OF VARIATION
• RANGE
• QUARTILE DEVIATION
• MEAN DEVIATION
• STANDARD DEVIATION
• COEFFICIENT OF VARIATION
RANGE = MAX VALUE – MIN VALUE
Ex. Hb % per 100 cc of 15 persons was as follows. Calculate the range.
11.5, 13.8, 14.3, 11.7, 13.1, 14.5, 11.8, 14.0, 14.7, 12.5, 14.1, 14.8, 12.9, 14.2, 14.9
Step-1 Arrange the data in ascending or descending order.
Step-2 Range = highest value - lowest value = 14.9 - 11.5 = 3.4
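A minimal sketch of the range calculation for the Hb data; sorting is not strictly needed in code because `max` and `min` scan the whole list. The variable names are illustrative.

```python
hb = [11.5, 13.8, 14.3, 11.7, 13.1, 14.5, 11.8, 14.0,
      14.7, 12.5, 14.1, 14.8, 12.9, 14.2, 14.9]

# Range = highest value - lowest value (rounded to avoid float noise)
hb_range = round(max(hb) - min(hb), 1)
print(hb_range)
```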
Quartile Deviation:- In this method the series is divided into four equal parts or
quarters by three cut points, Q1, Q2 and Q3. The distance between the third quartile
and the first quartile (Q3 - Q1) is the inter-quartile range, and half of that
distance is the quartile deviation.
Lowest observation ... Q1 ... Q2 ... Q3 ... Highest observation
Q.D for ungrouped data:-
$Q = \frac{Q_3 - Q_1}{2}$
where Q = Q.D
Q3 = 3rd quartile
Q1 = 1st quartile
Find the Q.D for the following data series:-
8, 12, 13, 9, 11, 17, 23, 25, 20, 21, 27.
Step-1 Arrange the data in ascending order
8, 9, 11, 12, 13, 17, 20, 21, 23, 25, 27.
Step-2 Find the first and third quartiles
$Q_1 = \left(\frac{N+1}{4}\right)$th value and $Q_3 = \left(\frac{3(N+1)}{4}\right)$th value
Step-3 Q1 = (11+1)/4 = 3rd value = 11
Q3 = 3(11+1)/4 = 9th value = 23
Step-4 $Q = \frac{Q_3 - Q_1}{2}$
= (23 - 11)/2 = 6
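The quartile positions above can be sketched in code. The helper `value_at` is hypothetical; when a position is fractional it interpolates linearly between neighbouring values, which is an assumption (the slide's example only needs whole-number positions). The sketch assumes at least 3 observations.

```python
def value_at(ordered, pos):
    """Value at 1-based position pos in sorted data, interpolating if pos is fractional."""
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return ordered[lo - 1]
    return ordered[lo - 1] + frac * (ordered[lo] - ordered[lo - 1])

data = sorted([8, 12, 13, 9, 11, 17, 23, 25, 20, 21, 27])
n = len(data)

q1 = value_at(data, (n + 1) / 4)        # 3rd value
q3 = value_at(data, 3 * (n + 1) / 4)    # 9th value
qd = (q3 - q1) / 2                      # quartile deviation

print(q1, q3, qd)
```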
Q.D for grouped data (Discrete series) :-
Step-1 Tabulate the frequency and cumulative frequency (c.f) from the data given.
Step-2 Calculate the positions of the lower and upper quartiles using the formulas
$Q_1 = \left(\frac{N+1}{4}\right)$th value and $Q_3 = \left(\frac{3(N+1)}{4}\right)$th value
Step-3 Apply the values of Q1 and Q3 in the formula
$Q = \frac{Q_3 - Q_1}{2}$
Eg) The frequency distribution of height (in cm) of 387 students in a school is given
in the table below. Find the inter-quartile range and Q.D of the height distribution.
Step-1 Find the c.f
Height in cm   No. of students   c.f
150            28                28
152            40                68
154            52                120
156            100               220
158            60                280
160            48                328
162            32                360
164            20                380
166            7                 387
Step-2 Q1 = ((N+1)/4)th value = (388/4) = 97th student
Q3 = (3(N+1)/4)th value = 291st student
The 97th student is included in the 3rd group (c.f 120), having height 154 cm.
The 291st student is included in the 6th group (c.f 328), having height 160 cm.
Q1 = 154 cm, Q3 = 160 cm.
Inter-quartile range = Q3 - Q1 = 160 - 154 = 6 cm
Q.D = 6/2 = 3 cm
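The look-up in the cumulative frequency column can be sketched as follows. `value_at_rank` is a hypothetical helper that returns the first value whose cumulative frequency reaches the requested rank.

```python
def value_at_rank(values, freqs, rank):
    """Return the value whose cumulative frequency first reaches the given rank."""
    cum = 0
    for v, f in zip(values, freqs):
        cum += f
        if cum >= rank:
            return v

heights = [150, 152, 154, 156, 158, 160, 162, 164, 166]   # height in cm
students = [28, 40, 52, 100, 60, 48, 32, 20, 7]            # no. of students

n = sum(students)                                          # 387
q1 = value_at_rank(heights, students, (n + 1) / 4)         # 97th student
q3 = value_at_rank(heights, students, 3 * (n + 1) / 4)     # 291st student
iqr = q3 - q1
qd = iqr / 2

print(q1, q3, iqr, qd)
```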
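Q.D for grouped data (continuous series):-
$Q_1 = l + \left(\frac{\frac{N}{4} - cf}{f}\right) \times h$
$Q_3 = l + \left(\frac{\frac{3N}{4} - cf}{f}\right) \times h$
where, for the class in which the quartile lies, l = lower limit, f = frequency,
h = class width, and cf = cumulative frequency of the preceding class.
Apply the values of Q1 and Q3 in the formula
$Q = \frac{Q_3 - Q_1}{2}$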
Eg. The water percentage in the body of a species of fish and its
frequency are given in the table below. Calculate the Q.D.
Sr. no   Class interval   Continuous class   Frequency   c.f
1        16-20            15.5-20.5          4           4
2        21-25            20.5-25.5          3           7
3        26-30            25.5-30.5          8           15   (Q1 class)
4        31-35            30.5-35.5          9           24
5        36-40            35.5-40.5          14          38   (Q3 class)
6        41-45            40.5-45.5          3           41
7        46-50            45.5-50.5          3           44
8        51-55            50.5-55.5          2           46
9        56-60            55.5-60.5          2           48
10       61-65            60.5-65.5          2           50
N = 50
Position of Q1 = N/4 = 12.5
Position of Q3 = 3N/4 = 37.5
For Q1: l = lower limit of the class interval in which Q1 lies = 25.5, cf = 7, f = 8, h = 5
$Q_1 = l + \left(\frac{\frac{N}{4} - cf}{f}\right) \times h = 25.5 + \left(\frac{12.5 - 7}{8}\right) \times 5 = 28.94$
For Q3: l = lower limit of the class interval in which Q3 lies = 35.5, cf = 24, f = 14, h = 5
$Q_3 = l + \left(\frac{\frac{3N}{4} - cf}{f}\right) \times h = 35.5 + \left(\frac{37.5 - 24}{14}\right) \times 5 = 40.32$
$Q = \frac{Q_3 - Q_1}{2} = \frac{40.32 - 28.94}{2} = 5.69$
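The two quartile formulas differ only in the fraction of N they target, so one function can cover both. This is a sketch assuming equal class widths and continuous class boundaries; `grouped_quantile` is a hypothetical helper name. Note that the Q3 class 35.5-40.5 has lower limit 35.5.

```python
def grouped_quantile(lower_bounds, width, freqs, fraction):
    """l + ((fraction*N - cf) / f) * h for the class containing the target position."""
    n = sum(freqs)
    target = fraction * n
    cum = 0                         # cf of the preceding classes
    for l, f in zip(lower_bounds, freqs):
        if cum + f >= target:       # the quantile lies in this class
            return l + (target - cum) / f * width
        cum += f

lower_bounds = [15.5, 20.5, 25.5, 30.5, 35.5, 40.5, 45.5, 50.5, 55.5, 60.5]
freqs = [4, 3, 8, 9, 14, 3, 3, 2, 2, 2]

q1 = grouped_quantile(lower_bounds, 5, freqs, 0.25)
q3 = grouped_quantile(lower_bounds, 5, freqs, 0.75)
qd = (q3 - q1) / 2

print(round(q1, 2), round(q3, 2), round(qd, 2))
```

Calling the same function with `fraction=0.5` gives the grouped median.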
Mean deviation:- the mean of the absolute deviations of all observations in a
given set of data from an average (usually the mean).
M.D for ungrouped data :- $M.D = \frac{\sum |X - \bar{X}|}{N}$
Calculate the mean deviation from the following data :-
X: 15 17 19 25 30 35 48
Mean = 189/7 = 27
X     (X - mean) = deviation
15    -12
17    -10
19    -8
25    -2
30    3
35    8
48    21
Sum of absolute deviations = 64
$M.D = \frac{\sum |X - \bar{X}|}{N} = \frac{64}{7} = 9.14$
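A minimal sketch of the mean deviation for the data above, taking deviations from the arithmetic mean and ignoring their signs. The variable names are illustrative.

```python
x = [15, 17, 19, 25, 30, 35, 48]

mean_x = sum(x) / len(x)                        # 189 / 7 = 27
abs_dev = [abs(v - mean_x) for v in x]          # |X - mean| for each observation
mean_deviation = sum(abs_dev) / len(x)          # sum of absolute deviations / N

print(mean_x, sum(abs_dev), round(mean_deviation, 2))
```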
M.D for grouped data :-
Calculate the M.D for the given data series :-
Class-interval   Frequency
0-4              4
4-8              6
8-12             8
12-16            5
16-20            2
Solution:-
Class-interval   Frequency (f)   Mid-value (x)   f × x   |x - mean|   f × |x - mean|
0-4              4               2               8       7.2          28.8
4-8              6               6               36      3.2          19.2
8-12             8               10              80      0.8          6.4
12-16            5               14              70      4.8          24
16-20            2               18              36      8.8          17.6
Sum              25                              230                  96
Mean = $\bar{x} = \frac{\sum_i f_i x_i}{n}$ = 230/25 = 9.2
Sum of the products of each frequency and its absolute deviation from the mean = 96
$M.D = \frac{\sum_i f_i |X_i - \bar{X}|}{n}$ = 96/25 = 3.84
Standard deviation:- S.D is an important measure of dispersion.
•A large S.D shows that the measurements of the frequency distribution are
widely spread out from the mean,
e.g. 10 mm in the case of BP.
•A small S.D shows that the measurements of the frequency distribution are
closely clustered in the neighborhood of the mean,
e.g. 2 cm in the case of height.
•SD helps us to judge how far a given value is from the mean.
Use of SD:-
When populations are combined or when samples are combined, their
SDs are pooled after appropriate reasoning.
E.g. comparison of surgical treatment for lung cancer from two studies
may require knowledge of the respective SDs, which are subsequently
pooled.
Calculate SD for ungrouped data:-
$SD = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
For grouped data (discrete series):-
$SD = \sqrt{\frac{\sum f (X - \bar{X})^2}{n - 1}}$
For grouped data (continuous):-
$SD = \sqrt{\frac{\sum f (X - \bar{X})^2}{n}}$
OR
$SD = \sqrt{\frac{\sum f x^2}{n} - \left(\frac{\sum f x}{n}\right)^2}$
Ex. Find the SD, variance and SE of the ESR, found to be
3, 4, 5, 4, 2, 4, 5 and 3 in 8 normal individuals.
Step-1 Calculate the mean of x.
$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$ = 30/8 = 3.75
Step-2 Calculate the squared deviation of each observation from the mean and add them up.
X_i    (X_i - mean)^2
3      0.5625
4      0.0625
5      1.5625
4      0.0625
2      3.0625
4      0.0625
5      1.5625
3      0.5625
Sum 30   7.5
Step-3 $SD = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{7.5}{7}}$ = 1.035
Variance = SD^2 = 7.5/7 = 1.07
SE = SD/√n = 1.035/√8 = 0.37
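The ESR example can be checked with Python's standard `statistics` module, whose `stdev` and `variance` functions use the n - 1 denominator shown in the formula. The standard error is taken here as SD/√n, the usual standard error of the mean; that definition is an assumption, since the slide does not spell it out.

```python
from math import sqrt
from statistics import mean, stdev, variance

esr = [3, 4, 5, 4, 2, 4, 5, 3]          # ESR of 8 normal individuals

esr_mean = mean(esr)                    # 30 / 8
esr_var = variance(esr)                 # sum of squared deviations / (n - 1)
esr_sd = stdev(esr)                     # square root of the variance
esr_se = esr_sd / sqrt(len(esr))        # standard error of the mean (assumed definition)

print(esr_mean, round(esr_var, 3), round(esr_sd, 3), round(esr_se, 3))
```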
Ex. Calculate the SD, variance and SE for the following data.
Class-interval   Frequency (f)   Mid-point (x)   (x - mean of x)^2   f × (x - mean of x)^2
16-27            2
27-38            3
38-49            4
49-60            4
60-71            3
71-82            7
82-93            4
Total            n = 27                                              sum
Fill in the mid-points, calculate the mean of x, then calculate the sum
$\sum f (X - \bar{X})^2$ and apply
$SD = \sqrt{\frac{\sum f (X - \bar{X})^2}{n - 1}}$
Calculate the SD, variance and SE for the following data.
Age (years)   No. of pts. (f)
25 - 34       15
35 - 44       25
45 - 54       8
55 - 64       2
Total         50
Coefficient of variation:-
It is a useful measure for comparing the variability of two characters
measured in different units, such as height and weight, or BP and
blood cell diameter.
$CV = \frac{SD}{\text{mean}} \times 100$
It expresses the size of the SD in relation to the size of the mean, converted to a
percentage.
Ex. In a series of boys, the mean systolic BP was 120 and the SD was 10. In the
same series the mean height and SD were 160 cm and 5 cm, respectively. Find
which character shows greater variation.
CV of BP = (10/120) × 100 = 8.3%
CV of height = (5/160) × 100 = 3.1%
Thus BP is found to be a more variable character than height, by 8.3/3.1 = 2.7 times.
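The comparison above can be sketched in a few lines. `coefficient_of_variation` is a hypothetical helper name.

```python
def coefficient_of_variation(sd, mean):
    """CV = (SD / mean) * 100, expressed as a percentage."""
    return sd / mean * 100

cv_bp = coefficient_of_variation(10, 120)       # systolic BP
cv_height = coefficient_of_variation(5, 160)    # height in cm

print(round(cv_bp, 1), round(cv_height, 1), round(cv_bp / cv_height, 1))
```

Because the CV is unit-free, the two percentages can be compared directly even though BP and height are measured in different units.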
Ex) A study was conducted to assess the effect of Vit. D3 supplementation
in type-2 DM patients. It recorded age, sex, Vit. D3 pre & post, HbA1c
pre & post, FBS pre & post, ...
This study includes both qualitative and quantitative variables. Gender
is the only qualitative (binary) variable; all the remaining variables are
quantitative and can be represented by the mean and standard deviation.
Study Variables N Mean SD
Age 78 56 9.152221
Vit D3 level 78 27.94987 13.53687
HbA1C level 78 7.964156 1.804317
FBS__pre 78 140.7143 42.39472
PPBS_pre 78 210.4935 74.03726
Vit D3_post 78 34.62117 11.79473
HbA1C_post 78 7.350649 1.616659
Quantitative variables represented by Mean and SD
Note: when some extreme values are present in quantitative data, the data should be
represented by the MEDIAN & INTERQUARTILE RANGE instead of the Mean and SD.
Sampling Techniques
DATA COLLECTION METHOD
What is Sampling?
• Sampling is a statistical procedure that is
concerned with the selection of the
individual observation; it helps us to make
statistical inferences about the population.
What is Population?
Population is the entire group under study.
E.g. all HIV +ve patients in Surat city.
What is Sample?
• A sample is a part (subset) of the population.
E.g. Population: HIV +ve patients in Surat city.
Sample: HIV +ve patients undergoing treatment at SMIMER.
[Figure: nested sets showing Target population, Study population and Sample]
Why sampling?
To get information about large populations, with:
• Less cost
• Less field time
• More accuracy, i.e. a better job of data collection can be done
• Feasibility when it is impossible to study the whole population
Types of sampling
• Non-probability sampling
• Probability sampling
Sampling Techniques:-
Probability sampling
1)Simple random sampling
2) Systematic sampling
3) Stratified random sampling
4)Cluster sampling
5) Multistage sampling
6) Multiphase sampling
Simple random sampling
What is it?
Every individual of population has an equal chance to be
selected.
When can we apply it?
When the population is small, homogeneous and readily available.
Eg) Patients coming to the hospital or admitted in the ward.
Methods of SRS:
• Lottery method
• Random number table
Table of random numbers
6 8 4 2 5 7 9 5 4 1 2 5 6 3 2 1 4 0
5 8 2 0 3 2 1 5 4 7 8 5 9 6 2 0 2 4
3 6 2 3 3 3 2 5 4 7 8 9 1 2 0 3 2 5
9 8 5 2 6 3 0 1 7 4 2 4 5 0 3 6 8 6
…………………….
Ex) Select a sample of 10 from a population of 300 female patients
attending the MCH.
Step 1) 300 is a three-digit figure, so three-digit numbers are read from the
first three rows of the random number table:
034, 977, 167, 125, 555, 162, 844, 630, 332, 576.
Step 2) Numbers greater than 300 are brought into the range 1-300 by subtracting
300 as many times as needed. The numbers selected for the sample will be
34, 77, 167, 125, 255, 162, 244, 30, 32, 276.
If any number is repeated, it is rejected and the next number is taken.
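In practice a computer's random number generator usually replaces the printed table. The following sketch draws a simple random sample without replacement using Python's standard `random` module; the function name `simple_random_sample` and the `seed` parameter are illustrative assumptions.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw serial numbers 1..population_size without replacement."""
    rng = random.Random(seed)            # seed only makes the draw repeatable
    serials = range(1, population_size + 1)
    return sorted(rng.sample(serials, sample_size))

# 10 patients out of 300 attending the MCH
chosen = simple_random_sample(300, 10, seed=1)
print(chosen)
```

`random.sample` never repeats a unit, so the "reject repeated numbers" step is handled automatically.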
Systematic sampling
Population: large, scattered and homogeneous
Process of selection of sample:-
$K = \text{sampling interval} = \frac{\text{Total population}}{\text{Desired sample size}}$
Example: a 10% sample is to be taken out of a population of 1000.
Desired sample size = 10% of 1000 = 100
K = 1000/100 = 10
Step-1 Calculate K.
Step-2 Select one number at random (from the random no. table) from 1 to K, here 1 to 10.
Step-3 Suppose it is 6; unit no. 6 is the first sample unit.
Step-4 Add K repeatedly: the second sample unit is 6 + 10 = 16,
the third is 16 + 10 = 26,
the fourth is 26 + 10 = 36, and so on.
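The steps above can be sketched as a short function. `systematic_sample` is a hypothetical helper; it assumes the population size is an exact multiple of the sample size and that the random start lies between 1 and K.

```python
def systematic_sample(population_size, sample_size, start):
    """Select every K-th unit, beginning from a random start between 1 and K."""
    k = population_size // sample_size        # sampling interval K
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and K")
    return [start + i * k for i in range(sample_size)]

# 10% sample from a population of 1000, random start 6
units = systematic_sample(1000, 100, start=6)
print(units[:4], units[-1])
```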
Stratified Sampling
Population: large and not homogeneous
The population is first divided into homogeneous groups.
These groups or classes are called strata.
A random sample is then drawn from each stratum.
Cluster sampling
Cluster Is a randomly selected group
Cluster: a group of sampling units close to each other i.e. crowding
together in the same area or neighborhood
Cluster sampling is an example of 'two-stage sampling' .
*First stage a sample of areas is chosen;
•Second stage a sample of respondents within those areas is selected.
*Population divided into clusters of homogeneous units, usually
based on geographical contiguity.
*Sampling units are groups rather than individuals.
*A sample of such clusters is then selected.
*All units from the selected clusters are studied.
Advantages :
Cuts down on the cost of preparing a sampling
frame.
This can reduce travel and other administrative
costs.
Disadvantage: sampling error is higher than for a
simple random sample of the same size.
Often used to evaluate vaccination coverage in
EPI.
•Identification of clusters
–List all cities, towns, villages & wards of cities in the target area
under study, with their populations.
–Calculate the cumulative population & divide the total by 30; this gives
the sampling interval.
–Select a random no. less than or equal to the sampling interval,
having the same no. of digits. The unit whose cumulative population
contains this number forms the 1st cluster.
–Random no. + sampling interval = population of 2nd cluster.
–Second cluster + sampling interval = 3rd cluster.
–Last or 30th cluster = 29th cluster + sampling interval.
Unit     Population   Cumulative population   Cluster no.
I        2000         2000                    1
II       3000         5000                    2
III      1500         6500
IV       4000         10500                   3
V        5000         15500                   4, 5
VI       2500         18000                   6
VII      2000         20000                   7
VIII     3000         23000                   8
IX       3500         26500                   9
X        4500         31000                   10
XI       4000         35000                   11, 12
XII      4000         39000                   13
XIII     5000         44000                   14, 15
XIV      2000         46000
XV       3000         49000                   16
XVI      3500         52500                   17
XVII     4000         56500                   18, 19
XVIII    4500         61000                   20
XIX      4000         65000                   21, 22
XX       4000         69000                   23
XXI      2000         71000                   24
XXII     2000         73000
XXIII    3000         76000                   25
XXIV     3000         79000                   26
XXV      5000         84000                   27, 28
XXVI     2000         86000                   29
XXVII    1000         87000
XXVIII   1000         88000
XXIX     1000         89000                   30
XXX      1000         90000
Sampling interval = 90000/30 = 3000
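The cluster identification procedure can be sketched as a look-up in the cumulative population column. This is a sketch under stated assumptions: `select_clusters` is a hypothetical helper, the random start of 1800 is only an example, and the population of unit XIII is taken as 5000 so that the running total agrees with the cumulative column (44000) and the grand total of 90000.

```python
from bisect import bisect_left
from itertools import accumulate

def select_clusters(populations, n_clusters, start):
    """Return the sampling interval and the index of the unit holding each cluster."""
    cumulative = list(accumulate(populations))
    interval = cumulative[-1] // n_clusters
    points = [start + i * interval for i in range(n_clusters)]
    # each point falls in the first unit whose cumulative population reaches it
    return interval, [bisect_left(cumulative, p) for p in points]

populations = [2000, 3000, 1500, 4000, 5000, 2500, 2000, 3000, 3500, 4500,
               4000, 4000, 5000, 2000, 3000, 3500, 4000, 4500, 4000, 4000,
               2000, 2000, 3000, 3000, 5000, 2000, 1000, 1000, 1000, 1000]

interval, picks = select_clusters(populations, 30, start=1800)
print(interval, picks)
```

Larger units can receive two clusters (for example unit V, index 4) and small units may receive none, which is how selection becomes proportional to population size.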
Multi stage Sampling
Employed in large country-wide surveys.
In the first stage a random number of districts is chosen in all the states.
Then talukas and villages are chosen in the second stage.
Then the third-stage units will be houses.
All ultimate units (houses, for
instance) selected at the last step are
surveyed.
MULTI PHASE SAMPLING
Part of the information is collected from the whole
sample & part from a subsample.
In a tuberculosis survey:
Phase I - physical examination or Mantoux test (MT) in all cases
Phase II - chest X-ray in Mantoux +ve cases
Phase III - sputum examination in X-ray +ve cases
A survey by such a procedure is less costly, less
laborious & more purposeful.
Non-probability Sampling Methods
•Convenience sampling
•Quota sampling
•Purposive sampling
Convenience Sampling
•It is non-probability sampling.
•The sample is selected as a matter of convenience, not
on the basis of probability theory.
•For example, in clinical practice, a doctor might use the
patients who are available to him/her.
Quota Sampling
Involves sampling a quota of units to be selected from
each population cell, based on the judgment of the
researchers and/or decision makers.
Steps
1) Divide the population into segments (referred to
as cells) based on certain control characteristics.
2) Determine the quota of units for each cell (quotas
are determined by the researchers and/or decision
makers).
3) Instruct the interviewers to fill the quotas assigned
to the cells.
Purposive Sampling
•If some characteristics of the population are known as a
result of a previous survey, samples are chosen
by purposive selection.
•As a result, certain features of the purposively selected
sample are likely to tally with those of the population.
•It is also used because of scarcity of time, a limited number of
investigators and scarcity of funds.
Sampling and Non-Sampling Errors…
Two major types of error can arise when a sample of
observations is taken from a population:
sampling error and non-sampling error.
Sampling error refers to differences between the sample
and the population that exist only because of the
observations that happened to be selected for the sample.
It is random, and we have no control over it.
Non-sampling errors are more serious and are due to
mistakes made in the acquisition of data or to the
sample observations being selected improperly. They are most likely
caused by poor planning, sloppy work, an act of the Goddess
of Statistics, etc.
Sampling Error…
Sampling error refers to differences between
the sample and the population that exist only
because of the observations that happened to
be selected for the sample.
Increasing the sample size will reduce this type
of error.
Non sampling errors are more serious and are due to
mistakes made in the acquisition of data or due to the
sample observations being selected improperly.
Three types of non sampling errors:
Errors in data acquisition,
Nonresponse errors, and
Selection bias.
Note: increasing the sample size will not reduce this
type of error.
Non sampling Error…
Errors in data acquisition…
• …arises from the recording of incorrect
responses, due to:
• — incorrect measurements being taken because of faulty
equipment,
• — mistakes made during transcription from primary sources,
• — inaccurate recording of data due to misinterpretation of
terms, or
• — inaccurate responses to questions concerning sensitive
issues.
Nonresponse Error…
• …refers to error (or bias) introduced when
responses are not obtained from some members
of the sample, i.e. the sample observations that
are collected may not be representative of the
target population.
• As mentioned earlier, the Response Rate (i.e. the
proportion of all people selected who complete the
survey) is a key survey parameter and helps in the
understanding in the validity of the survey and
sources of nonresponse error.
Selection Bias…
• …occurs when the sampling plan is such
that some members of the target
population cannot possibly be selected for
inclusion in the sample.
Exercise for Journal
1) What is the median of the following set of scores?
18, 6, 12, 10, 14
2) We consider observations reporting the eye color of a group of 15
people: Brown, Brown, Blue, Brown, Green, Gray, Blue, Blue, Green,
Brown, Gray, Brown, Brown, Blue, Green.
1.Construct a frequency table.
2. Draw the associated bar graph.
3) Ten patients at a doctor’s surgery wait for the following lengths of
times to see their doctor. 5 mins ,17 mins, 8 mins ,2 mins, 55 mins, 9
mins, 22 mins ,11mins, 16 ,5 mins .What are the mean, median and mode
for these data? What measure of central tendency would you use here?
4) Calculate the mean and standard deviation of the following set of data.
Birth weight of ten babies (in kilograms) 2.977 3.155 3.920 3.412 4.236
2.593 3.270 3.813 4.042 3.387
5. In a survey of sleep apnea scores among 10
people, the highest score of 58 was entered by
mistake as 85. This will affect the result as
1.Increased mean, increased median
2.Increased mean, no change in median
3.No change in mean, increased median
4.Increased mean, decreased median
Identify the diagram shown. Name the given graph.
[Figure]
1.Histogram
2.Line diagram
3.Box and Whisker plot
4.Kaplan-Meier plot
Likert scale is?
1.Ordinal scale
2.Nominal scale
3.Variance scale
4.Categorical scale
The individuals in a village population are
arranged alphabetically and every
8th person is selected for the study. The type
of sampling is
1.Simple random sampling
2.Stratified random sampling
3.Systematic random sampling
A study done on a group of patients showed the coefficient of
variation of BP and serum creatinine to be 20% & 15%
respectively. The inference is that
1.Variation of BP is more than in serum creatinine
2.Variation in serum creatinine is more than in BP
3.The standard deviation of BP is more than of creatinine
4.The standard deviation of creatinine is more than of BP
Which is not a measure Of dispersion?
1.Mean deviation
2.Standard deviation
3.Mode
4.Range
Scatter diagram represents
1.Frequency of occurrence
2.Trend over time
3.Correlation / Association
4.None of the above
Research selected all possible samples from a
population and plotted their means on a line graph.
This distribution is called as
1.Sample distribution
2.Sampling distribution
3.Population distribution
4.Parametric distribution
Measuring relative variation between two different units is done by
1.variance
2.coefficient of variation
3.standard deviation
4.range
The median weight of 100 children was 12 kg and it formed a
normal distribution. The standard deviation was 3. Calculate the
percentage of coefficient of variation.
1.25%
2.35%
3.45%
4.55%
Stratified sampling is ideal for
1.Heterogenous data
2.Homogenous data
3.Both
4.None
Which of the following is/are non-random sampling
methods-
a) Quota sampling
b) Stratified random sampling
c) Convenience Sampling
d) Cluster Sampling
1.ab
2.bc
3.ac
4.cd
10.True statements with regard to sampling-
a) Snowball sampling is used for a hidden population
b) More sample in systemic random sampling
c) In stratified random sampling, the population is divided
into strata
d) Cluster sampling is less cost-effective
1.ab
2.bc
3.ac
4.cd
11.The upper and lower limit of standard errors
within which a parameter value is expected to lie
are called as
1.confidence interval
2.confidence limit
3.precision levels
4.accuracy limit
Ans (confidence limit)
12.Evidence-based medicine, which of the following is not useful –
a) Personal exposure
b) RCT
c) Case report
d) Meta-analysis
e) Systematic review
1.ab
2.bc
3.ac
4.cd
Ans (ac)
13.If mean is 230 and standard error is 10 then 95% of
confidence limit is
1.210-250
2.250-290
3.290-330
4.190-210
Ans (210-250)
17.Which one of the following is not a measure of dispersion –
1.Mean
2.Range
3.Mean deviation
4.Standard deviation
Ans (Mean)
19.In a normal curve, the area of one
standard deviation around the mean
includes which of the following
percent of values in a distribution –
1.0.486
2.0.683
3.0.954
4.0.997
Ans (0.683)
The ______ of a set of values is that value which
occurs most frequently.
(a) Mean
(b) The Mode
(c) Median
(d) Standard deviation.
The SD is an appropriate measure of spread when centre is measured
with the
a) Mean
b) Median
c) Mode
d) None of the above
The PEFR of a group of 11 year old
girls follow a normal distribution with
mean 300 l/min and standard
deviation 20 l/min:
A. About 95% of the girls have PEFR
between 260 and 340 l/min
B. The girls have healthy lungs
C. About 5% of girls have PEFR below
260 l/min
D. All the PEFR must be less than 340
l/min
Correct answer : A. About 95% of the
girls have PEFR between 260 and 340
l/min
•95.4% of values lie within 2 SD
(standard deviations) of the mean
•Here, SD = 20 l/min
•Hence 95.4% of values lie between
300 - (2 × 20) and 300 + (2 × 20)
•Which translates into: about 95%
of the girls have PEFR between 260
and 340 l/min

  • 14.
    MEASURE OF CENTRALTENDANCY MEAN MEDIAN MODE The average of the data The middle value of the data most commonly occurring value
  • 16.
    Objectives: •Students should knowwhich of the measures of central tendency(mode, median, mean) are appropriate for the different types of variables. •Students should be able to determine the mode, median, or mean from data presented in a table or graph.
  • 17.
    Relation in betweenMeasure of central tendency
  • 18.
    MEAN UNGROUPED DATA GROUPEDDATA MEAN = n x n i    1 i x n x n i    1 i ix f MEAN = ns observatio of no. total ns observatio of Sum
  • 19.
    Ungrouped Data:- Raw dataor only the information regarding the Quantitative variable (i.e. weight, height, BMI , Age, ……….) is ungrouped data. Birth weight of new born are : 3.3kg, 3.4kg, 3.3kg,3.8kg,2.7kg,4.1kg,3.4kg,3.9kg,5.1kg,3.3kg,…………………
  • 20.
    Grouped Data:- The rawdata is categorized in to various groups after collection of data ,and has been organized in frequency distribution (i.e Specific variable and its frequency)…. Day of confinement No of patients 6 5 7 4 8 4 9 3 10 2 Age of persons at death No. died (f) 0-5 100 5-10 40 10-15 20 15-20 25 20-25 20 25-30 23 30-35 12 35-40 15 Discrete Grouped data Continuous Grouped data
  • 21.
    Duration of sickness(in days) in 10 patients is given:- 9, 7, 8 ,10 ,73 ,5 ,6 ,8 ,9 , 8 Mean = 14.3 Median = 8 days Example of Ungrouped Data
  • 22.
    Calculation of Meanfor Ungrouped and Grouped Data
  • 23.
    •Birth weight ofnew born : 3.3kg,6.1kg, 5.8kg,3.8kg,2.7kg,4.1kg,3.4kg,3.9kg,5.1kg,3kg. n x n i    1 i x =41.2/10 =4.12 kg •Tuberculin test reaction of 10 boys is arranged in ascending order being measured In millimeters. Find the mean size of reaction. 3,5,7,7,8,8,9,10,11,12 Mean = 3+5+7+7+8+8+9+10+11+12/10 = 80/10 = 8 mm Example of Ungrouped Data
  • 24.
    For Grouped data(Discrete) Findmean days of confinement after delivery in the following series:- Day of confinement No. of patients 6 5 7 4 8 4 9 3 10 2 Solution:- Day of confinement (x) No. of patients (f) X*f 6 5 30 7 4 28 8 4 32 9 3 27 10 2 20
  • 25.
    n x n i    1 i i x f =137/18 =7.61 *Find the mean age at the death in years in the following series:- Age of persons at death No. died (f) 0-5 100 5-10 40 10-15 20 15-20 25 20-25 20 25-30 23 30-35 12 35-40 15
  • 26.
    Age of persons atdeath No. died (f) Mid-pint(x) 0-5 100 5-10 40 10-15 20 15-20 25 20-25 20 25-30 23 30-35 12 35-40 15 2 limit lower limit Upper  0 - 5 Upper limit Lower limit * When you have continuous grouped data very first step is to check the class-inter , is it continues or not .(discontinuous class interval should convert discontinuous to continuous class interval( + 0.5 to upper limit and - 0.5 to lower limit)
  • 27.
    Age of persons at death No.died (f) mid point (x) fx 0-5 100 2.5 250 5-10 40 7.5 300 10-15 20 12.5 250 15-20 25 17.5 437.5 20-25 20 22.5 450 25-30 23 27.5 632.5 30-35 12 32.5 390 35-40 15 37.5 562.5 255 3272.5 n x n i    1 i i x f = 12.83333 Sum n fi     n i 1 i ix f
  • 28.
    Median The middle valuein a set of data. Exactly half of the values lie below it, and half lie above it. The median can be used as a measure of central tendency for ordinal or quantitative variables. Perceptions of Importance of Flu Vaccination for Pregnant Women Visiting Prenatal Clinic #, (%) Flu Vaccination: Not at all important 40 (10%) Somewhat important 80 (20%) Very important 280 (70%) Total 400 (100%) There are 400 observations: the median is the 200th -201st observation, which falls in the “Very important” category
  • 29.
    For ungrouped data:- Step-1Arranged data in ascending or descending order. Step:-2 If total no. of observations ‘n’ is even then used the following formula for median= arithmetic mean of two middle observations. Step:-3 If total no. of observations ‘n’ is odd then used the following formula for median . 2 1 n observatio th n  
  • 30.
    The number ofpatients that visited a doctor for consultation for 10 consecutive days is arranged in an increasing order in the following table .Find out median number of the patients that visited the doctor per day. 8,10,12,14,16,18,19,20,22,25. n = 10 Arithmetic mean of 10 observations = 10/2 = 5 th observation Arithmetic mean of 10 observations+1 =6 th observation Median = 5th observation + 6th observation / 2 = 16+18/2 = 17 Calculate the median for the following series :- 2,3,5,1,4,5,8 1,2,3,4,5,5,8. Median . 2 1 n observatio th n  
  • 31.
    Median:- f h c n l         2 l = lowerlimit of class interval where the median occurs f = Frequency of the class where median occurs h = Width of the median class C= Cumulative frequency of the class preceding the median class Formula for continuous grouped data
  • 32.
    For grouped Data:- Classinterval Frequency 5-9 2 10-14 11 15-19 26 20-24 17 25-29 8 30-34 6 35-39 3 40-44 2 45-49 1 Calculate the median for the following data series:- In the given example the class interval is Discontinuous ….. Convert it in to discontinuous to continuous
  • 33.
    Solution:- Class interval Frequency cumulative frequency 5-9 4.5-9.52 2 10-14 9.5-14.5 11 13 15-19 14.5-19.5 26 39 20-24 19.5-24.5 17 56 25-29 24.5-29.5 8 64 30-34 29.5-34.5 6 70 35-39 34.5-39.5 3 73 40-44 39.5-44.5 2 75 45-49 44.5-49.5 1 76 n=76 n/2= 38 l = 14.5 h = 5 f = 26 C = 13 f h c n l Median          2 =19.31 f cf
  • 34.
    The mode canbe obtained for any type of variable, whether nominal, ordinal, or quantitative (continuous or discrete). However, the mode is the only measure of central tendency that can used for nominal data. Mode
  • 35.
    Mode for ungroupeddata:- 2,2,3,4,6,7,4,4,4,4,8,9,0 mode is 4 10,10,3,3,4,2,1,6,7 mode is 10 and 3 10,34,23,12,11,3,4 no mode -In some cases ,for extraneous reasons, the interest is to identify a value which is most common. -Consider incubation period of measles. Perhaps ,to be able to prevent disease in highest number of cases, it is desirable that the representative value is estimated by the most common period of incubation .Then in this case mode will be appropriate measure of central tendency.
  • 36.
    Example: the modefor a nominal variable Chief complaints of a sample of patients presenting to the Emergency Department (n=614) in November, 2008 Chief Complaints Frequency Percent Chest Pain 183 29.8% Trauma/Accident 137 22.3% Belly Pain 98 16.0% Childbirth 44 7.2% Dyspnea 41 6.7% Fever 39 6.3% Other 72 11.7% Total 614 100.0% The mode is “Chest Pain
  • 37.
    Mode:- c f f f f f l m m             ) ( 2 ) ( 2 1 1 l =lower limit of the class where mode occurs fm= maximum frequency of the class interval observed f1 = frequency of the class preceding to modal class f2 = frequency of the class interval succeeding to modal class C= class- interval Formula for continuous grouped data
  • 38.
    For grouped data Calculatethe mode for the following frequency distribution:- IQ Range Frequency 90-100 11 100-110 27 110-120 36 120-130 38 130-140 43 140-150 28 150-160 16 160-170 1 fm f1 f2
  • 39.
    Modal class byinspection is 130-140 fm= 43 f1= 38 f2= 28 C=10 l = 130 c f f f f f l m m             ) ( 2 ) ( 2 1 1 =130.6579
  • 40.
    The most importantfactor to consider is: the level of measurement or type of variable Nominal variables: the mode is the only appropriate measure of central tendency. Ordinal variables: both the mode and median are used. Quantitative variables: mode, median, and mean can all be used. Mean is usually the measure of choice because it is a unique value, it uses all values in the data set, and it can be used in subsequent analyses. However, the mean is subject to extreme (high or low) values (outliers). If the data set contains extreme values, calculate median.
  • 41.
  • 42.
  • 43.
    RANGE =MAX VALUE– MIN VALUE Ex. Hb % per 100 cc of 15 persons was as follows .Calculate the range. 11.5,13.8,14.3,11.7,13.1,14.5,11.8,14.0,14.7,12.5,14.1,14.8,12.9,14.2,14.9 Step-1 Arrange the data in ascending or descending order. Step-2 Range = highest value-lowest value.
  • 44.
    Quartile Deviation:- Inthis method the ,the series is divide in four equal part or Quarters .These are represented as Q1 ,Q2 and Q3 .The distance between the third quartile and first quartile represent the quartile deviation. Q1 Q2 Q3 Lowest observation Highest observation Q.D for ungrouped data:- 2 1 3 Q Q Q   Where Q = Q.D Q3 =3rd quartile Q1 = 1st quartile
  • 45.
    Q.D for ungroupeddata:- Find the Q.D for the following data series:- 8,12,13,9,11,17,23,25,20,21,27. Step:-1 Arrange the data in ascending order 8,9,11,12,13,17,20,21,23,25,27. Step-2 Find out first and third quartile   value N Q value N Q th th 4 1 3 & 4 1 3 1           Step-4 Q1 = 11+1/4 = 3 rd value = 11 Q3= 3(11+1)/4 = 9 th value = 23 Step- 5 2 1 3 Q Q Q   = 23-11/2 = 6
  • 46.
    Q.D for groupeddata (Discrete series) :- Step-1 Calculate the frequency and c.f from the data given. Step:-2 Calculate the lower and upper quartile heights by using the formula   4 1 3 & 4 1 3 1           N Q N Q Step:-3 Apply the value of Q1 and Q3 in the formula 2 1 3 Q Q Q  
  • 47.
    Eg) Frequency distributionof height in cm of 387 students in a school is given in the table below .Find the inter quartile range and Q.D of height distribution. Height in cmc No. of students 150 28 152 40 154 52 156 100 158 60 160 48 162 32 164 20 166 7
  • 48.
    Height in cmc No.of students c.f 150 28 28 152 40 68 154 52 120 156 100 220 158 60 280 160 48 328 162 32 360 164 20 380 166 7 387 Step:-1 Find the C.F Step:-2   4 1 3 & 4 1 3 1           N Q N Q Q1 of the data = 97th students and Q3 of the data = 291st students 97 student is included in the 3rd group having height 154 cm. 291 student is including in the 6th group having height 160 cm Q1 = 154, Q3= 160 .
  • 49.
    Inter Quartile range= (Q1-Q3) = 154- 160cm Q.D = 3
  • 50.
    Q.D for thegrouped data ( continuous series):- h f cf N l Q h f cf N l Q                     4 3 4 3 1 Apply the value of Q1 and Q3 in the formula 2 1 3 Q Q Q  
  • 51.
    Eg. Water percentagein the body of species of fish and their frequency is given in the table below. Calculate the Q.D. sr.no Class interval fre 1 16-20 4 2 21-25 3 3 26-30 8 4 31-35 9 5 36-40 14 6 41-45 3 7 46-50 3 8 51-55 2 9 56-60 2 10 61-65 2
  • 52.
    sr.no Class interval fre c.f 116-20 15.5-20.5 4 4 2 21-25 20.5-25.5 3 7 3 26-30 25.5-30.5 8 15 Q1 4 31-35 30.5-35.5 9 24 5 36-40 35.5-40.5 14 38 Q3 6 41-45 40.5-45.5 3 41 7 46-50 45.5-50.5 3 44 8 51-55 50.5-55.5 2 46 9 56-60 55.5-60.5 2 48 10 61-65 60.5-65.5 2 50
  • 53.
    N= 50 Q1 =N/4 = 12.5 Q3 = 3N/4 = 37.5 l = lower limit of class-interval in which Q1 lies = 25.5 l = lower limit of class-interval in which Q3 lies = 36.5 h f cf N l Q           4 1 l= 25.5 Cf= 7 f = 8 h = 5 = 28.93
  • 54.
  • 55.
    Mean deviation:-As themean of all the deviations in a given set of data obtained from an average. M.D for ungrouped data :-   N X X   Calculate the mean deviation from the following data :- X 15 17 19 25 30 35 48 X (X- mean)=deviation 15 -12 17 -10 19 -8 25 -2 30 3 35 8 48 21 64 M.D=   N X X   Mean= 27
  • 56.
    M.D for groupeddata :- Calculate the M.D for the given data series :- Class-interval fre 0-4 4 4-8 6 8-12 8 12-16 5 16-20 2
  • 57.
    Solution:- Class-interval frequency mid-valuefx x- mea n l F*(x-mean) l 0-4 4 2 8 -7.2 -28.8 4-8 6 6 36 -3.2 -19.2 8-12 8 10 80 0.8 6.4 12-16 5 14 70 4.8 24 16-20 2 18 36 8.8 17.6 96 Mean = n x n i    1 i i x f = 9.2 Sum of multiplication of each frequency and deviation from mean.
  • 58.
  • 59.
    Standard deviation:- S.Dis an important measure of dispersion. •A large S.D shows that the measurements of the frequency distribution are widely spread out from the mean.. eg. 10 mm in case of BP. • A small S.D shows that the measurements of the frequency distribution are closely spread in the neighborhood of mean. eg 2cm in case of height. • SD helps us to predict how far a given value is always from mean. Use of SD:- When populations are combined or when samples are combined their SD pooled after appropriate reasoning. Eg. Comparison of Surgical treatment for cancer lung from two studies may require knowledge of respective SD which are subsequently pooled
  • 60.
    Calculate SD forungrouped data:-   1 2     n X X SD For grouped data (discrete series):-   1 2     n X X f SD
  • 61.
    For grouped data(continuous):-   n X X f SD    2 OR 2 2             n fx n fx SD
  • 62.
    Ex. Find theSD, variance and SE of the ESR ,found to be 3,4,5,4,2,4,5 and 3 in 8 normal individuals. 3 4 5 4 2 4 5 3 i X 2 ) ( X Xi  Step- 1 , Calculate the mean of x. sum n x n i    1 i x sum
  • 63.
    Ex. Find theSD of the ESR ,found to be 3,4,5,4,2,4,5 and 3 in 8 normal individuals 3 4 0.0625 5 1.5625 4 0.0625 2 3.0625 4 0.0625 5 1.5625 3 0.5625 30 7.5 3.75 i X 2 ) ( X Xi  MEAN sum 0.5625   1 2     n X X SD = 7 5 . 7 = 1.03
  • 64.
    Class- interval Frequency 16-272 27-38 3 38-49 4 49-60 4 60-71 3 71-82 7 82-93 4 Total 27 Ex. Calculate the SD , Variance and SE for the following data .
  • 65.
    Class- interval Fre Mid-point(x)(x-Mean of X)2 F*(x-mean of x)2 16-27 2 27-38 3 38-49 4 49-60 4 60-71 3 71-82 7 82-93 4 Total 27 n Calculate the sum
  • 66.
    Class- interval FrequencyMid-point(x) (x-Mean of X)^2 f(x-mean of x)^2 16-27 2 27-38 3 38-49 4 49-60 4 60-71 3 71-82 7 82-93 4 Total 27 n Mean of x     2 X X f
  • 67.
  • 68.
    Age (years ) No. ofPts. (f) 25 - 34 35 - 44 45 - 54 55 - 64 15 25 8 2 50 Calculate the SD , Variance and SE for the following data .
  • 69.
    Coefficient of variation:- itis one of the useful terms which is used to compare the variability of two diverse population with different units of Measure like height by weight , BP by blood cell diameters. 100   mean SD CV It express the size of SD in relation to the size of mean and further converted to percentage. Ex. In a Series of boys ,the mean systolic BP was 120 and SD was 10 .In the same series mean height and SD were 160 cm and 5 cm ,respectively. Find which character show greater variation? CV of BP = 8.3% CV of height = 3.1% Thus , BP found to be a more character than height, 8.3/3.1 =2.7times.
  • 70.
    Ex)The study wasconducted to know the effect of Vit.D3 supplementation in DM type-2 patients ,which includes age, sex,vit D3 pre &post,HbA1c pre& post, FBS pre and post ……….
  • 71.
    This study includesqualitative and quantitative both variables, where gender is only qualitative (binary) variable reaming all others are quantitative, Which can be represented by Mean and Standard deviation.
  • 72.
    Study Variables NMean SD Age 78 56 9.152221 Vit D3 level 78 27.94987 13.53687 HbA1C level 78 7.964156 1.804317 FBS__pre 78 140.7143 42.39472 PPBS_pre 78 210.4935 74.03726 Vit D3_post 78 34.62117 11.79473 HbA1C_post 78 7.350649 1.616659 Quantitative Variables which represented by Mean and SD Note: when some extreme values present in the given data(quantitative) ,which can be represented as MEDIAN &INTERQUARTILE RANHGE instead of Mean and SD
  • 73.
    Sampling Techniques D ATACOLLECTION METHOD
  • 74.
    What is Sampling? •Sampling is a statistical procedure that is concerned with the selection of the individual observation; it helps us to make statistical inferences about the population.
  • 75.
    What is Population? Populationis an entire group of study. +ve patients of HIV in Surat city Population
  • 76.
    What is Sample? •Sample is the part of Population. +ve patients of HIV in Surat city +ve patient of HIV under taking the treatment in SMIMER Population Sample
  • 77.
  • 78.
  • 80.
    Why sampling? Get informationabout large populations  Less costs  Less field time  More accuracy i.e. Can Do A Better Job of Data Collection  When it’s impossible to study the whole population
  • 81.
    Types of sampling •Non-probability sampling • Probability sampling
  • 82.
    Sampling Techniques:- Probability sampling 1)Simplerandom sampling 2) Systematic sampling 3) Stratified random sampling 4)Cluster sampling 5) Multistage sampling 6) Multiphase sampling
  • 83.
    Simple random sampling Whatis it? Every individual of population has an equal chance to be selected. When we can apply? When the population is Small, Homogeneous and readily available. Eg) Patients coming to the Hospital or admitted in the ward. SRS Lottery Method Random Number Table
  • 84.
    Table of randomnumbers 6 8 4 2 5 7 9 5 4 1 2 5 6 3 2 1 4 0 5 8 2 0 3 2 1 5 4 7 8 5 9 6 2 0 2 4 3 6 2 3 3 3 2 5 4 7 8 9 1 2 0 3 2 5 9 8 5 2 6 3 0 1 7 4 2 4 5 0 3 6 8 6 …………………….
  • 85.
    EX) Select asample of 10 from a population of 300 female patients attending the MCH. ---- Step 1 ) 300 is the three digit figure. First three rows of the random table are chosen. 034 ,977 ,167 , 125 , 555 , 162 , 844 , 630 , 332 , 576 . The number selected for the sample will be 34 , 77,167 ,125,255,162,244,32,276, If some numbers repeated ,they can be rejected .
  • 86.
    Systematic sampling Population Large ,Scatteredand Homogeneous Process of selection of sample:- desired Size Sample Population Total Fraction Sampling K   10% of sample to be taken out of 1000 population
  • 87.
    10 1000 % 10 1000   of K Step- 1 Calculatethe K. Step:- 2 Select any one number randomly (from random no. table) from 1 to 10. Step:- 3 Supposing it is 6 . Step :- 4 for second sample no 10+6 = 16 For third sample 16+10 =26 26+10 = 36 and so on.
  • 88.
    Stratified Sampling Population Largeand not Homogeneous The population first we divided in the homogeneous group That groups or classes are called strata
  • 89.
    Cluster sampling Cluster Isa randomly selected group Cluster: a group of sampling units close to each other i.e. crowding together in the same area or neighborhood
  • 90.
    Cluster sampling isan example of 'two-stage sampling' . *First stage a sample of areas is chosen; •Second stage a sample of respondents within those areas is selected. *Population divided into clusters of homogeneous units, usually based on geographical contiguity. *Sampling units are groups rather than individuals. *A sample of such clusters is then selected. *All units from the selected clusters are studied.
  • 91.
    Advantages : Cuts downon the cost of preparing a sampling frame. This can reduce travel and other administrative costs. Disadvantages: sampling error is higher for a simple random sample of same size. Often used to evaluate vaccination coverage in EPI
  • 92.
    •Identification of clusters –Listall cities, towns, villages & wards of cities with their population falling in target area under study. –Calculate cumulative population & divide by 30, this gives sampling interval. –Select a random no. less than or equal to sampling interval having same no. of digits. This forms 1st cluster. –Random no.+ sampling interval = population of 2nd cluster. –Second cluster + sampling interval = 4th cluster. –Last or 30th cluster = 29th cluster + sampling interval
  • 93.
    • Freq cf cluster • I 2000 2000 1 • II 3000 5000 2 • III 1500 6500 • IV 4000 10500 3 • V 5000 15500 4, 5 • VI 2500 18000 6 • VII 2000 20000 7 • VIII 3000 23000 8 • IX 3500 26500 9 • X 4500 31000 10 • XI 4000 35000 11, 12 • XII 4000 39000 13 • XIII 3500 44000 14,15 • XIV 2000 46000 • XV 3000 49000 16 • XVI 3500 52500 17 • XVII 4000 56500 18,19 • XVIII 4500 61000 20 • XIX 4000 65000 21,22 • XX 4000 69000 23 • XXI 2000 71000 24 • XXII 2000 73000 • XXIII 3000 76000 25 • XXIV 3000 79000 26 • XXV 5000 84000 27,28 • XXVI 2000 86000 29 • XXVII 1000 87000 • XXVIII 1000 88000 • XXIX 1000 89000 30 • XXX 1000 90000 • 90000/30 = 3000 sampling interval
  • 94.
    Multi stage Sampling Employeein large country survey In the first stage random no. of district are chosen in all the stage Then talukas , villages Then third stage units will be houses. All ultimate units (houses, for instance) selected at last step are surveyed.
  • 95.
    MULTI PHASE SAMPLING Partof the information collected from whole sample & part from subsample. In Tb survey MT in all cases – Phase I X –Ray chest in MT +ve cases – Phase II Sputum examination in X – Ray +ve cases - Phase III Survey by such procedure is less costly, less laborious & more purposeful
  • 96.
    Multiphase sampling:- In Tuberculosis Survey FirstPhase Physical examination or Manteux test (In +ve patients ) Chest X-ray may be done in Mantoux +ve test Sputum may be examine in X-ray +ve cases
  • 97.
    Non probablity SamplingMethods •Convenience Sampling •Quota sampling •Purposive sampling
  • 98.
    •It is nonprobability sampling. •Sample is selected as a matter of convenience not bases on the probability theory . •For example , in clinical practice , doctors might uses patients who are available to him/her. Convenience Sampling
  • 99.
    Involves sampling aquota of units to be selected from each population cell based on the judgment of the researchers and/or decision makers Steps 1) Divide the population into segments (referred to as cells) based on certain control characteristics 2) Determine the quota of units for each cell (quotas are determined by the researchers and/or decision makers) 3) Instruct the interviewers to fill the quotas assigned to the cells Quota Sampling
  • 100.
    •Purposive sampling •If somecharacteristics of the population are known as a result of previous survey, samples are chosen by purposive selection . •As result ,certain features of sample selected purposively are likely to tally with those of population . •Also due to scarcity of time , limitation of investigators and scarcity of funds.
  • 101.
    Sampling and Non-SamplingErrors… Two major types of error can arise when a sample of observations is taken from a population: sampling error and no sampling error. Sampling error refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample. Random and we have no control over. Non sampling errors are more serious and are due to mistakes made in the acquisition of data or due to the sample observations being selected improperly. Most likely caused be poor planning, sloppy work, act of the Goddess of Statistics, etc.
  • 102.
    Sampling Error… Sampling errorrefers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample. Increasing the sample size will reduce this type of error.
  • 103.
    Non sampling errorsare more serious and are due to mistakes made in the acquisition of data or due to the sample observations being selected improperly. Three types of non sampling errors: Errors in data acquisition, Nonresponse errors, and Selection bias. Note: increasing the sample size will not reduce this type of error. Non sampling Error…
  • 104.
    5.104 Errors in dataacquisition… • …arises from the recording of incorrect responses, due to: • — incorrect measurements being taken because of faulty equipment, • — mistakes made during transcription from primary sources, • — inaccurate recording of data due to misinterpretation of terms, or • — inaccurate responses to questions concerning sensitive issues.
  • 105.
    5.105 Nonresponse Error… • …refersto error (or bias) introduced when responses are not obtained from some members of the sample, i.e. the sample observations that are collected may not be representative of the target population. • As mentioned earlier, the Response Rate (i.e. the proportion of all people selected who complete the survey) is a key survey parameter and helps in the understanding in the validity of the survey and sources of nonresponse error.
  • 106.
    5.106 Selection Bias… • …occurswhen the sampling plan is such that some members of the target population cannot possibly be selected for inclusion in the sample.
  • 107.
  • 108.
    1) What isthe median of the following set of scores? 18, 6, 12, 10, 14 2) We consider observations reporting the eye color of a group of 15 people: Brown, Brown, Blue, Brown, Green, Gray, Blue, Blue, Green, Brown, Gray, Brown, Brown, Blue, Green. 1.Construct a frequency table. 2. Draw the associated bar graph. 3) Ten patients at a doctor’s surgery wait for the following lengths of times to see their doctor. 5 mins ,17 mins, 8 mins ,2 mins, 55 mins, 9 mins, 22 mins ,11mins, 16 ,5 mins .What are the mean, median and mode for these data? What measure of central tendency would you use here? 4) Calculate the mean and standard deviation of the following set of data. Birth weight of ten babies (in kilograms) 2.977 3.155 3.920 3.412 4.236 2.593 3.270 3.813 4.042 3.387
  • 110.
    5. In asurvey of sleep apnea scores among 10 people, the highest sample of 58 was entered by mistake as 85. This will affect the result as 1.Increased mean, increased median 2.Increased mean, no change in median 3.Non-change in mean, increase median 4.Increased mean, decreased median
  • 111.
    1.Histogram 2.Line diagram 3.Box andWhisker plot 4.Kaplan Meyer plot Identify the diagram shown
  • 112.
  • 113.
    Likert scale is? 1.Ordinalscale 2.Nominal scale 3.Variance scale 4.Categorical scale
  • 114.
    The individual ina village population is arranged alphabetically and every 8th person is selected for the study. The type of study is 1.Simple random sampling 2.Stratified random sampling 3.Systemic random sampling
  • 115.
    A study doneon a group of patients showed a coefficient of variance of BP and serum creatinine to be 20% & 15% respectively. Inference is that 1.Variation of BP is more than in serum creatinine 2.Variation in serum creatinine is more than in BP 3.The standard deviation of BP is more than of creatinine 4.The standard deviation of creatinine is more than of BP
  • 116.
    Which is nota measure Of dispersion? 1.Mean deviation 2.Standard deviation 3.Mode 4.Range
  • 117.
    Scatter diagram represents 1.Frequencyof occurrence 2.Trend over time 3.Correlation / Association 4.None of the above
  • 118.
    Research selected allpossible samples from a population and plotted their means on a line graph. This distribution is called as 1.Sample distribution 2.Sampling distribution 3.Population distribution 4.Parametric distribution
  • 119.
    Measuring relative variationbetween two different units is done by 1.variance 2.coefficient of variation 3.standard deviation 4.range
  • 120.
    The median weightof 100 children was 12 kg and it formed a normal distribution. The standard deviation was 3. Calculate the percentage of coefficient of variation. 1.25% 2.35% 3.45% 4.55%
  • 121.
    Stratified sampling isideal for 1.Heterogenous data 2.Homogenous data 3.Both 4.None
  • 122.
    Which of thefollowing is/are non-random sampling methods- a) Quota sampling b) Stratified random sampling c) Convenience Sampling d) Cluster Sampling 1.ab 2.bc 3.ac 4.cd
  • 123.
    10.True statements withregard to sampling- a) Snowball sampling is used for a hidden population b) More sample in systemic random sampling c) In stratified random sampling, the population is divided into strata d) Cluster sampling is less cost-effective 1.ab 2.bc 3.ac 4.cd
  • 124.
    11.The upper andlower limit of standard errors within which a parameter value is expected to lie are called as 1.confidence interval 2.confidence limit 3.precision levels 4.accuracy limit Ans (confidence limit)
  • 125.
    12.Evidence-based medicine, whichof the following is not useful – a) Personal exposure b) RCT c) Case report d) Meta-analysis e) Systemic review 1.ab 2.bc 3.ac 4.cd Ans (ac)
  • 126.
    13.If mean is230 and standard error is 10 then 95% of confidence limit is 1.210-250 2.250-290 3.290-330 4.190-210 Ans (210-250)
  • 127.
    17.Which one ofthe following is not a measure of dispersion – 1.Mean 2.Range 3.Mean deviation 4.Standard deviation Ans (Mean)
  • 128.
    19.In a normalcurve, the area of one standard deviation around the mean includes which of the following percent of values in a distribution – 1.0.486 2.0.683 3.0.954 4.0.997 Ans (0.683)
  • 129.
    Of a setof values is that value which occurs most frequently. (a) Mean (b) The Mode (c) Median (d) Standard deviation.
  • 130.
    The SD isan appropriate measure of spread when centre is measured with the a) Mean b) Median c) Mode d) None of the above
  • 131.
    The PEFR ofa group of 11 year old girls follow a normal distribution with mean 300 l/min and standard deviation 20 l/min: A. About 95% of the girls have PEFR between 260 and 340 l/min B. The girls have healthy lungs C. About 5% of girls have PEFR below 260 l/min D. All the PEFR must be less than 340 l/min Correct answer : A. About 95% of the girls have PEFR between 260 and 340 l/min •95.4% of values lie within 2 SD (standard deviation) of the mean •Here, SD = 20 l/min •Hence 95.4% of values lie within 300-(2*20) and 300=(2*20) •Which translates into : About 95% of the girls have PEFR between 260 and 340 l/min