STATISTICS
A specialized branch of mathematics - RA Fisher
Statistics : used in both the singular (the science) and the plural (numerical data) sense
It deals with QUANTITATIVE data ; a population may be FINITE or INFINITE
DIAGRAM
 Simple bar : single character ; Multiple bar : multiple characters (both ONE-dimensional)
 Component bar : bar height depends on the TOTAL
 Percentage bar : bar height is the SAME (100%) for all
 Pie chart : each component of a factor = a SECTOR ; alternative : STEP BAR diagram
 Bar chart : base on the vertical axis, bars horizontal ; Column chart : base on the horizontal axis, bars vertical
GRAPH : Graphical representations for grouped quantitative data
HISTOGRAM :
 Data are classified by class intervals
 Suitable for locating the MODE
 Class intervals should be EQUAL ; if not, bar height is proportional to frequency DENSITY
 No gap between bars because the classes are CONTINUOUS
 With equal class intervals, bar height = frequency of the respective class
FREQUENCY POLYGON : dots against the mid-points connected by STRAIGHT line
FREQUENCY CURVE : dots against the mid-points connected by SMOOTH/FREE HAND line
OGIVE / cumulative frequency curve (value v/s cumulative frequency)
 Less-than ogive : plotted against the upper boundary of each class interval
 More-than ogive : plotted against the lower boundary of each class interval
 The two types of ogive intersect at the MEDIAN
 PARTITION values (median, quartiles, deciles, percentiles) can be found GRAPHICALLY
PICTOGRAM : non-dimensional, less accurate, meant for the LAYMAN (non-expert), data shown as COUNTS of PICTURES
BOX PLOT : compares multiple groups of continuous data, handles SKEWED data well, identifies OUTLIERS
FREQUENCY DISTRIBUTION
 Frequency of a variable is always an INTEGER
 A frequency distribution can be either CONTINUOUS or DISCRETE
 Individual series : DISCRETE series in which each variate has frequency 1
 Open-end distribution : first and/or last class has an UNCERTAIN (unspecified) limit
 Simple frequency distribution : all distinct values with their frequencies
 Grouped frequency distribution : all values placed in CLASSES with their FREQUENCIES
 Continuous variable : any value in a range ; Discrete variable : only isolated values (usually INTEGERS) ; VARIATE : a single observation
TABLE :
 Simple table : one factor/variable ; Complex table : 2 or more
 Row headings (first column) : STUB ; Column headings (first row) : CAPTION
CENTRAL TENDENCY
ARITHMETIC MEAN :
Most common, BEST, rigidly defined, based on all observations
Not based on position, usable even when some data are lacking, least affected by sampling fluctuations
Can't be calculated for qualitative data or open-end classes ; MOST affected by extreme values
MEDIAN :
Middle-most value ; suits QUALITATIVE data that can be ranked (example : intelligence, ability)
Not affected by extreme values ; positional average ; works for open-end series and when some data are lacking
With an even number of items or a continuous series, the result may be a value that is not in the series
A slight change in the series can change it drastically ; algebraic use limited to MEAN DEVIATION ; not based on all observations
MODE :
Most frequent value (point of maximum CONCENTRATION) ; usable for qualitative data (but less so than the median) ; positional measure
Not affected by extreme values ; among a large number of values, mode = observation with maximum frequency
Example : shoe/garment size , meteorological forecasting
HARMONIC MEAN :
Reciprocal of the A.M. of the reciprocals of the values (example : average speed, distance, rate)
Rigidly defined, based on all observations, amenable to further algebraic treatment
Most suitable for HIGHLY VARIABLE series and when greater weight is to be given to smaller observations
Average speed at speeds A and B : for the same distance = 2AB/(A+B) (H.M.) ; for the same time = (A+B)/2 (A.M.)
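Worked sketch of the two average-speed rules above. The speeds 40 and 60 are made-up illustration values and the function names are hypothetical; `statistics.harmonic_mean` is used only to show that the same-distance rule is the H.M.

```python
from statistics import harmonic_mean


def average_speed_same_distance(a, b):
    """Average speed when equal DISTANCES are covered at speeds a and b."""
    return 2 * a * b / (a + b)


def average_speed_same_time(a, b):
    """Average speed when equal TIMES are spent at speeds a and b."""
    return (a + b) / 2


a, b = 40.0, 60.0  # hypothetical speeds, e.g. km/h
same_distance = average_speed_same_distance(a, b)  # 2*40*60/100 = 48
same_time = average_speed_same_time(a, b)          # (40+60)/2 = 50
hm = harmonic_mean([a, b])                         # agrees with same_distance

print(same_distance, same_time, hm)
```

The same-distance answer is lower because more time is spent travelling at the slower speed.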
GEOMETRIC MEAN :
best when data is RATIO or PERCENTAGE ; Example : Bacterial growth , cell division
MISCELLANEOUS:
Adding, subtracting, multiplying or dividing every value of the series by a constant changes the mean in the same way
Quadratic mean : usable when values are negative ; QM >= AM
Most UNSTABLE is Geometric Mean
Normally (positive values) : AM >= GM >= HM ; when ALL OBSERVATIONS are the SAME : AM = GM = HM
Median = middle value = 50th percentile = 2nd quartile = 5th decile
Symmetrical distribution : Mean = Median = Mode
Moderately skewed distribution (empirical relation) : Mean - Mode = 3 (Mean - Median)
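Quick numerical check of the ordering of the means noted under MISCELLANEOUS (QM >= AM >= GM >= HM for positive values, equality when all values are the same). The data are made-up and `quadratic_mean` is a hypothetical helper; the other means come from Python's `statistics` module.

```python
from math import sqrt
from statistics import geometric_mean, harmonic_mean, mean


def quadratic_mean(values):
    """Square root of the mean of the squared values (root mean square)."""
    return sqrt(sum(v * v for v in values) / len(values))


values = [2.0, 4.0, 8.0]           # hypothetical positive observations
am = mean(values)                  # about 4.67
gm = geometric_mean(values)        # about 4.00
hm = harmonic_mean(values)         # about 3.43
qm = quadratic_mean(values)        # about 5.29
print(qm >= am >= gm >= hm)        # True

same = [5.0, 5.0, 5.0]             # all observations equal
am_same = mean(same)
gm_same = geometric_mean(same)
hm_same = harmonic_mean(same)
```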
DISPERSION
dispersion : scatteredness or variation of observations about their average
RANGE :
Used in quality control, weather forecasts, share price analysis
STANDARD DEVIATION :
Positive square root of the arithmetic mean of the squares of the deviations of the given observations from their arithmetic mean
Basis for measuring the COEFFICIENT OF CORRELATION and for sampling theory
Shares the merits of the MEAN ; capable of further algebraic treatment
Has the same UNIT as the original data ; being absolute, can't be used to COMPARE series with different units or means
VARIANCE :
Variance = (SD)^2 ; if all values are the same, variance = 0
Mean of the squared deviations from the mean ; unit is the SQUARE of the original unit
COEFFICIENT OF VARIATION :
C.V = (SD/Mean)x100 , a RELATIVE measure of dispersion
More C.V. = more variable, less stable, less homogeneous.
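Minimal sketch of SD, variance and C.V. as defined above, using the population form (divide by N) because the notes define SD as the mean of squared deviations. The two yield series are invented for illustration and `coefficient_of_variation` is a hypothetical helper.

```python
from statistics import mean, pstdev, pvariance


def coefficient_of_variation(values):
    """C.V. = (SD / Mean) x 100, a unitless relative measure."""
    return pstdev(values) / mean(values) * 100


plot_a = [48.0, 50.0, 52.0]   # hypothetical yields, tightly clustered
plot_b = [30.0, 50.0, 70.0]   # same mean, far more scattered

sd_a = pstdev(plot_a)
var_a = pvariance(plot_a)
cv_a = coefficient_of_variation(plot_a)
cv_b = coefficient_of_variation(plot_b)
var_constant = pvariance([5.0, 5.0, 5.0])  # all values equal

print(cv_a < cv_b)       # True: plot_a is more stable and homogeneous
print(var_constant)      # 0.0
```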
MEAN DEVIATION :
MD is minimum when taken about the MEDIAN ; based on all observations
Sum of squares of deviations is minimum when taken about the MEAN
MD ignores the signs of the deviations from the measure of central tendency
QUARTILE DEVIATION :
QD = (Q3-Q1)/2 ; positional ; Coefficient of QD = (Q3-Q1)/(Q3+Q1) ; the ONLY measure of dispersion that can be calculated for OPEN-END distributions
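Sketch of the QD formulas above. It assumes the default quartile convention of `statistics.quantiles` (the "exclusive" method); other textbooks and packages may place Q1 and Q3 slightly differently. The data and the helper name `quartile_deviation` are hypothetical.

```python
from statistics import quantiles


def quartile_deviation(values):
    """Return (QD, coefficient of QD) from the first and third quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    return (q3 - q1) / 2, (q3 - q1) / (q3 + q1)


qd, coeff_qd = quartile_deviation([1, 2, 3, 4, 5, 6, 7])
print(qd, coeff_qd)  # 2.0 0.5  (Q1 = 2, Q3 = 6 under this convention)
```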
SKEWNESS :
Lack of symmetry in the tails of the FD (Frequency Distribution) curve
NEGATIVE : μ3 < 0 ; LEFT tail more elongated ; Mean < Median < Mode ; mean lies to the LEFT of the mode
POSITIVE : μ3 > 0 ; RIGHT tail more elongated ; Mean > Median > Mode ; mean lies to the RIGHT of the mode
Karl Pearson's coefficient of skewness = (Mean - Mode) / SD
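Illustrative check of the sign rules above on a small invented data set. `pearson_skewness` and `third_central_moment` are hypothetical helper names; population SD is assumed.

```python
from statistics import mean, mode, pstdev


def pearson_skewness(values):
    """Karl Pearson's coefficient: (Mean - Mode) / SD."""
    return (mean(values) - mode(values)) / pstdev(values)


def third_central_moment(values):
    """mu3: mean of the cubed deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 3 for v in values) / len(values)


right_tailed = [1, 2, 2, 2, 3, 4, 9]        # one large value stretches the right tail
left_tailed = [-v for v in right_tailed]    # mirror image: long left tail

sk_right = pearson_skewness(right_tailed)
sk_left = pearson_skewness(left_tailed)
mu3_right = third_central_moment(right_tailed)
mu3_left = third_central_moment(left_tailed)
print(sk_right > 0, mu3_right > 0)  # True True  (positive skew)
print(sk_left < 0, mu3_left < 0)    # True True  (negative skew)
```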
MISCELLANEOUS
For a normal distribution (approx.) : 4 SD = 5 MD = 6 QD = 2/3 Range
How to calculate SD
BEST/most reliable : SD ; Worst : QD ; Unitless : CV
Effect of EXTREME values : most on Range, SD ; least on QD, MD
All are absolute measures, but CV is RELATIVE
All absolute measures change with scale but not with origin (CV is unaltered by a change of scale)
PROBABILITY
De Morgan's law : A' ∪ B' = (A ∩ B)' ; Binomial (BD) and Poisson (PD) = discrete (PMF) ; Normal (ND) = continuous (PDF)
BINOMIAL DISTRIBUTION :
Two outcomes, success or failure ; p + q = 1 and P(X = x) = C(n, x) p^x q^(n-x)
Mean (μ1') = np ; variance (μ2) = npq ; skewness, third central moment (μ3) = npq(q-p) ; kurtosis, fourth central moment (μ4) = npq[1 + 3pq(n-2)]
p < ½ : +ve skewed ; p > ½ : -ve skewed ; p = ½ : SYMMETRIC
Mean > Variance ; n = 1 reduces to Bernoulli ; n → ∞ (with p → 0 and np fixed) tends to Poisson
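Numerical check of the binomial formulas above: the moments are computed directly from the PMF and compared with np, npq, npq(q-p) and npq[1 + 3pq(n-2)]. The values n = 10 and p = 0.3 are arbitrary illustration choices and the helper names are hypothetical.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def central_moment(pmf, centre, order):
    """Sum of (x - centre)^order * P(X = x) over x = 0..n."""
    return sum((x - centre) ** order * px for x, px in enumerate(pmf))


n, p = 10, 0.3
q = 1 - p
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

mean_x = sum(x * px for x, px in enumerate(pmf))
mu2 = central_moment(pmf, mean_x, 2)
mu3 = central_moment(pmf, mean_x, 3)
mu4 = central_moment(pmf, mean_x, 4)

print(round(mean_x, 4), round(mu2, 4), round(mu3, 4), round(mu4, 4))
# 3.0 2.1 0.84 12.684 ; mu3 > 0 because p < 1/2 (positively skewed)
```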
POISSON DISTRIBUTION
Parameter λ (lambda) of the PD = Mean = Variance = third central moment (μ3) ; always > 0
Fourth central moment (μ4) = 3λ^2 + λ ; examples : rare events such as deaths, defects, miscalls (wrong calls)
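Sketch of two Poisson facts from these notes: mean = variance = λ, and a binomial with large n and small p (np = λ) is close to the Poisson. λ = 2 and n = 1000 are arbitrary illustration values; the infinite Poisson support is truncated at 60 terms, where the remaining probability is negligible.

```python
from math import comb, exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x!"""
    return exp(-lam) * lam ** x / factorial(x)


def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


lam = 2.0
support = range(60)  # truncated support; the tail beyond this is negligible
poisson_mean = sum(x * poisson_pmf(x, lam) for x in support)
poisson_var = sum((x - poisson_mean) ** 2 * poisson_pmf(x, lam) for x in support)

n = 1000  # large n, small p = lam / n
max_gap = max(
    abs(binomial_pmf(x, n, lam / n) - poisson_pmf(x, lam)) for x in range(11)
)
print(round(poisson_mean, 6), round(poisson_var, 6), max_gap < 1e-3)
```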
NORMAL DISTRIBUTION :
De Moivre ; BELL-shaped ; total area under the curve = 1 ; symmetric about the mean
Mean = Median = Mode ; μ3 = 0 ; β2 = 3 (μ4 = 3σ^4) ; Range : -∞ to +∞
Effective range ≈ 6σ ; MD ≈ 4/5 σ ; QD ≈ 2/3 σ
NORMAL CURVE
68% of data lies within ±1σ of the mean.
95% of data lies within ±2σ of the mean.
99.7% of data lies within ±3σ of the mean.
inflection point : changes its curvature : x = μ ± σ
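The 68 / 95 / 99.7 figures above can be reproduced from the normal CDF. This sketch uses `statistics.NormalDist` from the Python standard library; `coverage_within` is a hypothetical helper name.

```python
from statistics import NormalDist


def coverage_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal variate lies within mu ± k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(k, round(coverage_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```

The result does not depend on μ or σ, which is why the rule holds for every normal curve.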
TEST OF HYPOTHESIS
Null Hypothesis : H0 : no difference : RA Fisher
Alternative : H1 ; H1 : µ1 < µ2 = left-tailed ; H1 : µ1 > µ2 = right-tailed
Type I error : alpha (α) : rejecting H0 when it is true
Type II error : beta (β) : accepting H0 when it is false
DF : total number of observations - number of constraints = N-K
LOS (Level of significance) : maximum probability of Type I error (usually 5% or 1%)
Critical value : decides whether to accept or reject the Null Hypothesis
One-tailed test : critical region falls at one end (H1 : µ1 > µ2 or µ1 < µ2)
Two-tailed test : critical region falls at both ends (H1 : µ1 ≠ µ2)
Large sample, n ≥ 30 : Z test ; Small sample, n < 30 : t, F, Chi-square
Critical region : its size depends on the Type I error (α)
TEST OF SIGNIFICANCE
T TEST
Sample < 30 ; by Gosset ("Student") ; Paired and Unpaired
Tests the significance of a correlation coefficient or a regression coefficient
CHI SQUARE TEST
Sample > 50 ; non-parametric ; Helmert & Pearson ; (ex : genetic problems)
ANOVA / F TEST
Treatment df = t - 1 (t = number of treatments) ; Treatment = BETWEEN groups ; Error = WITHIN groups
If F ≈ 1: Variance between groups ≈ variance within groups ⇒ no difference b/w treatments.
If F >> 1: b/w groups > w/w groups ⇒ at least one treatment mean is significantly different.
Larger F-values typically suggest stronger evidence against the null hypothesis.
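A minimal one-way ANOVA sketch showing how F is formed as (between-group mean square) / (within-group mean square). The three treatment groups are invented numbers and `one_way_anova_f` is a hypothetical helper; it returns only F and the degrees of freedom, not a p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of treatment groups."""
    k = len(groups)                                  # number of treatments
    n = sum(len(g) for g in groups)                  # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical yields under three treatments with clearly different means
treatments = [[10, 12, 14], [20, 22, 24], [30, 32, 34]]
f_value, df_between, df_within = one_way_anova_f(treatments)
print(f_value, df_between, df_within)  # 75.0 2 6
```

Here F is far above 1, which points to at least one treatment mean differing from the others; the formal decision still needs the tabulated F for (2, 6) df.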
Z TEST :
Asymptotic (large-sample) test ; n > 30 ; RA Fisher ; (ex-tea drinker)
|Z cal| < Z tab : we accept H0
Two tailed 5% 1.96 , 1% 2.58 ; One tailed 5% 1.65, 1% 2.33
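The tabulated Z values above can be recovered from the inverse normal CDF. A sketch using `statistics.NormalDist`; `z_critical` and `two_tailed_decision` are hypothetical helper names. The one-tailed 5% value computes to about 1.645, which the notes round to 1.65.

```python
from statistics import NormalDist


def z_critical(alpha, two_tailed=True):
    """Tabulated Z for a given level of significance."""
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


def two_tailed_decision(z_cal, alpha=0.05):
    """Compare |Z cal| with Z tab, as in the rule stated above."""
    if abs(z_cal) < z_critical(alpha, two_tailed=True):
        return "accept H0"
    return "reject H0"


print(round(z_critical(0.05), 2), round(z_critical(0.01), 2))                 # 1.96 2.58
print(round(z_critical(0.05, False), 3), round(z_critical(0.01, False), 2))  # 1.645 2.33
print(two_tailed_decision(1.5), "|", two_tailed_decision(2.1))
```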
Z SCORE & FISHER Z :
P VALUE : P value < 0.05 or <5% = reject Null Hypothesis
Z-test when population SD is known; otherwise t-test.
Chi-square for categorical data ; ANOVA for comparing more than 2 means
ERROR
STANDARD ERROR
SE = SD / √N
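Small worked example of the standard error formula: with the SD held fixed, quadrupling the sample size halves the SE. The SD of 10 and the sample sizes are made-up values; `standard_error` is a hypothetical helper.

```python
from math import sqrt


def standard_error(sd, n):
    """SE of the mean = SD / sqrt(N)."""
    return sd / sqrt(n)


se_25 = standard_error(10.0, 25)    # 10 / 5  = 2.0
se_100 = standard_error(10.0, 100)  # 10 / 10 = 1.0
print(se_25, se_100)  # 2.0 1.0
```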
SAMPLING ERROR
Sampling error = Estimate - Parameter = Sample statistic - Population parameter
Sampling Error : Due to random sampling variability
Non-Sampling Error : Due to bias, measurement, data entry, etc.
EXPERIMENTAL DESIGN
for TOS (Test of significance) – RA Fisher
CRD (COMPLETELY RANDOMIZED DESIGN)
One-way classification ; no control (elimination) of any source of variation
When material is LIMITED and HOMOGENEOUS (ex : soil and pot experiments)
1. Replication (independent)
2. Randomization (used)
3. Local control (not used, because CRD works on HOMOGENEOUS material only)
EDF (Error Degrees of Freedom) : t(r-1), maximum among all designs
FG (Fertility gradient) : zero (as the material is homogeneous)
RBD (RANDOMIZED BLOCK DESIGN)
Two-way classification ; one-way control
Uses all 3 principles
FG = 1 (one direction) ; EDF = (r-1)(t-1)
Max treatments : < 21 (optimum 5-12)
More accurate than CRD ; MOST widely used
LSD (LATIN SQUARE DESIGN) :
For 5-12 treatments ; square shape ; Rows = Columns = Treatments = Replications
It is an INCOMPLETE three-way layout (t^3 combinations are possible but only t^2 are used)
FG = 2 (two directions) ; EDF = (t-1)(t-2) or (r-1)(r-2) or (t-1)(r-2) or (c-1)(c-2)
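The error degrees of freedom of the three designs above, compared for the same number of plots. A sketch with hypothetical function names; t = r = 5 is chosen so that all three designs use 25 plots (total df = 24).

```python
def error_df_crd(t, r):
    """CRD: total df (tr - 1) minus treatment df (t - 1) = t(r - 1)."""
    return t * (r - 1)


def error_df_rbd(t, r):
    """RBD: also removes block df (r - 1), leaving (r - 1)(t - 1)."""
    return (r - 1) * (t - 1)


def error_df_lsd(t):
    """LSD: removes row, column and treatment df, leaving (t - 1)(t - 2)."""
    return (t - 1) * (t - 2)


t = r = 5
edf = (error_df_crd(t, r), error_df_rbd(t, r), error_df_lsd(t))
print(edf)  # (20, 16, 12) : CRD keeps the most error df
```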
SPD (SPLIT PLOT DESIGN) :
2 treatment factors : Main plot (larger plots : manure, DOS, ploughing) and Sub plot (smaller plots : fertilizer, variety) ; 2 error terms
SrPD (Strip Plot Design) : both factors are MAIN ; 3 error terms
CORRELATION REGRESSION :
CORRELATION :
Two-way (mutual) relationship ; variables are interdependent (one affects the other) ; Value : -1 to +1 ; ex : Demand & Price
Type : +ve (move in the same direction), -ve (move inversely), zero (no linear effect)
Measurement : scatter diagram (most used), Karl Pearson's coefficient, Spearman's rank correlation
REGRESSION :
Average relationship b/w variables in terms of the original units of the data (regression = stepping back towards the average)
By Francis Galton ; one-way ; Range of regression coefficient : -∞ to +∞ ; variables are dependent and independent
Regression coefficients are independent of change of Origin but dependent on Scale ; AM of the two regression coefficients >= correlation coefficient
y = ax + b (a = regression coefficient or slope, b = intercept)
CORRELATION COEFFICIENT (PEARSON R)
Range : −1≤r≤1 , Unitless , r=1: perfect positive linear relationship , r=0: no linear correlation
T test for r
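Sketch tying together Pearson's r, the regression line y = ax + b and the t test for r. The data are invented; the helper names are hypothetical; and t = r√(n-2)/√(1-r²) with n-2 df is the usual textbook form, stated here as an assumption since the notes do not give the formula.

```python
from math import sqrt
from statistics import mean


def sums_of_products(x, y):
    """Return (Sxy, Sxx, Syy): sums of products of deviations from the means."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy, sxx, syy


x = [1, 2, 3, 4, 5]   # hypothetical independent variable
y = [2, 4, 5, 4, 5]   # hypothetical dependent variable
sxy, sxx, syy = sums_of_products(x, y)

r = sxy / sqrt(sxx * syy)             # Pearson's correlation coefficient
a = sxy / sxx                         # regression coefficient of y on x (slope)
b = mean(y) - a * mean(x)             # intercept
b_xy = sxy / syy                      # regression coefficient of x on y
t_cal = r * sqrt(len(x) - 2) / sqrt(1 - r * r)   # t for r, with n - 2 df

print(round(r, 4), round(a, 2), round(b, 2), round(t_cal, 4))
# 0.7746 0.6 2.2 2.1213
```

Here the AM of the two regression coefficients, (0.6 + 1.0) / 2 = 0.8, is not less than r, in line with the property listed under REGRESSION.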
SAMPLING
PROBABILITY METHOD
Simple Random Sampling (SRS) : every unit has an equal chance, like a lottery draw
Systematic Sampling : select every kth item (e.g., every 10th student)
Stratified Sampling : divide the population into groups (strata), then randomly sample from each group
Cluster Sampling : divide into clusters (e.g., villages), randomly select whole clusters, not individuals
Multistage Sampling : combine methods, e.g., pick districts (cluster), then schools (SRS) within them
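A rough sketch of four of the probability methods above using Python's `random` module. The population, strata and clusters are invented, the function names are hypothetical, and a fixed seed is used only so that the run is repeatable.

```python
import random


def simple_random_sample(population, n, rng):
    """SRS without replacement: every unit has an equal chance."""
    return rng.sample(population, n)


def systematic_sample(population, k, rng):
    """Random start within the first k units, then every kth unit."""
    start = rng.randrange(k)
    return population[start::k]


def stratified_sample(strata, n_per_stratum, rng):
    """Independent SRS inside every stratum."""
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}


def cluster_sample(clusters, n_clusters, rng):
    """Select whole clusters at random and keep all of their units."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]


rng = random.Random(42)             # fixed seed, for repeatability only
population = list(range(1, 101))    # hypothetical units numbered 1..100

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(
    {"small farms": list(range(1, 61)), "large farms": list(range(61, 101))}, 5, rng
)
cluster_units = cluster_sample(
    {"village A": [1, 2, 3], "village B": [4, 5], "village C": [6, 7, 8, 9]}, 2, rng
)
print(len(srs), len(systematic), {k: len(v) for k, v in stratified.items()})
```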
NON-PROBABILITY METHOD
Convenience Sampling : choose whoever is easy to reach (e.g., asking friends)
Judgmental (Purposive) Sampling : the investigator chooses the units judged to be most suitable
Quota Sampling : set a quota per group (e.g., 50 men, 50 women), but choose non-randomly
Snowball Sampling : for hard-to-find groups (e.g., drug users), ask each participant to refer others
Census : all units ; Sample survey : selected units
Finite population : SWR (Sampling with replacement) ; Infinite population : SWOR

Agricultural statistics - Statistical science JRF note by Subham Mandal (part 1).pdf
