#!/usr/bin/env python
# coding: utf-8

# In[25]:

import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
from scipy import stats

plt.style.use("fivethirtyeight")
import pandas as pd

# # Descriptive Statistics

# Before making any inferences from a dataset, it is crucial that we understand its features by 'describing' it. We can describe a dataset either by plotting it or by calculating numerical summaries.
#
# We will refresh the most common descriptive statistics in this chapter; if you work with data on a daily basis, none of the concepts here should sound strange to you.
#
# Furthermore, we will not engage in complex programming techniques, such as OOP or sophisticated data structures. All the programmes should be self-explanatory to most audiences.

# ## Frequency Distribution/Histogram

# Strictly speaking, a frequency distribution and a histogram are different descriptive tools, though they deliver largely the same information. A frequency distribution is usually presented in a table; for instance, rolling a dice $1000$ times might give us a frequency distribution table such as the following:
# **Rolling A Dice 1000 Times**
#
# | Sides | Frequency |
# |-------|-----------|
# | 1     | 172       |
# | 2     | 158       |
# | 3     | 170       |
# | 4     | 158       |
# | 5     | 187       |
# | 6     | 155       |
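# A frequency table like the one above can also be tabulated directly from simulated rolls. Here is a minimal sketch (the variable names are illustrative and not part of the original notes), using ```np.unique``` to count how often each side appears:

# In[ ]:

# simulate 1000 rolls of a fair six-sided dice; the counts will vary from run to run
rolls = np.random.randint(1, 7, 1000)

# tabulate the frequency of each side
sides, counts = np.unique(rolls, return_counts=True)
freq_table = pd.DataFrame({"Side": sides, "Frequency": counts})
print(freq_table)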
# Or we can draw a frequency histogram, which simply converts the information in the table into a graph.

# In[2]:

rollings = np.random.randint(1, 7, 1000)
fig, ax = plt.subplots(figsize=(9, 9))
n, bins, patches = ax.hist(rollings, bins=6)
ax.set_title("Frequency Histogram of 1000 Times of Rolling a Dice", size=19)
ax.set_xlim(0, 7)
ax.set_ylim(0, 400)
plt.show()

# Next we generate an array from a standard normal distribution, $x\sim N(0, 1)$, then plot the histogram. Note that we add the parameter ```density=True``` to the ```hist``` method; this turns the plot into a **relative frequency histogram**, i.e. the $y$-axis is rescaled so that the total area of the bars equals one.

# In[3]:

x = np.random.randn(1000)
fig, ax = plt.subplots(figsize=(9, 9))
n, bins, patches = ax.hist(x, bins=50, density=True)

# Careful readers might have noticed that while plotting the histogram we also receive three return values: ```n```, ```bins``` and ```patches```. ```n``` holds the (relative) frequency counts, ```bins``` holds the bin edges, and ```patches``` is the list of patch objects drawn in the plot.
#
# The first chart below shows the same information as the histogram, but as a line plot rather than bars. The second one is the **empirical cumulative distribution**; it is empirical because it is built from the data rather than from a theoretical **cumulative distribution function (CDF)**.

# In[4]:

fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(9, 9))
ax[0].plot(bins[:50], n)
ax[1].plot(np.cumsum(n))
plt.show()

# # Basic Numeric Descriptives

# ## Measures of Location

# The most important **mean** we need is the **arithmetic mean**, which is prevalent in all kinds of statistical techniques.
# $$
# \mu = \frac{1}{N}\sum_{i=1}^Nx_i\\
# \bar{x} = \frac{1}{n}\sum_{i=1}^nx_i
# $$
# The former is the _population mean_, the latter the _sample mean_. The formulae look the same, but the notation differs: $N$ is the population size (imagine the number of all human beings on earth), whereas $n$ is the sample size, for instance a sample of $1000$ persons from the UK.
#
# One trickier mean, commonly used in finance, is the **geometric mean**, which is not so intuitive at first sight.
# $$
# g = \bigg(\prod_{i=1}^nx_i\bigg)^{1/n}
# $$
# If you can't immediately make sense of it, accept the fact that the geometric mean is commonly used when calculating _compound growth rates_, such as portfolio returns. For instance, a portfolio manager has annual returns recorded as below.
# **Portfolio Return**
#
# | Year            | Return |
# |-----------------|--------|
# | 2015            | 36%    |
# | 2016            | 23%    |
# | 2017            | -48%   |
# | 2018            | -30%   |
# | 2019            | 15%    |
# | 2020            | 31%    |
# | Arithmetic Mean | 4.5%   |
# | Geometric Mean  | -1.4%  |
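# Before computing these means with NumPy and SciPy below, a minimal sketch (reusing the returns from the table above; not part of the original notes) shows why the geometric mean is the relevant average here: compounding the geometric mean over the six years reproduces the portfolio's actual cumulative growth, while compounding the arithmetic mean overstates it.

# In[ ]:

returns = np.array([0.36, 0.23, -0.48, -0.30, 0.15, 0.31])

arith = returns.mean()
geom = np.prod(1 + returns) ** (1 / len(returns)) - 1

# the portfolio's actual cumulative growth factor over the six years
print("Actual growth factor over 6 years: {:.4f}".format(np.prod(1 + returns)))
print("Compounding the geometric mean:    {:.4f}".format((1 + geom) ** len(returns)))
print("Compounding the arithmetic mean:   {:.4f}".format((1 + arith) ** len(returns)))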
# The arithmetic mean of the returns is $4.5\%$.

# In[5]:

portfolio_return = np.array([0.36, 0.23, -0.48, -0.3, 0.15, 0.31])
np.mean(portfolio_return)

# However, the geometric mean of the returns is about $-1.4\%$. The geometric mean, computed with ```sp.stats.mstats.gmean```, is the more accurate measurement when there is a compounding effect, i.e. when the data are linked to each other by a growth rate.

# In[6]:

sp.stats.mstats.gmean(portfolio_return + 1) - 1

# ## Measures Of Variability

# The first measure of variability is the **range**, the distance from the lowest to the highest value. We'll generate an array from the standard normal distribution for demonstration.

# In[7]:

x = np.random.randn(50)

# In[8]:

rangeLS = x.max() - x.min()
rangeLS

# A **percentile** is another common statistical concept, best explained by an example. The GRE test result, for instance, reports a percentile besides your absolute score; if you are at the $96$th percentile, your score is higher than that of $96\%$ of candidates. The special percentiles of $75\%$ and $25\%$ are called the _third quartile_ and _first quartile_, and their difference is the **interquartile range (IQR)**.

# In[9]:

q75, q25 = np.percentile(x, [75, 25])

# IQR
q75 - q25

# Before moving any further, we must clarify two statistical concepts: **population** and **sample**. Suppose we want to know the variance of all human adults' heights. We would have to measure every adult on earth (the population) to calculate it exactly, which is an impossible mission; instead we measure a smaller group of adults (a sample) and make inferential statements about the population.
#
# The population and sample variances differ in their degrees of freedom, where $N$ is the population size, $n$ is the sample size, $\mu$ is the population mean and $\bar{x}$ is the sample mean. The formulae are
#
# $$
# \sigma^2 = \frac{\sum(x_i - \mu)^2}{N}\\
# s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}
# $$
#
# The latter, the _sample variance_, is an unbiased estimator of the population variance.
#
# To illustrate the idea, let's pretend there are only $N = 1000$ people on earth and generate an array to represent the population heights, purely for demonstration. We generate the population by setting ```loc=170``` and ```scale=10``` in ```np.random.normal```, i.e. $X\sim N(170, 10)$, then calculate the population variance with ```np.var()```.

# In[10]:

population_height = np.random.normal(170, 10, 1000)
np.var(population_height)

# Now suppose we know nothing about the population, but we can draw a sample of $100$ persons. By setting ```ddof=1``` in the function, we specify that the degrees of freedom are $n-1$, as used for sample estimators.

# In[11]:

sample_height = np.random.choice(population_height, size=100)
np.var(sample_height, ddof=1)

# Theoretically, we could draw a tremendous number of samples, say $10000$ of them. Yes, $10000$ samples, not a sample size of $10000$. This is just a thought experiment that would never be carried out in the real world; again, purely for demonstration.
#
# What we do next: generate $10000$ samples, calculate each sample variance, and plot the histogram. The vertical line marks the mean of the sampling distribution of variance estimates. Since we set the standard deviation to $\sigma = 10$, the theoretical variance is $\sigma^2=100$, so the point estimator is doing a fairly good job.
# In[12]:

sample_height_array = []
for i in range(10000):
    sample_height = np.random.choice(population_height, size=100)
    sample_height_array.append(np.var(sample_height, ddof=1))

fig, ax = plt.subplots(figsize=(9, 9))
n, bins, patches = ax.hist(sample_height_array, bins=50)
ax.axvline(x=np.mean(sample_height_array), color="tomato")
ax.text(
    np.mean(sample_height_array) + 1,
    np.max(n),
    r"$\mu_\sigma = {:.2f}$".format(np.mean(sample_height_array)),
    size=16,
)
ax.set_title("Sampling Distribution of Variance Estimates", size=19)
plt.show()

# You may have noticed the power of estimators now: if you have many samples, an unbiased estimator gives you an accurate estimate of the population parameter. Even if you have only one sample, you can still assess how likely you are to be correct based on the sampling distribution.

# Keep in mind that the **standard deviation** is the most popular measure of variability. The population/sample standard deviation is simply the square root of the corresponding variance.
# $$
# \sigma = \sqrt{\sigma^2}\\
# s = \sqrt{s^2}
# $$
#
# It is computed with a NumPy function similar to the variance.

# In[13]:

np.std(sample_height, ddof=1)  # sample standard deviation of the heights

# ## Measures of Distribution Shape

# The **z-score**, defined below, measures how many standard deviations an observation lies away from the mean.
#
# \begin{equation}
# z_i = \frac{x_i-\bar{x}}{s}
# \end{equation}
#
# Note that $z$ carries the subscript $i$, which means each observation has its own $z$-score.
# Actually, measuring how many standard deviations an observation lies away from the mean (or from a hypothesised value) is the fundamental philosophy of frequentist statistics. It asks one important question: how far away from the mean is far away? If it is far enough, it is very likely that the mean we are looking at right now is not the 'real' mean of the random mechanism that generated the observations.
# Here is an example of calculating the $z$-score for a randomly generated array. You can see that for a standard normal distribution it is fairly hard to stray more than $2$ standard deviations away from the mean.

# In[14]:

x = np.random.randn(10)
z = (x - np.mean(x)) / np.std(x)
np.round(z, 2)

# ### Chebyshev's Theorem

# **Chebyshev's Theorem** is the last resort of deduction when we have absolutely _no knowledge_ of a sample and its distribution. It guarantees the minimum proportion of data that must lie within $z$ standard deviations of the mean, where $z$ is the $z$-score. The downside of the theorem is that it only addresses ranges symmetric around the mean, and the bound is usually very conservative.
#
# \begin{equation}
# p \geq 1-\frac{1}{z^2}
# \end{equation}
#
# For instance, suppose the average height of people in Helsinki is $174cm$, with a standard deviation of $4cm$. The question is: what percentage of people are between $166cm$ and $182cm$? Note that this range reaches symmetrically $2$ standard deviations from the mean. Thus, according to Chebyshev, we know that

# In[15]:

z_l = (166 - 174) / 4  # lower z-score
z_u = (182 - 174) / 4  # upper z-score

# Because it is a symmetric range, the $z$-scores on both sides are equal in magnitude, so we can calculate the bound as below and print the conclusion.

# In[16]:

p = 1 - 1 / z_l**2
print("At least {0}% of people are within 166cm and 182cm in Helsinki.".format(p * 100))

# If you think of Chebyshev's bound as a function, then its only variable is the $z$-score. Now we define a Chebyshev function and plot it against the $z$-score.
#
# Because the $z$-score means how many standard deviations away from the mean, we can conclude from the graph that no matter what type of distribution we are investigating, a range of $\pm 2.5$ standard deviations is guaranteed to cover at least $84\%$ of the data.

# In[17]:

def chebyshev(z):
    return 1 - 1 / z**2


chebyshev_array = []
for z in np.arange(1, 21, 0.5):
    chebyshev_array.append(chebyshev(z))

fig, ax = plt.subplots(figsize=(9, 9))
ax.plot(np.arange(1, 21, 0.5), chebyshev_array)
ax.scatter(2.5, chebyshev(2.5), s=100, color="red", zorder=3)
ax.text(2.5 + 0.5, chebyshev(2.5), r"(2.5, {}%)".format(chebyshev(2.5) * 100))
ax.set_title("Chebyshev's Theorem")
ax.set_xlabel("z-score")
ax.set_ylabel("Probability")
plt.show()

# However, in practice we are more likely to deal with data that follow a (roughly) bell-shaped distribution; they are regular and easier to reason about, so the **Empirical Rule** applies:
#
# * 68% of the data lie within 1 standard deviation of the mean
# * 95% of the data lie within 2 standard deviations of the mean
# * 99.7% of the data lie within 3 standard deviations of the mean
#
# These empirical numbers come from the normal distribution.

# ## Measures of Association Between Two Variables

# With similar notation, the **population** and **sample covariance** are defined as:
# $$
# \sigma_{xy} = \frac{\sum(x_i-\mu_x)(y_i-\mu_y)}{N}\\
# s_{xy} = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{n-1}
# $$
#
# Generate two random arrays, then evaluate the covariance matrix. The values on the diagonal are the variances, and the off-diagonal entries are the covariances.

# In[18]:

x = np.random.randn(300)
y = np.random.rand(300)
np.cov(x, y)

# The **correlation coefficient** is a normalised version of the covariance, just as the standard deviation rescales the variance back to the original units; some statistics textbooks call it the Pearson product-moment correlation coefficient.
#
# $$
# \rho = \frac{\sigma_{xy}}{\sigma_x\sigma_y}\\
# r_{xy}=\frac{s_{xy}}{s_xs_y}
# $$
# For the sake of everyone's sanity, and to avoid cluttering the lecture notes with repetitive code, some of it has been moved into the ```plot_material.py``` module. Take a look inside if you are curious.
#
# What we do below is generate eight linear regression plots of $Y = \beta_1+\beta_2X+u$ with different parameters. The correlation coefficient $\rho$ is displayed in each graph, so you can observe that the correlation coefficient is affected by the size of the parameter $\beta_2$ and the scale of the error term $u$.
#
# The reasons are:
# 1. The more significantly $\beta_2$ differs from $0$, the stronger the linear relationship in the model, and therefore the larger the correlation between $X$ and $Y$.
# 2. Holding the parameters $\beta_1$ and $\beta_2$ constant, the larger the scale of the error term, the more dispersed the data are, and therefore the lower the correlation.

# In[19]:

import plot_material

plot_material.reg_corr_plot()

# We will dive much deeper into linear regression in our econometrics training sessions.

# ### Different Types of Correlation Coefficient

# The correlation coefficient above is the **Pearson correlation**, which assumes a linear relationship between the two variables. However, there are other correlation coefficients that can also capture nonlinear (monotonic) relationships, such as the **Spearman correlation** and **Kendall's $\tau$**.

# In[51]:

X = np.linspace(-10, 10, 200)
Y = 1 / (1 + np.exp(-X))

df_dict = {"X": X, "Y": Y}
df = pd.DataFrame(df_dict)
df.plot(x="X", y="Y", kind="scatter", figsize=(16, 7))
plt.show()

# In[67]:

df.corr(method="pearson")

# In[62]:

print("Pearson coefficient: {}".format(sp.stats.pearsonr(df["X"], df["Y"])[0]))

# In[63]:

print("Spearman coefficient: {}".format(sp.stats.spearmanr(df["X"], df["Y"])[0]))

# In[64]:

print("Kendall's tau: {}".format(sp.stats.kendalltau(df["X"], df["Y"])[0]))

# Notice the difference: Spearman and Kendall produce a perfect correlation coefficient of $1$ while Pearson does not, because they are rank-based measures and only require a monotonic relationship.
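# To see why the rank-based measures come out as exactly $1$, here is a minimal sketch (reusing ```df``` from the cells above; not part of the original notes) that recomputes Spearman's $\rho$ by hand as Pearson's $r$ on the ranks of the data. Since the logistic transform is strictly monotonic, it does not change the ranks at all.

# In[ ]:

# Spearman's rho is Pearson's r computed on the ranks of the data; a strictly
# monotonic transform such as the logistic function leaves the ranks unchanged
rank_X = sp.stats.rankdata(df["X"])
rank_Y = sp.stats.rankdata(df["Y"])

print("Pearson on the raw data: {:.4f}".format(sp.stats.pearsonr(df["X"], df["Y"])[0]))
print("Pearson on the ranks:    {:.4f}".format(sp.stats.pearsonr(rank_X, rank_Y)[0]))
print("Spearman directly:       {:.4f}".format(sp.stats.spearmanr(df["X"], df["Y"])[0]))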