unpingco
diff --git a/‎chapters/machine_learning/notebooks/clustering.ipynb‎
Lines changed: 496 additions & 0 deletions b/‎chapters/machine_learning/notebooks/clustering.ipynb‎
diff --git a/‎chapters/machine_learning/notebooks/ensemble.ipynb‎
Lines changed: 385 additions & 0 deletions b/‎chapters/machine_learning/notebooks/ensemble.ipynb‎
@@ -0,0 +1,385 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<!-- new sections -->\n",
+    "<!-- Ensemble learning -->\n",
+    "<!-- - Machine Learning Flach, Ch.11 -->\n",
+    "<!-- - Machine Learning Mohri, pp.135- -->\n",
+    "<!-- - Data Mining Witten, Ch. 8 -->"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "image/png": "../../../python_for_probability_statistics_and_machine_learning.jpg",
+      "text/plain": [
+       "<IPython.core.display.Image object>"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from IPython.display import Image \n",
+    "Image('../../../python_for_probability_statistics_and_machine_learning.jpg')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "attributes": {
+     "classes": [],
+     "id": "",
+     "n": "1"
+    },
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "from pprint import pprint\n",
+    "import textwrap\n",
+    "import sys, re\n",
+    "def displ(x):\n",
+    "   if x is None: return\n",
+    "   print (\"\\n\".join(textwrap.wrap(repr(x).replace(' ',''),width=80)))\n",
+    "\n",
+    "sys.displayhook=displ"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "With the exception of the random forest, we have so far considered machine\n",
+    "learning models as stand-alone entities. Combinations of models that jointly\n",
+    "produce a classification are known as *ensembles*.  There are two main\n",
+    "methodologies that create ensembles: *bagging* and *boosting*.\n",
+    "\n",
+    "## Bagging\n",
+    "\n",
+    "Bagging refers to bootstrap aggregating, where bootstrap here is the same as we\n",
+    "discussed in the section [ch:stats:sec:boot](#ch:stats:sec:boot).  Basically,\n",
+    "we resample the data with replacement and then train a classifier on the newly\n",
+    "sampled data. Then, we combine the outputs of each of the individual\n",
+    "classifiers using a majority-voting scheme (for discrete outputs) or a weighted\n",
+    "average (for continuous outputs).  This combination is particularly effective\n",
+    "for models that are easily influenced by a single data element. The resampling\n",
+    "process means that such an element is unlikely to appear in every bootstrapped\n",
+    "training set, so some of the models will not suffer these effects. This\n",
+    "makes the so-computed combination of outputs less volatile. Thus, bagging\n",
+    "helps reduce the collective variance of individual high-variance models.\n",
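+    "\n",
+    "As a rough check on this claim, assume that each bootstrapped training set is\n",
+    "formed by $m$ independent, uniform draws with replacement from the $n$ training\n",
+    "elements (the symbol $m$ is introduced here only for illustration). Under that\n",
+    "assumption, the probability that a particular element is absent from a given\n",
+    "bootstrapped training set is\n",
+    "\n",
+    "$$\n",
+    "\\left(1-\\frac{1}{n}\\right)^{m} \\approx e^{-m/n}\n",
+    "$$\n",
+    "\n",
+    " for large $n$. This is about $e^{-1}\\approx 0.37$ when $m=n$ and about\n",
+    "$e^{-1/2}\\approx 0.61$ when $m=n/2$, so a sizable share of the bagged models\n",
+    "never sees any one element.\n",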
+    "\n",
+    "To get a sense of bagging, let's suppose we have a two-dimensional plane that\n",
+    "is partitioned into two regions with the following boundary: $y=-x+x^2$.\n",
+    "Pairs of $(x_i,y_i)$ points above this boundary are labeled one and points\n",
+    "below are labeled zero. [Figure](#fig:ensemble_001) shows the two regions\n",
+    "with the nonlinear separating boundary as the black curved line.\n",
+    "\n",
+    "<!-- dom:FIGURE: [fig-machine_learning/ensemble_001.png, width=500 frac=0.75]\n",
+    "Two regions in the plane are separated by a nonlinear boundary. The training\n",
+    "data is sampled from this plane. The objective is to correctly classify the\n",
+    "so-sampled data.   <div id=\"fig:ensemble_001\"></div> -->\n",
+    "<!-- begin figure -->\n",
+    "<div id=\"fig:ensemble_001\"></div>\n",
+    "\n",
+    "<p>Two regions in the plane are separated by a nonlinear boundary. The training\n",
+    "data is sampled from this plane. The objective is to correctly classify the\n",
+    "so-sampled data.</p>\n",
+    "<img src=\"fig-machine_learning/ensemble_001.png\" width=500>\n",
+    "\n",
+    "<!-- end figure -->\n",
+    "\n",
+    "\n",
+    "\n",
+    "\n",
+    "The problem is to take samples from each of these regions and\n",
+    "classify them correctly using a perceptron. A perceptron is the simplest\n",
+    "possible linear classifier that finds a line in the plane to separate two\n",
+    "purported categories. Because the separating boundary is nonlinear, there is no\n",
+    "way that the perceptron can completely solve this problem. The following code\n",
+    "sets up the perceptron available in Scikit-learn."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "attributes": {
+     "classes": [],
+     "id": "",
+     "n": "2"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,\n",
+       "      max_iter=None, n_iter=None, n_jobs=1, penalty=None, random_state=0,\n",
+       "      shuffle=True, tol=None, verbose=0, warm_start=False)"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from sklearn.linear_model import Perceptron\n",
+    "p=Perceptron()\n",
+    "p"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The training data and the resulting perceptron separating boundary\n",
+    "are shown in [Figure](#fig:ensemble_002). The circles and crosses are the\n",
+    "sampled training data and the gray separating line is the perceptron's\n",
+    "separating boundary between the two categories. The black squares are those\n",
+    "elements in the training data that the perceptron misclassified. Because the\n",
+    "perceptron can only produce linear separating boundaries, and the boundary in\n",
+    "this case is nonlinear, the perceptron makes mistakes near where the\n",
+    "boundary curves.  The next step is to see how bagging can\n",
+    "improve upon this by using multiple perceptrons.\n",
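+    "\n",
+    "To see why a vote among several classifiers can help, consider an idealized\n",
+    "case (an assumption made only for illustration, since perceptrons trained on\n",
+    "overlapping bootstrapped data are not independent): three classifiers that err\n",
+    "independently, each correct with probability $p$. The majority vote is correct\n",
+    "when at least two of the three are correct, which has probability\n",
+    "\n",
+    "$$\n",
+    "3p^2(1-p)+p^3 = 3p^2-2p^3\n",
+    "$$\n",
+    "\n",
+    " For example, $p=0.7$ gives $3(0.49)-2(0.343)=0.784$, which exceeds the\n",
+    "accuracy of any single classifier in this idealized setting.\n",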
+    "\n",
+    "<!-- dom:FIGURE: [fig-machine_learning/ensemble_002.png, width=500 frac=0.75]\n",
+    "The perceptron finds a linear boundary between the two classes. <div\n",
+    "id=\"fig:ensemble_002\"></div> -->\n",
+    "<!-- begin figure -->\n",
+    "<div id=\"fig:ensemble_002\"></div>\n",
+    "\n",
+    "<p>The perceptron finds a linear boundary between the two classes.</p>\n",
+    "<img src=\"fig-machine_learning/ensemble_002.png\" width=500>\n",
+    "\n",
+    "<!-- end figure -->\n",
+    "\n",
+    "\n",
+    "The following code sets up the bagging classifier in Scikit-learn. Here we\n",
+    "select only three perceptrons. [Figure](#fig:ensemble_003) shows each of the\n",
+    "three individual classifiers and the final bagged classifier in the panel on the\n",
+    "bottom right. As before, the black circles indicate misclassifications in the\n",
+    "training data. Joint classifications are determined by majority voting."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "attributes": {
+     "classes": [],
+     "id": "",
+     "n": "3"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "BaggingClassifier(base_estimator=Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,\n",
+       "      max_iter=None, n_iter=None, n_jobs=1, penalty=None, random_state=0,\n",
+       "      shuffle=True, tol=None, verbose=0, warm_start=False),\n",
+       "         bootstrap=True, bootstrap_features=False, max_features=1.0,\n",
+       "         max_samples=0.5, n_estimators=3, n_jobs=1, oob_score=False,\n",
+       "         random_state=None, verbose=0, warm_start=False)"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from sklearn.ensemble import BaggingClassifier\n",
+    "bp = BaggingClassifier(Perceptron(),max_samples=0.50,n_estimators=3)\n",
+    "bp"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<!-- dom:FIGURE: [fig-machine_learning/ensemble_003.png, width=500 frac=0.85]\n",
+    "Each panel with the single gray line is one of the perceptrons used for the\n",
+    "ensemble bagging classifier on the lower right. <div\n",
+    "id=\"fig:ensemble_003\"></div> -->\n",
+    "<!-- begin figure -->\n",
+    "<div id=\"fig:ensemble_003\"></div>\n",
+    "\n",
+    "<p>Each panel with the single gray line is one of the perceptrons used for the\n",
+    "ensemble bagging classifier on the lower right.</p>\n",
+    "<img src=\"fig-machine_learning/ensemble_003.png\" width=500>\n",
+    "\n",
+    "<!-- end figure -->\n",
+    "\n",
+    "\n",
+    "The `BaggingClassifier` can estimate its own out-of-sample error if passed the\n",
+    "`oob_score=True` flag upon construction. This keeps track of which samples were\n",
+    "used for training and which were not, and then estimates the out-of-sample\n",
+    "error using those samples that were unused in training. The `max_samples`\n",
+    "keyword argument specifies the number (if an integer) or the fraction (if a\n",
+    "float, such as the `0.50` used above) of items drawn from the training set to\n",
+    "train each base classifier. The smaller the `max_samples` used in the bagging\n",
+    "classifier, the better the out-of-sample error estimate, but at the cost of\n",
+    "worse in-sample performance. Of course, this depends on the overall number of\n",
+    "samples and the degrees-of-freedom in each individual classifier. The\n",
+    "VC-dimension surfaces again!\n",
+    "\n",
+    "## Boosting\n",
+    "\n",
+    "\n",
+    "As we discussed, bagging is particularly effective for individual high-variance\n",
+    "classifiers because the final majority-vote tends to smooth out the individual\n",
+    "classifiers and produce a more stable collaborative solution. Boosting, by\n",
+    "contrast, is particularly effective for high-bias classifiers that are\n",
+    "slow to adjust to new data. Boosting is similar to bagging in\n",
+    "that it uses a majority-voting (or averaging for numeric prediction) process at\n",
+    "the end, and it also combines individual classifiers of the same type.\n",
+    "However, boosting is serially iterative, whereas the individual classifiers\n",
+    "in bagging can be trained in parallel.  Boosting uses the misclassifications of\n",
+    "prior iterations to influence the training of the next iterative classifier by\n",
+    "weighting those misclassifications more heavily in subsequent steps. This means\n",
+    "that, at every step, boosting focuses more and more on specific\n",
+    "misclassifications up to that point, letting the prior classifications\n",
+    "be carried by earlier iterations.\n",
+    "\n",
+    "\n",
+    "The primary implementation for boosting in Scikit-learn is the Adaptive\n",
+    "Boosting (*AdaBoost*) algorithm, which does classification\n",
+    "(`AdaBoostClassifier`) and regression (`AdaBoostRegressor`).  The first step in\n",
+    "the basic AdaBoost algorithm is to initialize the weights over each of the\n",
+    "training set indices, $D_0(i)=1/n$ where there are $n$ elements in the\n",
+    "training set. Note that this creates a discrete uniform distribution over the\n",
+    "*indices*, not over the training data $\\lbrace (x_i,y_i) \\rbrace$ itself. In\n",
+    "other words, if there are repeated elements in the training data, then each\n",
+    "gets its own weight. The next step is to train the base classifier $h_k$ and\n",
+    "record its classification error, weighted by $D_k$, at the $k^{th}$ iteration,\n",
+    "$\\epsilon_k$. Two factors can next be calculated using $\\epsilon_k$,\n",
+    "\n",
+    "$$\n",
+    "\\alpha_k = \\frac{1}{2}\\log \\frac{1-\\epsilon_k}{\\epsilon_k}\n",
+    "$$\n",
+    "\n",
+    " and the normalization factor,\n",
+    "\n",
+    "$$\n",
+    "Z_k = 2 \\sqrt{ \\epsilon_k (1- \\epsilon_k) }\n",
+    "$$\n",
+    "\n",
+    " For the next step, the weights over the training data are updated as\n",
+    "in the following,\n",
+    "\n",
+    "$$\n",
+    "D_{k+1}(i) = \\frac{1}{Z_k} D_k(i)\\exp{(-\\alpha_k y_i h_k(x_i))}\n",
+    "$$\n",
+    "\n",
+    " The final classification result is assembled using the $\\alpha_k$\n",
+    "factors, $g = \\operatorname{sgn}(\\sum_{k} \\alpha_k h_k)$.\n",
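+    "\n",
+    "As a worked check of these formulas, assume the labels and classifier outputs\n",
+    "take values in $\\lbrace -1,+1 \\rbrace$ and that $\\epsilon_k$ is the error\n",
+    "weighted by $D_k$ (both are assumptions stated here for the illustration).\n",
+    "Then $y_i h_k(x_i)=+1$ for a correctly classified element and $-1$ for a\n",
+    "misclassified one, so the update multiplies each weight by\n",
+    "\n",
+    "$$\n",
+    "\\frac{e^{-\\alpha_k}}{Z_k} = \\frac{1}{2(1-\\epsilon_k)}\n",
+    "\\quad \\text{or} \\quad\n",
+    "\\frac{e^{\\alpha_k}}{Z_k} = \\frac{1}{2\\epsilon_k}\n",
+    "$$\n",
+    "\n",
+    " respectively. With $\\epsilon_k=0.25$, for instance, $\\alpha_k=\\frac{1}{2}\\log 3\\approx 0.549$\n",
+    "and $Z_k=2\\sqrt{0.1875}\\approx 0.866$, so correctly classified weights shrink by\n",
+    "the factor $2/3$ and misclassified weights double. The misclassified elements\n",
+    "then carry total weight $0.25\\times 2=0.5$, and the correctly classified ones\n",
+    "carry $0.75\\times 2/3=0.5$.\n",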
+    "\n",
+    "To re-do the problem above using boosting with perceptrons, we set up the\n",
+    "AdaBoost classifier in the following,"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "attributes": {
+     "classes": [],
+     "id": "",
+     "n": "4"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "AdaBoostClassifier(algorithm='SAMME',\n",
+       "          base_estimator=Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,\n",
+       "      max_iter=None, n_iter=None, n_jobs=1, penalty=None, random_state=0,\n",
+       "      shuffle=True, tol=None, verbose=0, warm_start=False),\n",
+       "          learning_rate=0.5, n_estimators=3, random_state=None)"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from sklearn.ensemble import AdaBoostClassifier\n",
+    "clf=AdaBoostClassifier(Perceptron(),n_estimators=3,\n",
+    "                       algorithm='SAMME',\n",
+    "                       learning_rate=0.5)\n",
+    "clf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `learning_rate` above scales the contribution of each classifier, which\n",
+    "controls how aggressively the weights are updated. The resulting\n",
+    "classification boundaries for the embedded perceptrons\n",
+    "are shown in [Figure](#fig:ensemble_004). Compare this to the lower right\n",
+    "panel in  [Figure](#fig:ensemble_003). The performance for both cases is about\n",
+    "the same.  The IPython notebook corresponding to this section has more details\n",
+    "and the full listing of code used to produce all these figures.\n",
+    "\n",
+    "<!-- dom:FIGURE: [fig-machine_learning/ensemble_004.png, width=500 frac=0.75]\n",
+    "The individual perceptron classifiers embedded in the AdaBoost classifier are\n",
+    "shown along with the misclassified points (in black). Compare this to the lower\n",
+    "right panel of [Figure](#fig:ensemble_003). <div id=\"fig:ensemble_004\"></div>\n",
+    "-->\n",
+    "<!-- begin figure -->\n",
+    "<div id=\"fig:ensemble_004\"></div>\n",
+    "\n",
+    "<p>The individual perceptron classifiers embedded in the AdaBoost classifier are\n",
+    "shown along with the misclassified points (in black). Compare this to the lower\n",
+    "right panel of [Figure](#fig:ensemble_003).</p>\n",
+    "<img src=\"fig-machine_learning/ensemble_004.png\" width=500>\n",
+    "\n",
+    "<!-- end figure -->"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.5.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}