
Commit d0ad4f1

committed
added more ML notebooks
1 parent 8f6fc65 commit d0ad4f1

File tree

6 files changed: +4501 -0 lines changed

chapters/machine_learning/notebooks/clustering.ipynb

Lines changed: 496 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 385 additions & 0 deletions
@@ -0,0 +1,385 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!-- new sections -->\n",
"<!-- Ensemble learning -->\n",
"<!-- - Machine Learning Flach, Ch.11 -->\n",
"<!-- - Machine Learning Mohri, pp.135- -->\n",
"<!-- - Data Mining Witten, Ch. 8 -->"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "../../../python_for_probability_statistics_and_machine_learning.jpg",
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.display import Image \n",
"Image('../../../python_for_probability_statistics_and_machine_learning.jpg')"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "1"
},
"collapsed": true
},
"outputs": [],
"source": [
"from pprint import pprint\n",
"import textwrap\n",
"import sys, re\n",
"def displ(x):\n",
"    if x is None: return\n",
"    print (\"\\n\".join(textwrap.wrap(repr(x).replace(' ',''),width=80)))\n",
"\n",
"sys.displayhook=displ"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the exception of the random forest, we have so far considered machine\n",
"learning models as stand-alone entities. Combinations of models that jointly\n",
"produce a classification are known as *ensembles*. There are two main\n",
"methodologies that create ensembles: *bagging* and *boosting*.\n",
"\n",
"## Bagging\n",
"\n",
"Bagging refers to bootstrap aggregating, where bootstrap here is the same as we\n",
"discussed in the section [ch:stats:sec:boot](#ch:stats:sec:boot). Basically,\n",
"we resample the data with replacement and then train a classifier on the newly\n",
"sampled data. Then, we combine the outputs of each of the individual\n",
"classifiers using a majority-voting scheme (for discrete outputs) or a weighted\n",
"average (for continuous outputs). This combination is particularly effective\n",
"for models that are easily influenced by a single data element. The resampling\n",
"process means that these elements cannot appear in every bootstrapped\n",
"training set so that some of the models will not suffer these effects. This\n",
"makes the so-computed combination of outputs less volatile. Thus, bagging\n",
"helps reduce the collective variance of individual high-variance models.\n",
"\n",
"To get a sense of bagging, let's suppose we have a two-dimensional plane that\n",
"is partitioned into two regions with the following boundary: $y=-x+x^2$.\n",
"Pairs of $(x_i,y_i)$ points above this boundary are labeled one and points\n",
"below are labeled zero. [Figure](#fig:ensemble_001) shows the two regions\n",
"with the nonlinear separating boundary as the black curved line.\n",
"\n",
"<!-- dom:FIGURE: [fig-machine_learning/ensemble_001.png, width=500 frac=0.75]\n",
"Two regions in the plane are separated by a nonlinear boundary. The training\n",
"data is sampled from this plane. The objective is to correctly classify the so-\n",
"sampled data. <div id=\"fig:ensemble_001\"></div> -->\n",
"<!-- begin figure -->\n",
"<div id=\"fig:ensemble_001\"></div>\n",
"\n",
"<p>Two regions in the plane are separated by a nonlinear boundary. The training\n",
"data is sampled from this plane. The objective is to correctly classify the so-\n",
"sampled data.</p>\n",
"<img src=\"fig-machine_learning/ensemble_001.png\" width=500>\n",
"\n",
"<!-- end figure -->\n",
"\n",
"\n",
"\n",
"\n",
"The problem is to take samples from each of these regions and\n",
"classify them correctly using a perceptron. A perceptron is the simplest\n",
"possible linear classifier that finds a line in the plane to separate two\n",
"purported categories. Because the separating boundary is nonlinear, there is no\n",
"way that the perceptron can completely solve this problem. The following code\n",
"sets up the perceptron available in Scikit-learn."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "2"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,\n",
"      max_iter=None, n_iter=None, n_jobs=1, penalty=None, random_state=0,\n",
"      shuffle=True, tol=None, verbose=0, warm_start=False)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import Perceptron\n",
"p=Perceptron()\n",
"p"
]
},
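{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data-generation and plotting code is omitted here (the corresponding notebook has the\n",
"full listing), so the next cell is only a minimal sketch: it assumes points sampled\n",
"uniformly from $[0,1]\\times[-1,1]$, labels them against the $y=-x+x^2$ boundary, and fits\n",
"the perceptron above. The names `X_train` and `y_train` are introduced only for this sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"np.random.seed(0)                    # fixed seed for this sketch\n",
"xs = np.random.rand(300)             # x uniform on [0, 1] (assumed sampling region)\n",
"ys = 2*np.random.rand(300) - 1       # y uniform on [-1, 1] (assumed sampling region)\n",
"X_train = np.c_[xs, ys]\n",
"y_train = (ys > -xs + xs**2).astype(int)  # one above the boundary, zero below\n",
"p.fit(X_train, y_train)\n",
"p.score(X_train, y_train)            # in-sample accuracy of the single perceptron"
]
},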
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training data and the resulting perceptron separating boundary\n",
"are shown in [Figure](#fig:ensemble_002). The circles and crosses are the\n",
"sampled training data and the gray separating line is the perceptron's\n",
"separating boundary between the two categories. The black squares are those\n",
"elements in the training data that the perceptron mis-classified. Because the\n",
"perceptron can only produce linear separating boundaries, and the boundary in\n",
"this case is non-linear, the perceptron makes mistakes near where the\n",
"boundary curves. The next step is to see how bagging can\n",
"improve upon this by using multiple perceptrons.\n",
"\n",
"<!-- dom:FIGURE: [fig-machine_learning/ensemble_002.png, width=500 frac=0.75]\n",
"The perceptron finds the best linear boundary between the two classes. <div\n",
"id=\"fig:ensemble_002\"></div> -->\n",
"<!-- begin figure -->\n",
"<div id=\"fig:ensemble_002\"></div>\n",
"\n",
"<p>The perceptron finds the best linear boundary between the two classes.</p>\n",
"<img src=\"fig-machine_learning/ensemble_002.png\" width=500>\n",
"\n",
"<!-- end figure -->\n",
"\n",
"\n",
"The following code sets up the bagging classifier in Scikit-learn. Here we\n",
"select only three perceptrons. [Figure](#fig:ensemble_003) shows each of the\n",
"three individual classifiers and the final bagged classifier in the panel on the\n",
"bottom right. As before, the black circles indicate misclassifications in the\n",
"training data. Joint classifications are determined by majority voting."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "3"
}
},
"outputs": [
{
"data": {
"text/plain": [
"BaggingClassifier(base_estimator=Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,\n",
"      max_iter=None, n_iter=None, n_jobs=1, penalty=None, random_state=0,\n",
"      shuffle=True, tol=None, verbose=0, warm_start=False),\n",
"         bootstrap=True, bootstrap_features=False, max_features=1.0,\n",
"         max_samples=0.5, n_estimators=3, n_jobs=1, oob_score=False,\n",
"         random_state=None, verbose=0, warm_start=False)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.ensemble import BaggingClassifier\n",
"bp = BaggingClassifier(Perceptron(),max_samples=0.50,n_estimators=3)\n",
"bp"
]
},
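{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch (reusing the `X_train` and `y_train` arrays assumed earlier): fit the\n",
"bagged perceptrons and, looking ahead to the discussion below, construct a variant with\n",
"`oob_score=True` so that the classifier reports its own out-of-sample estimate. The\n",
"`bp_oob` name and the larger `n_estimators` are choices made only for this sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bp.fit(X_train, y_train)\n",
"print(bp.score(X_train, y_train))    # in-sample accuracy of the majority-voted ensemble\n",
"# variant that tracks which samples each bootstrap left out\n",
"bp_oob = BaggingClassifier(Perceptron(), max_samples=0.5,\n",
"                           n_estimators=10, oob_score=True)\n",
"bp_oob.fit(X_train, y_train)\n",
"bp_oob.oob_score_                    # out-of-sample accuracy estimated from held-out samples"
]
},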
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!-- dom:FIGURE: [fig-machine_learning/ensemble_003.png, width=500 frac=0.85]\n",
"Each panel with the single gray line is one of the perceptrons used for the\n",
"ensemble bagging classifier on the lower right. <div\n",
"id=\"fig:ensemble_003\"></div> -->\n",
"<!-- begin figure -->\n",
"<div id=\"fig:ensemble_003\"></div>\n",
"\n",
"<p>Each panel with the single gray line is one of the perceptrons used for the\n",
"ensemble bagging classifier on the lower right.</p>\n",
"<img src=\"fig-machine_learning/ensemble_003.png\" width=500>\n",
"\n",
"<!-- end figure -->\n",
"\n",
"\n",
"The `BaggingClassifier` can estimate its own out-of-sample error if passed the\n",
"`oob_score=True` flag upon construction. This keeps track of which samples were\n",
"used for training and which were not, and then estimates the out-of-sample\n",
"error using those samples that were unused in training. The `max_samples`\n",
"keyword argument specifies the number of items from the training set to use for\n",
"the base classifier. The smaller the `max_samples` used in the bagging\n",
"classifier, the better the out-of-sample error estimate, but at the cost of\n",
"worse in-sample performance. Of course, this depends on the overall number of\n",
"samples and the degrees-of-freedom in each individual classifier. The\n",
"VC-dimension surfaces again!\n",
"\n",
"## Boosting\n",
"\n",
"\n",
"As we discussed, bagging is particularly effective for individual high-variance\n",
"classifiers because the final majority-vote tends to smooth out the individual\n",
"classifiers and produce a more stable collaborative solution. On the other\n",
"hand, boosting is particularly effective for high-bias classifiers that are\n",
"slow to adjust to new data. On the one hand, boosting is similar to bagging in\n",
"that it uses a majority-voting (or averaging for numeric prediction) process at\n",
"the end; and it also combines individual classifiers of the same type. On the\n",
"other hand, boosting is serially iterative, whereas the individual classifiers\n",
"in bagging can be trained in parallel. Boosting uses the misclassifications of\n",
"prior iterations to influence the training of the next iterative classifier by\n",
"weighting those misclassifications more heavily in subsequent steps. This means\n",
"that, at every step, boosting focuses more and more on specific\n",
"misclassifications up to that point, letting the prior classifications\n",
"be carried by earlier iterations.\n",
"\n",
"\n",
"The primary implementation for boosting in Scikit-learn is the Adaptive\n",
"Boosting (*AdaBoost*) algorithm, which does classification\n",
"(`AdaBoostClassifier`) and regression (`AdaBoostRegressor`). The first step in\n",
"the basic AdaBoost algorithm is to initialize the weights over each of the\n",
"training set indices, $D_0(i)=1/n$ where there are $n$ elements in the\n",
"training set. Note that this creates a discrete uniform distribution over the\n",
"*indices*, not over the training data $\\lbrace (x_i,y_i) \\rbrace$ itself. In\n",
"other words, if there are repeated elements in the training data, then each\n",
"gets its own weight. The next step is to train the base classifier $h_k$ and\n",
"record the classification error at the $k^{th}$ iteration, $\\epsilon_k$. Two\n",
"factors can next be calculated using $\\epsilon_k$,\n",
"\n",
"$$\n",
"\\alpha_k = \\frac{1}{2}\\log \\frac{1-\\epsilon_k}{\\epsilon_k}\n",
"$$\n",
"\n",
" and the normalization factor,\n",
"\n",
"$$\n",
"Z_k = 2 \\sqrt{ \\epsilon_k (1- \\epsilon_k) }\n",
"$$\n",
"\n",
" For the next step, the weights over the training data are updated as\n",
"in the following,\n",
"\n",
"$$\n",
"D_{k+1}(i) = \\frac{1}{Z_k} D_k(i)\\exp{(-\\alpha_k y_i h_k(x_i))}\n",
"$$\n",
"\n",
" The final classification result is assembled using the $\\alpha_k$\n",
"factors, $g = \\operatorname{sgn}(\\sum_{k} \\alpha_k h_k)$.\n",
"\n",
"To re-do the problem above using boosting with perceptrons, we set up the\n",
"AdaBoost classifier in the following,"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "4"
}
},
"outputs": [
{
"data": {
"text/plain": [
"AdaBoostClassifier(algorithm='SAMME',\n",
"          base_estimator=Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,\n",
"      max_iter=None, n_iter=None, n_jobs=1, penalty=None, random_state=0,\n",
"      shuffle=True, tol=None, verbose=0, warm_start=False),\n",
"          learning_rate=0.5, n_estimators=3, random_state=None)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.ensemble import AdaBoostClassifier\n",
"clf=AdaBoostClassifier(Perceptron(),n_estimators=3,\n",
"                       algorithm='SAMME',\n",
"                       learning_rate=0.5)\n",
"clf"
]
},
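{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the weight-update formulas concrete, the next cell is a small NumPy sketch of a\n",
"single AdaBoost round, assuming labels recoded to $\\lbrace -1,+1 \\rbrace$ and reusing the\n",
"assumed training arrays. It only illustrates the update; it is not Scikit-learn's internal\n",
"implementation, which additionally scales $\\alpha_k$ by the `learning_rate` discussed next."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"yb = 2*y_train - 1                       # recode labels {0,1} -> {-1,+1}\n",
"D = np.ones(len(yb))/len(yb)             # D_0(i) = 1/n, uniform over the indices\n",
"h = Perceptron().fit(X_train, y_train)   # base classifier h_k\n",
"pred = 2*h.predict(X_train) - 1          # predictions in {-1,+1}\n",
"eps = D[pred != yb].sum()                # weighted training error epsilon_k\n",
"alpha = 0.5*np.log((1 - eps)/eps)        # alpha_k\n",
"Z = 2*np.sqrt(eps*(1 - eps))             # normalization factor Z_k\n",
"D = D*np.exp(-alpha*yb*pred)/Z           # updated weights D_{k+1}(i)\n",
"D.sum()                                  # equals one by construction of Z_k"
]
},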
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `learning_rate` above controls how aggressively the weights are\n",
"updated. The resulting classification boundaries for the embedded perceptrons\n",
"are shown in [Figure](#fig:ensemble_004). Compare this to the lower right\n",
"panel in [Figure](#fig:ensemble_003). The performance for both cases is about\n",
"the same. The IPython notebook corresponding to this section has more details\n",
"and the full listing of code used to produce all these figures.\n",
"\n",
"<!-- dom:FIGURE: [fig-machine_learning/ensemble_004.png, width=500 frac=0.75]\n",
"The individual perceptron classifiers embedded in the AdaBoost classifier are\n",
"shown along with the mis-classified points (in black). Compare this to the lower\n",
"right panel of [Figure](#fig:ensemble_003). <div id=\"fig:ensemble_004\"></div>\n",
"-->\n",
"<!-- begin figure -->\n",
"<div id=\"fig:ensemble_004\"></div>\n",
"\n",
"<p>The individual perceptron classifiers embedded in the AdaBoost classifier are\n",
"shown along with the mis-classified points (in black). Compare this to the lower\n",
"right panel of [Figure](#fig:ensemble_003).</p>\n",
"<img src=\"fig-machine_learning/ensemble_004.png\" width=500>\n",
"\n",
"<!-- end figure -->"
]
},
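{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a closing sketch using the same assumed training arrays, fit the AdaBoost ensemble and\n",
"compare its in-sample accuracy against the bagged ensemble from earlier. The exact numbers\n",
"depend on the sampled data, so none are quoted here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"clf.fit(X_train, y_train)\n",
"print(clf.score(X_train, y_train))   # boosted perceptrons\n",
"print(bp.score(X_train, y_train))    # bagged perceptrons, for comparison"
]
},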
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
