Yossi Cohen
Machine Learning with Scikit-learn
INTRO TO ML PROGRAMMING
ML Programming
1. Get data (and labels, for supervised learning)
2. Create a classifier
3. Train the classifier
4. Predict test data
5. Evaluate predictor accuracy
*Configure and improve by repeating steps 2-5
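The full loop above can be sketched with scikit-learn. This is a minimal illustration: the data set (Iris, bundled with scikit-learn) and the classifier choice (k-nearest neighbors) are just examples, not the only options.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Get data (and labels, since this is supervised learning)
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.1, random_state=0)

# 2. Create a classifier
clf = KNeighborsClassifier(n_neighbors=3)

# 3. Train the classifier
clf.fit(X_train, y_train)

# 4. Predict test data
predicted = clf.predict(X_test)

# 5. Evaluate predictor accuracy
print(accuracy_score(y_test, predicted))
```

Steps 2-5 would then be repeated with different classifiers or parameters until the accuracy is good enough.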
The ML Process
(Process diagram: filter outliers → partition data → configure model → classify / regression → validate)
Get Data & Labels
• Sources
–Open data sources
–Collect on your own
• Verify data validity and correctness
• Wrangle data
–make it readable by computer
–Filter it
• Remove Outliers
The pandas Python library can assist in
pre-processing & data manipulation before ML
http://pandas.pydata.org/
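A small sketch of pandas-style wrangling on the kind of data used later in the slides; the DataFrame contents and the 2-standard-deviation outlier rule are made up for the example.

```python
import pandas as pd

# Toy data: ages and a yes/no label (as in the Nutella example later)
df = pd.DataFrame({"age": [3, 7, 76, 11, 22, 37, 56, 2],
                   "loves_nutella": [True, True, False, True,
                                     False, False, False, True]})

# Filter: keep only rows with a plausible age
df = df[df["age"] > 0]

# Remove outliers: drop ages more than 2 standard deviations from the mean
mean, std = df["age"].mean(), df["age"].std()
df = df[(df["age"] - mean).abs() <= 2 * std]

print(df.shape)
```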
Pre-Processing
Change formatting
Remove redundant data
Filter Data (take partial data)
Remove Outliers
Label
Split into train/test sets (e.g. 90/10 or 80/20)
Data Partitioning
• Data and labels
–{[data], [labels]}
–{[3,7, 76, 11, 22, 37, 56,2],[T, T, F, T, F, F, F, T]}
–Data: [Age, Do you love Nutella?]
• Partitioning will create
–{[train data], [train labels],[test data], [test labels]}
–We usually split the data at a ratio of 9:1
–There is a tradeoff between the effectiveness of
the test and the amount of learning we can
provide to the classifier
• We will look at a partitioning function later
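The partitioning function mentioned above can be previewed here: scikit-learn's `train_test_split`, applied to the toy data from this slide with a 9:1 split.

```python
from sklearn.model_selection import train_test_split

data   = [[3], [7], [76], [11], [22], [37], [56], [2]]
labels = ['T', 'T', 'F', 'T', 'F', 'F', 'F', 'T']

# test_size=0.1 gives the 9:1 train:test ratio from the slide
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.1, random_state=0)

print(len(train_data), len(test_data))  # 7 train samples, 1 test sample
```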
Learn (The “Smart Part”)
Classification
If the output is discrete, limited to a fixed set of
classes (groups)
Regression
If the output is continuous
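The two cases share the same API shape in scikit-learn; the decision-tree pair below is an arbitrary illustration, and the tiny training sets are made up.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: discrete output drawn from a fixed set of classes
clf = DecisionTreeClassifier()
clf.fit([[0], [1], [2], [3]], ['cold', 'cold', 'hot', 'hot'])
print(clf.predict([[0.5]]))   # one of the known class labels

# Regression: continuous output
reg = DecisionTreeRegressor()
reg.fit([[0], [1], [2], [3]], [0.0, 0.5, 2.0, 4.5])
print(reg.predict([[1.5]]))   # a real number
```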
Learn Programming
Create Classifier
For most SUPERVISED LEARNING
algorithms this would be
C = ClassifyAlg(Params)
It's up to us (the ML practitioners) to set the best
params
How?
1. We could develop a hunch for it
2. Perform an exhaustive search
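The "exhaustive search" option can be automated with scikit-learn's `GridSearchCV`; the parameter grid and the k-nearest-neighbors classifier here are illustrative choices.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Try every value in the grid with 5-fold cross-validation
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [1, 3, 5, 7]},
                    cv=5)
grid.fit(iris.data, iris.target)

print(grid.best_params_)  # the parameter setting that scored best
```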
Train the classifier
We assigned
C = ClassifyAlg(Params)
This is a general algorithm with some
initializer and configurations.
In this stage we train it using:
C.fit(Data, Labels)
Predict
Once we have a trained classifier C:
Predicted_Labels = C.predict(Data)
Predictor Evaluation
We are not done yet.
We need to evaluate the predictor's accuracy,
both in comparison to other predictors
and against the system requirements.
We will learn several methods for this
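One such method can be previewed here: k-fold cross-validation via scikit-learn's `cross_val_score`, which scores the classifier on several train/test splits instead of a single one. The classifier choice is illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
clf = KNeighborsClassifier()

# 5-fold cross-validation: 5 accuracy scores, one per held-out fold
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores.mean())
```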
ENVIRONMENT
The Environment
• There are many existing environments and
tools we could use
–MATLAB with the Machine Learning Toolbox
–Apache Mahout
–Python with Scikit-learn
• Additional tools
–Hadoop / Map-Reduce to accelerate and
parallelize large data set processing
–Amazon ML tools
–NVIDIA Tools
Scikit-learn
• Installation Instructions in
http://scikit-learn.org/stable/install.html#install-official-release
• Depends on two other libraries:
numpy and scipy
• Easiest way to install on Windows:
Install WinPython
http://sourceforge.net/projects/winpython/files/WinPython_2.7/2.7.9.4/
–Let's install this together
For Linux / Mac computers, just install the three
libraries separately using pip
THE DATA
Data sets
There are many data sets to work on.
One of them is the Iris data set: classification
of flowers into three groups. It has an interesting
story you could Google later.
We'll work on the Iris data.
Lab A – Plot the Iris data
Plot sepal length vs. sepal width with labels
ONLY
How? Google Iris data and the scikit learn
environment
Try to understand the second part of the
program with the PCA
Iris Data
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
Plot Iris Data
plt.figure(2, figsize=(8, 6))
plt.clf()
plt.scatter(X[:, 0], X[:, 1],
c=Y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
Add PCA for better classification
from sklearn.decomposition import PCA  # required for PCA below
fig = plt.figure(1, figsize=(8, 6))
ax = Axes3D(fig, elev=-150, azim=110)
X_reduced = PCA(n_components=3).fit_transform(iris.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y,
cmap=plt.cm.Paired)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.w_xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.w_yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.w_zaxis.set_ticklabels([])
plt.show()
Iris Data Classified
Thank you!
More about me:
Yossi Cohen
yossicohen19@gmail.com
+972-545-313092
• Video compression and computer vision enthusiast & lecturer
• Surfer
