Jupyter Notebook and
Machine Learning 101
Materials: Karlijn Willems. 2019. Jupyter Notebook Tutorial. The definitive Guide. DataCamp
Tairi Delgado. 2018. Hands-On Data Analytics for Beginners with Google Colaboratory
Randy Olson. 2018. Example Machine Learning Notebook.
Sarah Guido and Andreas Müller. 2016. Introduction to Machine Learning with Python. O’Reilly Media, Inc
Vahid Mirjalili and Sebastian Raschka. 2019. Python Machine Learning. Packt Publishing
Olga Scrivner @obscrivn
https://www.linkedin.com/in/olgascrivner/
Research Scientist, CNS
Visiting Lecturer, Data Science, IU
Corporate Faculty, Data Analytics, HU
Outline
Part 1 – Introduction to Jupyter Notebook
Part 2 – Working with data
Break
Part 3 – Machine Learning Fundamentals
- Classification
- Regression
- Clustering
Introduction to Jupyter Notebook
If you know Jupyter – skip this section
What is Jupyter Notebook?
Notebook
- Documents with code and text elements (text narrative, media, figures, equations)
- Data analysis and description in one single document and in real time
Jupyter
- An acronym for Julia, Python, and R (the first programming languages supported by the Jupyter application)
Figure labels: Text, Code, Figures
Source: (Willems, 2019)
Jupyter Notebook Components
Editing and running documents via a web browser
Source: (Willems, 2019)
Jupyter Notebook
- Kernel: a program running the user’s code
- Dashboard: a front-end interface showing input, output, files, etc.
History
2001 Fernando Pérez started working on a notebook system
2011 IPython – a web notebook system was released
2014 Project Jupyter incorporated IPython
2018 Project JupyterLab is introduced
Source: Project Jupyter. 2018
Jupyter Notebooks
GOOGLE COLAB
- Collaboration
- Must sign into Google account
- Storage on Google Drive
- No installation
ANACONDA
- Off-line
- Local access
- Anaconda Cloud for sharing
- Installation comes with many useful libraries
Jupyter Project
http://www.jupyter.org
nbviewer
Select Book
Select Chapter 1
Find Code blocks, Text narratives, Figures
Notebook Structure
Text block
Code block
Section header
IPYNB
An .ipynb file is a text file that describes the contents of your notebook in a format called JSON.
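To make the JSON idea concrete, here is a minimal sketch that parses a tiny hand-written notebook with Python's built-in json module. The notebook text is an invented example (one Markdown cell, one code cell), not a workshop file, and real notebooks usually carry more metadata.

```python
import json

# A minimal, hand-written notebook (format version 4): one Markdown cell, one code cell.
# This text is an illustrative assumption, not one of the workshop files.
notebook_text = """
{
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# My Report"]},
    {"cell_type": "code", "execution_count": null, "metadata": {},
     "outputs": [], "source": ["print('hello')"]}
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 5
}
"""

# Parse the JSON text into ordinary Python dictionaries and lists
notebook = json.loads(notebook_text)

# Collect the type of every cell, in order
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)
```

Opening a real .ipynb file in a plain text editor shows the same kind of structure.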
Running Jupyter Notebook
Anaconda
- Install Anaconda (Python distribution for data
science with popular libraries and tools)
- Download the latest version of Anaconda for
Python 3 https://anaconda.org/
Anaconda installs both Python and
Jupyter Notebook
Anaconda Navigator
Search for the Anaconda Navigator icon
Start Menu: type anaconda navigator
JupyterLab Interface
source: https://jupyterlab.readthedocs.io/en/latest/user/interface.html
File Browser
Current Running Notebooks
Open Tabs
Collapsible Left menu – on Click
Create a New Folder
Rename Folder
Select Folder (Click on the Folder)
New File New Folder Upload
Check your path
Select Desktop or Documents
Create a New File Notebook
New File New Folder Upload
Select New File
Select Notebook Python 3
Rename (Right Click)
Review
Toolbar buttons: Save, Add cell, Delete cell, Copy cell, Paste cell, Run cell, Stop Execution
Cells: Practice
1. Create NEW cell
2. Type first_name and assign it a text value in single or double quotes
3. Create NEW cell
4. Type print(first_name) – second line
5. Click inside the first cell
6. RUN cell 1
7. RUN cell 2
Cell is highlighted when selected
Numbers – the order of execution
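The two practice cells could look roughly like this. The value "Olga" is only a placeholder; any text in single or double quotes works.

```python
# Cell 1: store a text value in a variable (the name here is a placeholder)
first_name = "Olga"   # 'Olga' in single quotes gives the same result

# Cell 2: print the value stored in the variable
print(first_name)
```

If cell 2 is run before cell 1, Python reports that first_name is not defined, which is why the order of execution matters.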
Copying and Deleting Cells: Practice
1. Make a new cell
2. Copy this cell
3. Paste this cell
4. Place the cursor in your first copy
5. Delete this cell
Quiz Question!
What is the difference between the two brackets?
[1] (a number inside) – Cell is executed
[ ] (empty) – Cell is NOT executed
Code and Markdown
We have used code so far (it is selected by default).
1. Create a NEW cell
2. Click on the drop-down menu
3. Select Markdown
Markdown Headers: Practice
1. Create a new cell
2. Select Markdown (instead of Code)
3. Create a header # My Report
4. Run
5. Create a new cell
6. Create a sub-header ## Introduction
7. Run
Files for Workshop
1. Download files from
https://languagevariationsuite.wordpress.com/
2. Place these files into today’s Folder for CrashCourse
3. Go back to JupyterLab (Left Menu), find the Exploring Google Collaboration notebook (orange icon) and double click on the file name
Green – running notebook
Orange – ipynb files
You might be asked the first time to choose a Kernel – select Python 3
Can we keep both files open? YES!
1. Select a file you want to add to the main dashboard.
2. Drag and Drop the file
3. Adjust the width as needed
Markdown: Practice
1. Click inside the Markdown cell to view the structure (Google Colab notebook)
2. Reproduce the code in your own Notebook (until the LaTeX Tutorial)
Step 1. Delete your current cells
Step 2. Create new cell
Step 3. Select Markdown
Step 4. Type the content
Step 5. Run
LaTeX Tutorial – homework
When done – save your Notebook and close
Working with Data
Python Libraries
pandas – a library for data wrangling and analysis. The main data structure is the DataFrame (similar to a spreadsheet).
pd – a common short name: import pandas as pd
Instead of pandas.DataFrame() you can use pd.DataFrame()
NumPy – a library for linear algebra and scientific computing. The main data structure is the array (matrix).
np – import numpy as np
matplotlib – a scientific plotting library
plt – import matplotlib.pyplot as plt
%matplotlib inline – recommended to view figures in the browser
Source: Sarah Guido and Andreas Müller. 2016. Introduction to Machine Learning with Python. Chapter 1.
Import Python Libraries
1. Create a new notebook
2. Drag and drop Exploring your Data.ipynb
3. You should have a split screen with your file and Exploring Data
4. Import libraries (copy/paste or type in) as shown into your
notebook and Run
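A sketch of what the import cell might contain. The matplotlib lines are left as comments because %matplotlib inline only works inside a notebook; the small table and array are invented to show that the short names work.

```python
import numpy as np
import pandas as pd

# In a notebook cell you would usually also run these two lines:
#   %matplotlib inline
#   import matplotlib.pyplot as plt

# pd.DataFrame() is the same as pandas.DataFrame(), just shorter to type
table = pd.DataFrame({"x": [1, 2, 3]})

# np.array() creates a NumPy array
values = np.array([1, 2, 3])

print(table.shape, values.shape)
```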
Import Dataset: iris
Double click on dataset
You can switch between tabs
Import dataset
setosa, versicolor, virginica
petals, sepals – length, width
Data Description – Numeric Data
describe() returns count, mean, standard deviation, min, max, quantiles
datasetname.describe()
Something is wrong: which column is missing?
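The slide does not say what exactly is wrong, so the sketch below only shows two things describe() can reveal. The three-row table is invented for illustration: it has one missing value and one text column.

```python
import numpy as np
import pandas as pd

# Toy rows (an assumption, not the workshop file)
toy = pd.DataFrame({
    "sepal_length_cm": [5.1, 4.9, 6.3],
    "petal_width_cm": [0.2, np.nan, 1.8],                       # one missing value
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],  # text column
})

# By default describe() summarizes numeric columns only
summary = toy.describe()
print(summary)

# Text columns are left out, and "count" ignores missing values
print(summary.columns.tolist())
print(summary.loc["count", "petal_width_cm"])
```

A column that is absent from the summary, or a count lower than the number of rows, is a hint to look at the data more closely.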
Data Description – Categorical Data
value_counts() returns counts for categorical values
datasetname['columnname'].value_counts()
Something is wrong
count() – observations
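A small sketch of value_counts() on a text column. The misspelled labels are invented to show how typos surface as extra categories; the workshop data may contain different ones.

```python
import pandas as pd

# Toy labels with two deliberate typos (assumed for illustration)
toy = pd.DataFrame({
    "class": ["Iris-setosa", "Iris-setosa", "Iris-setossa", "versicolor"]
})

# Count how often each distinct value occurs
counts = toy["class"].value_counts()
print(counts)

# count() gives the number of non-missing observations in the column
print(toy["class"].count())
```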
Data Cleaning
replace()
datasetname.replace(expression, substitution)
datasetname['column'].replace(expression, substitution)
replace by Iris-versicolor
replace by Iris-setosa
Why does CLASS have quotes?
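One way the two replacements might look. The wrong spellings used here are assumptions; check value_counts() on your own data to see which labels need fixing.

```python
import pandas as pd

# Toy labels; the misspellings are assumed for illustration
toy = pd.DataFrame({"class": ["Iris-setosa", "Iris-setossa", "versicolor"]})

# The column name is text, so it goes in quotes: toy["class"]
toy["class"] = toy["class"].replace("Iris-setossa", "Iris-setosa")
toy["class"] = toy["class"].replace("versicolor", "Iris-versicolor")

print(toy["class"].tolist())
```

replace() matches whole cell values here, so the already correct labels are left untouched.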
Missing Data
isnull() returns False or True
isnull().sum() returns counts of NA
datasetname.isnull().sum()
Options:
1. Delete all 5 records
2. Replace 5 records with another value
Let’s try both!
Dropping NA
Creating a new dataset
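A sketch of option 1 on an invented four-row table (the workshop data has 5 missing records; this toy table has 2). dropna() returns a new dataset and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# Toy table with two missing values (assumed for illustration)
iris = pd.DataFrame({
    "petal_length_cm": [1.4, 1.3, 4.7, 6.0],
    "petal_width_cm": [0.2, np.nan, 1.4, np.nan],
})

# Count missing values per column
missing_per_column = iris.isnull().sum()
print(missing_per_column)

# Create a new dataset without the rows that contain missing values
iris_drop = iris.dropna()
print(len(iris), len(iris_drop))
```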
Replacing NA by MEAN
1. Calculate MEAN for petal_width_cm
np.mean() (do you remember what np is?)
np.mean(datasetname['columnname'])
2. Replace missing values by MEAN
fillna()
datasetname['columnname'].fillna(value)
Create a new dataset (copy of iris)
Replace values by mean
inplace=True – changes will modify the current dataset
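A sketch of option 2 on an invented table. Two choices here differ from the slide and are assumptions: the mean is computed with the pandas .mean() method (which skips missing values), and the filled column is assigned back instead of using inplace.

```python
import numpy as np
import pandas as pd

# Toy table with two missing values (assumed for illustration)
iris = pd.DataFrame({
    "petal_length_cm": [1.4, 1.3, 4.7, 6.0],
    "petal_width_cm": [0.2, np.nan, 1.4, np.nan],
})

# Work on a copy so the original dataset stays as it is
iris_fill = iris.copy()

# Mean of the values that are present: (0.2 + 1.4) / 2
mean_width = iris_fill["petal_width_cm"].mean()

# Replace the missing values by the mean and store the result in the column
iris_fill["petal_width_cm"] = iris_fill["petal_width_cm"].fillna(mean_width)

print(iris_fill)
```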
Saving New Files
Let’s save the clean iris_drop and iris_fill as two new csv files (for future use)
to_csv()
datasetname.to_csv("filename.csv", index=False)
index=False – remove index numbers from the CSV
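A sketch of saving and re-reading a file. It writes into a temporary folder so nothing in your workshop folder is overwritten; in the notebook you would simply pass "iris_drop.csv". The two-row table is invented.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned dataset
iris_drop = pd.DataFrame({
    "sepal_length_cm": [5.1, 4.9],
    "class": ["Iris-setosa", "Iris-setosa"],
})

# Temporary location (an assumption made for this example)
out_path = os.path.join(tempfile.mkdtemp(), "iris_drop.csv")

# index=False keeps the row numbers out of the file
iris_drop.to_csv(out_path, index=False)

# Read the file back to check what was saved
reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```

Without index=False the file would contain an extra unnamed column holding the row numbers.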
How do I know if a library is installed by default in Anaconda?
Back to Anaconda Navigator
Look for a library in Installed or Not installed
Use a search window
Practice: Find the library - seaborn
Data Visualization
seaborn – a library for statistical visualization
pairplot()
sb.pairplot(datasetname, hue='columnname')
Diagonal – the distribution of each variable
Off-diagonal – scatterplots between two variables
Histograms
hist()
datasetname['columnname'].hist()
Something is wrong!!
Boxplot – Do We Have Outliers?
boxplot()
datasetname.boxplot(column='columnname', by='groupcolumn')
What is PLT?
Remember – you can use single or double quotes
What can we tell about the outlier?
Selecting Outlier
(Condition 1) & (Condition 2)
Condition 1: class is Iris-versicolor and Condition 2: sepal_length_cm < 1
iris_drop[(iris_drop['class'] == 'Iris-versicolor') & (iris_drop['sepal_length_cm'] < 1)]
Removing Outliers
~ tilde = NOT
dataset that is NOT (Condition 1 & Condition 2)
iris_drop.to_csv("iris_drop.csv", index=False)
save the current version of iris_drop
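A sketch of selecting and then removing the outliers with the two conditions. The four rows are invented; the column names follow the slides.

```python
import pandas as pd

# Toy rows: two versicolor flowers with implausibly small sepal lengths
iris_drop = pd.DataFrame({
    "sepal_length_cm": [5.1, 0.067, 6.4, 0.055],
    "class": ["Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-versicolor"],
})

# (Condition 1) & (Condition 2): each condition needs its own parentheses
is_outlier = (iris_drop["class"] == "Iris-versicolor") & (iris_drop["sepal_length_cm"] < 1)

# Select the outliers
outliers = iris_drop[is_outlier]
print(outliers)

# ~ means NOT: keep every row that is not an outlier
iris_drop = iris_drop[~is_outlier]
print(iris_drop)
```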
Machine Learning Concepts
What is Machine Learning?
Extracting knowledge from data (Google Developers. Machine Learning. 2020)
source: Aleksii Kharkovyna. 2019. A Beginner’s Guide to Data Science.
AI – Machine’s ability to solve problems
ML – Machine’s ability to learn
DL – a technique for ML imitating brain neural network
Michael Garbade. 2018. Clearing the Confusion : AI vs Machine Learning vs Deep Learning Differences
Types of Machine Learning
Supervised Learning: Input and Output provided during training (learning)
- Modeling the relationship between measured data (INPUT) and labels (OUTPUT)
- The model is applied to new data to predict a LABEL (e.g. spam / no-spam)
- Discrete Labels
- Continuous Values
Unsupervised Learning: No training output provided (No Labels)
- The purpose is to model the underlying structure of data
Source: Jake VanderPlas. 2016. Python Data Science Handbook. O’Reilly Media
Scikit-Learn
scikit-learn – Python modules for Machine Learning and Data Mining
https://scikit-learn.org/stable/
Classification
Labels
Measurements
Can we predict species of flower from 4 measurements (sepal length and
width, petal length and width)?
Sarah Guido and Andreas Müller. 2016. Introduction to Machine Learning with Python
Training and Testing Data
Training Data or Training Set (~80%): the sample of data used to build/train models
Test Data or Hold-Out Set (~20%): the sample of data used for model evaluation; the model has never seen this set
Naming Notation:
X - dataset
y - labels
X_train, X_test
y_train, y_test
Sarah Guido and Andreas Müller. 2016. Introduction to Machine Learning with Python
Splitting into Training and Test Data
Step 1: Split dataset into FEATURES (X) and LABELS (y) by removing the column 'class'
Step 2: Split into 80% training and 20% test data
train_test_split(features, labels, 20% test data)
Sarah Guido and Andreas Müller. 2016. Introduction to Machine Learning with Python
Sergii Boiko. 2018. Pandas Axis Explained.
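The two steps can be sketched as follows. As an assumption, the example uses the copy of the iris data that ships with scikit-learn instead of the workshop CSV, renames its label column to 'class', and fixes random_state so the split is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in for the workshop file: the iris data bundled with scikit-learn
data = load_iris(as_frame=True)
iris = data.frame.rename(columns={"target": "class"})  # labels are coded 0, 1, 2 here

# Step 1: FEATURES (X) are all columns except 'class'; LABELS (y) are the 'class' column
X = iris.drop("class", axis=1)   # axis=1 means "drop a column"
y = iris["class"]

# Step 2: 80% training data, 20% test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)
```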
Decision Tree Classification
A decision tree splits data iteratively and assigns a binary (yes/no) or numeric value label
Jake VanderPlas. 2016. Python Data Science Handbook. O’Reilly Media
Step 1: Create the classifier
Step 2: Train (FIT) the classifier
Step 3: Validate the classifier
Remember we must use TEST data (unseen) for validation
Jebaseelan Ravi. 2018. Machine Learning Iris Classification.
Prediction Evaluation
Step 4: Make predictions with Test data
Step 5: Evaluate predictions
What is your score?
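Steps 1 to 5 might look like the sketch below. It assumes the iris data bundled with scikit-learn and default tree settings; the workshop notebook may use other options, so your score can differ.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Features and labels from the bundled iris data (a stand-in for the workshop file)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Step 1: create the classifier
classifier = DecisionTreeClassifier(random_state=1)

# Step 2: train (fit) the classifier on the training data
classifier.fit(X_train, y_train)

# Step 3: validate on the unseen test data (share of correct answers)
test_score = classifier.score(X_test, y_test)

# Step 4: make predictions with the test data
y_pred = classifier.predict(X_test)

# Step 5: evaluate the predictions against the true labels
accuracy = accuracy_score(y_test, y_pred)

print(test_score, accuracy)
```

For a classifier, score() and accuracy_score() report the same quantity, so steps 3 and 5 should agree.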
Making Predictions – New Array
We found an Iris with a sepal length = 5 cm, a sepal width = 2.9 cm, a petal length = 1 cm, and a petal width = 0.2 cm.
NumPy (Numerical Python) – mathematical computation on arrays and matrices
Multidimensional arrays with homogeneous elements (usually numbers)
Compare: Data Frame – 2x2 with heterogeneous elements
A 1D array is a vector: the shape is just the number of components.
A 2D array is a matrix: the shape is (number of rows, number of columns).
A 3D array: the shape is (number of frames, rows in each frame, columns in each frame)
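A sketch of predicting the species of the new flower. It assumes a decision tree trained on the bundled scikit-learn iris data, whose columns are in the order sepal length, sepal width, petal length, petal width.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Train on the bundled iris data (a stand-in for the workshop classifier)
iris = load_iris()
classifier = DecisionTreeClassifier(random_state=1).fit(iris.data, iris.target)

# One flower = one row with 4 measurements, so the array must be 2D: shape (1, 4)
new_flower = np.array([[5.0, 2.9, 1.0, 0.2]])
print(new_flower.shape)

# predict() returns one label per row; take the first (and only) one
predicted_index = classifier.predict(new_flower)[0]
predicted_name = iris.target_names[predicted_index]
print(predicted_name)
```

The double square brackets matter: a single pair would create a 1D array, and scikit-learn expects rows and columns.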
Arrays
1D Array: one array with 3 elements
2D Array: two arrays with 3 elements each
3D Array: three frames with 2 arrays and 2 elements each
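The three cases can be built directly in NumPy; the numbers themselves are arbitrary.

```python
import numpy as np

# 1D: one array with 3 elements
one_d = np.array([1, 2, 3])

# 2D: two arrays with 3 elements each (2 rows, 3 columns)
two_d = np.array([[1, 2, 3],
                  [4, 5, 6]])

# 3D: three frames, each with 2 arrays of 2 elements
three_d = np.array([[[1, 2], [3, 4]],
                    [[5, 6], [7, 8]],
                    [[9, 10], [11, 12]]])

print(one_d.shape, two_d.shape, three_d.shape)
```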
Arrays: Practice
Check the dimensions for X_test, X_train, y_test, y_train:
- How many columns are in X_train?
- What is the size of X_train and X_test?
- How many labels are in y_test and y_train?
- Which datasets are 2D and which ones are 1D?
2D 1D
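One way to check the dimensions is with .shape and .ndim. This sketch assumes the bundled scikit-learn iris data and an 80/20 split; the same attributes work on your own X_train, X_test, y_train and y_test.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# .shape gives the size, .ndim gives the number of dimensions
for name, array in [("X_train", X_train), ("X_test", X_test),
                    ("y_train", y_train), ("y_test", y_test)]:
    print(name, array.shape, f"{array.ndim}D")
```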
Clustering
Unsupervised Learning
Grouping observations into a number of groups (k-groups) based on
similar characteristics
- Discovering underlying patterns
- Grouping similar data together
- Finding meaningful structure
“learning by observation”: no label information is known
Source: CS 412 Jiawei Han (2018)
How Many Clusters?
K-Means Cluster
K-Means – 1) choosing a number of clusters (K) and 2) assigning centroids (means)
We are using the entire data (without split) and we do not need labels.
Predict Clusters
Let’s create a scatterplot with new labels!
- Create a new column
- Replace integers by species names (you can use your own labels)
- Use labels for color coding
Note: While similar to Classification, the labels have no a priori meaning. Label assignment is random.
Sarah Guido and Andreas Müller. 2016. Introduction to Machine Learning with Python
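A sketch of K-Means with 3 clusters, followed by the new column. It assumes the bundled scikit-learn iris measurements; the names "group A/B/C" are placeholders, because cluster numbers do not correspond to species in any fixed way.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Measurements only: clustering does not use the species labels
features = load_iris(as_frame=True).data

# 1) choose K = 3 clusters, 2) let K-Means place the centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1)
cluster_ids = kmeans.fit_predict(features)

# Create a new column with the cluster number of every flower
clustered = features.copy()
clustered["cluster"] = cluster_ids

# Replace integers by names of your choice (placeholders here)
clustered["cluster_name"] = clustered["cluster"].map(
    {0: "group A", 1: "group B", 2: "group C"}
)

print(clustered["cluster_name"].value_counts())
```

To use species names instead, compare the clusters with the original labels first and then decide which number gets which name.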
Scatterplots
Figure panels: Old Labels, Predicted Labels
Hierarchical Cluster
Sarah Guido and Andreas Müller. 2016. Introduction to Machine Learning with Python
Partitional Clustering – NO overlapping points (K-Means)
Hierarchical – overlapping points
Dendrogram
based on similarity distance
based on dissimilarity distance
K-Means – we measure a distance between pairs of observations
Hierarchical Cluster – we measure a distance between two clusters
Hierarchical Cluster – Linkage Methods
UC Business Analytics R Programming Guide. Hierarchical Cluster Analysis.
Ward – a linkage method that minimizes the total within-cluster variance
We do not need labels for clustering but we will use them for dendrogram plotting.
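A sketch of Ward linkage using SciPy, assuming the bundled scikit-learn iris data. The dendrogram is computed with no_plot=True so the example runs without opening a figure; in the notebook you would drop that argument to draw it.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import load_iris

iris = load_iris()

# Build the hierarchy with Ward linkage; labels are not needed for this step
merges = linkage(iris.data, method="ward")

# Species names are used only to label the leaves of the dendrogram
leaf_labels = [str(name) for name in iris.target_names[iris.target]]
tree = dendrogram(merges, labels=leaf_labels, no_plot=True)

# Cut the hierarchy into at most 3 flat clusters
flat_clusters = fcluster(merges, t=3, criterion="maxclust")

print(merges.shape, len(tree["ivl"]), len(set(flat_clusters)))
```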
Summary
Sarah Guido and Andreas Müller. 2016. Introduction to Machine Learning with Python
Let’s save your Figure!
Summary
Doaa Taha. 2018. Lecture slides
Regression
Modeling the relationship between one or multiple features (independent variables or predictors) and a target (response or predicted variable)
Vahid Mirjalili and Sebastian Raschka. 2019. Python Machine Learning. Packt Publishing
Regression Types
Independent Variables
- Simple: 1 independent
- Multiple: > 1 independent
- Ridge: highly correlated
- Stepwise: identification of best variables
- Lasso
Regression Line Shape
- Linear
- Quadratic
- Curvilinear
- Logistic
Dependent Variable
- Linear: continuous
- Logistic: binary
- Nominal: > 2 categories
- Ordinal: ordered response
- Poisson: count
- Multivariate: > 1 dependent
Ch.4 Multivariate Data Analysis. Joseph Hair et al. 2010. Pearson
Linear Regression Assumptions
1. Variables Type: Continuous (Interval or Ratio)
2. Linear: The relationship between Y and x is linear. Inspect your Y and X relationship in a scatterplot
3. Outliers: There should be no significant outliers (high leverage, large residuals, large influence)
4. Independence: You should have independence of observations
5. Equal Variance: The variances along the line of best fit remain similar (homoscedasticity, as opposed to heteroscedasticity)
Normal: The errors ϵ are normally distributed
Note: the values of x are fixed. We do not make a distributional assumption about the predictor variable.
Ch.4 Multivariate Data Analysis. Joseph Hair et al. 2010. Pearson
Ch.13 Applied Statistics in R. David Dalpiaz (David Dalpiaz, 2019)
Housing Dataset - Boston
https://raw.githubusercontent.com/rasbt/python-machine-learning-book-3rd-edition/master/ch10/housing.data.txt
Vahid Mirjalili and Sebastian Raschka. 2019. Python Machine Learning.
- Use URL to import data
- Dataset has no header: header=None
- Separator is spaces (one or more): sep="\s+"
- Add column names
- Show the top 5 rows
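A sketch of the import options. So that it runs without an internet connection, a few made-up rows in a text buffer stand in for the URL; with the real file you would pass the URL as the first argument. Only three column names are used here as an illustration, and the values are invented.

```python
import io

import pandas as pd

# Stand-in for the URL: made-up rows, no header, one or more spaces between values
raw_text = """ 0.1   6.5  24.0
 0.2   6.4   21.6
 0.3  7.1  34.7
"""

# header=None: the first row is data, not column names
# sep=r"\s+": columns are separated by one or more whitespace characters
df = pd.read_csv(io.StringIO(raw_text), header=None, sep=r"\s+")

# Add column names (an illustrative subset; MEDV is the target used later)
df.columns = ["CRIM", "RM", "MEDV"]

# Show the top 5 rows (this toy table only has 3)
print(df.head())
```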
Preparing Dataset
Can we predict housing prices (MEDV) in Boston?
Vahid Mirjalili and Sebastian Raschka. 2019. Python Machine Learning.
STEP 1: Create Features (Predictors) and Target (Response) sets
STEP 2: Scale between 0 and 1
STEP 3: Split into Training 80% and Test 20% data
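The three steps might be sketched like this. The table is randomly generated as a stand-in for the housing data (only the column name MEDV comes from the slides), and MinMaxScaler is assumed as the tool for scaling between 0 and 1.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Generated stand-in data: two predictors and a target named MEDV
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "RM": rng.uniform(4, 9, 50),
    "LSTAT": rng.uniform(2, 35, 50),
})
df["MEDV"] = 5 * df["RM"] - 0.5 * df["LSTAT"] + rng.normal(0, 1, 50)

# STEP 1: features (predictors) and target (response)
X = df.drop("MEDV", axis=1)
y = df["MEDV"]

# STEP 2: scale every feature to the range 0 to 1
X_scaled = MinMaxScaler().fit_transform(X)

# STEP 3: 80% training data, 20% test data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)
```

This follows the order on the slide; a common alternative is to split first and fit the scaler on the training data only.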
Linear Regression Model
STEP 1: Create Model
STEP 2: Fit Model
STEP 3: Evaluate (R²)
Prediction
y_pred = model.predict(X_test)
Create a DataFrame with correct labels and predicted labels
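A sketch of the model steps and the comparison table. As an assumption, generated regression data from scikit-learn replaces the housing data, so the numbers will not match the workshop results.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Generated stand-in data: 100 rows, 3 features
X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# STEP 1: create the model
model = LinearRegression()

# STEP 2: fit the model on the training data
model.fit(X_train, y_train)

# STEP 3: evaluate on the test data; score() returns R² for regression models
r2 = model.score(X_test, y_test)

# Predict the target for the test data
y_pred = model.predict(X_test)

# Put correct and predicted values side by side
comparison = pd.DataFrame({"actual": y_test, "predicted": y_pred})
print(r2)
print(comparison.head())
```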
