Jupyter machine learning crash course

Jupyter Notebook and
Machine Learning 101
Materials: Karlijn Willems. 2019. Jupyter Notebook Tutorial. The definitive Guide. DataCamp
Tairi Delgado. 2018. Hands-On Data Analytics for Beginners with Google Colaboratory
Randy Olson. 2018. Example Machine Learning Notebook.
Sarah Guido and Andreas Müller. 2016. Introduction to Machine Learning with Python. O’Reilly Media, Inc
Vahid Mirjalili and Sebastian Raschka. 2019. Python Machine Learning. Packt Publishing
Olga Scrivner @obscrivn
https://www.linkedin.com/in/olgascrivner/
Research Scientist, CNS
Visiting Lecturer, Data Science, IU
Corporate Faculty, Data Analytics, HU

Outline
Part 1 – Introduction to Jupyter Notebook
Part 2 – Working with data
Break
Part 3 – Machine Learning Fundamentals
- Classification
- Regression
- Clustering

Introduction to Jupyter Notebook
If you know Jupyter – skip this section

What is Jupyter Notebook?
Notebook
- Documents with Code and Text elements (text narrative, media, figures, equations)
- Data analysis and description in one single document and in real time
Jupyter
- An acronym for Julia, Python, and R (the first programming languages supported by the Jupyter application)
Elements shown: Text, Code, Figures
Source: (Willems, 2019)

Jupyter Notebook Components
Editing and running documents via a web browser
Jupyter Notebook
Kernel – A program running the user’s code
Dashboard – A front-end interface showing input, output, files etc.
Source: (Willems, 2019)

History
2001 Fernando Pérez started working on a notebook system
2011 IPython – a web notebook system was released
2014 Project Jupyter incorporated IPython
2018 Project JupyterLab is introduced
Source: Project Jupyter. 2018

Jupyter Notebooks
GOOGLE COLAB
- Collaboration
- Must sign into Google account
- Storage on Google Drive
- No installation
ANACONDA
- Off-line
- Local access
- Anaconda Cloud for sharing
- Installation comes with many useful libraries

Jupyter Project
http://www.jupyter.org

nbviewer
Select Book
Select Chapter 1
Find Code blocks, Text narratives, Figures

Notebook Structure
Text block
Code block
Section header

IPYNB
.ipynb file is a text file
that describes the
contents of your
notebook in a format
called JSON.

Running Jupyter Notebook
Anaconda

Anaconda
- Install Anaconda (Python distribution for data
science with popular libraries and tools)
- Download the latest version of Anaconda for
Python 3 https://anaconda.org/
Anaconda installs both Python and
Jupyter Notebook

Anaconda Navigator
Search for the Anaconda
Navigator Icon
Start Menu:
Type –
anaconda navigator

JupyterLab Interface
source: https://jupyterlab.readthedocs.io/en/latest/user/interface.html
File Browser
Current Running Notebooks
Open Tabs
Collapsible Left menu – on Click

Create a New Folder
Rename Folder
Select Folder (Click on the Folder)
New File New Folder Upload
Check your path
Select Desktop or Documents

Create a New File Notebook
New File New Folder Upload
Select New File
Select Notebook Python 3
Rename (Right Click)

Review
Toolbar buttons:
- Save
- Add cell
- Delete cell
- Copy cell
- Paste cell
- Run cell
- Stop Execution

Cells: Practice
1. Create NEW cell
2. Type first_name and assign it a text value (single or double quotes)
3. Create NEW cell
4. Type print(first_name) – second line
5. Click inside the first cell
6. RUN cell 1
7. RUN cell 2
The cell is highlighted when selected
Numbers – the order of execution
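
The two practice cells could look like the minimal sketch below. The value "Ada" is only a placeholder, not part of the workshop files; use your own name.

```python
# Cell 1: assign a string to a variable (single or double quotes both work)
first_name = "Ada"  # placeholder value

# Cell 2: print the variable that cell 1 defined
print(first_name)
```

If cell 2 is run before cell 1, Python raises a NameError, which is why the execution order shown by the numbers matters.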

Copying and Deleting Cells: Practice
1. Make a new cell
2. Copy this cell
3. Paste this cell
4. Place the cursor in your first copy
5. Delete this cell

Quiz Question!
What is the difference between the two brackets?
[1] – Cell is executed (the number shows the order of execution)
[ ] – Cell is NOT executed

Code and Markdown
We have used code so far (it is selected by default).
1. Create a NEW cell
2. Click on the drop-down menu
3. Select Markdown

Markdown Headers: Practice
1. Create a new cell
2. Select Markdown (instead of Code)
3. Create a header # My Report
4. Run
5. Create a new cell
6. Create a sub-header ## Introduction
7. Run

Files for Workshop
1. Download files from
https://languagevariationsuite.wordpress.com/
2. Place these files into today’s Folder for CrashCourse
3. Go back to JupyterLab (Left Menu), find the Exploring Google Collaboration notebook (orange icon) and double click on the file name
Green – Running Notebook
Orange – ipynb files
The first time, you might be asked to choose a Kernel – select Python 3

Can we keep both files open? YES!
1. Select a file you want to add to the main dashboard.
2. Drag and Drop the file
3. Adjust the width as needed

Markdown: Practice
1. Click inside the Markdown cell to
view the structure (Google colab
notebook)
2. Reproduce the code in your own
Notebook (Until Latex Tutorial)
Step 1. Delete your current cells
Step 2. Create new cell
Step 3. Select Markdown
Step 4. Type the content
Step 5. Run
LaTeX Tutorial – homework
When Done – Save your Notebook and Close

Python Libraries
pandas (pd) – a library for data wrangling and analysis. The main data structure is the DataFrame (similar to a spreadsheet).
pd is a common short name:
import pandas as pd
Instead of pandas.DataFrame() you can use pd.DataFrame()
NumPy (np) – a library for linear algebra and scientific computing. The main data structure is the array (matrix).
import numpy as np
matplotlib (plt) – a scientific plotting library
%matplotlib inline (recommended to view figures in the browser)
import matplotlib.pyplot as plt
Source: Sarah Guido and Andreas Müller. 2016. Introduction to Machine Learning with Python. Chapter 1.
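
A minimal sketch of the short names in use. The values are made up for illustration. The %matplotlib inline line is left out because it is a notebook command that only runs inside Jupyter.

```python
import numpy as np
import pandas as pd

# pd.DataFrame() is the same constructor as pandas.DataFrame()
df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# A NumPy array holds elements of a single type
arr = np.array([[1, 2, 3], [4, 5, 6]])

print(df.shape)   # (rows, columns) of the DataFrame
print(arr.shape)  # (rows, columns) of the array
```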

Import Python Libraries
1. Create a new notebook
2. Drag and drop Exploring your Data.ipynb
3. You should have a split screen with your file and Exploring Data
4. Import libraries (copy/paste or type in) as shown into your
notebook and Run

Import Dataset: iris
Double click on dataset
You can switch between tabs
Import dataset
setosa, versicolor, virginica
petals, sepals – length, width

Data Description – Numeric Data
describe() Returns count, mean, std (standard deviation), min, max, quantiles
Something is wrong
datasetname.describe()
Which column is missing?
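
A small sketch of describe() on a made-up sample (not the workshop file), assuming column names in the style used later in these slides. It shows why a text column does not appear in the summary.

```python
import pandas as pd

# Made-up sample standing in for the workshop dataset
iris = pd.DataFrame({
    "sepal_length_cm": [5.1, 4.9, 6.4, 5.8],
    "petal_width_cm": [0.2, 0.2, 1.5, 2.2],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

summary = iris.describe()
print(summary)

# By default only numeric columns are summarized
print(list(summary.columns))
```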

Data Description – Categorical Data
value_counts() Returns counts for categorical values
Something is wrong
datasetname['columnname'].value_counts()
count()
Observations
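
A sketch of value_counts() on a made-up column. The misspelled class names here are hypothetical examples of the kind of problem the counts can reveal.

```python
import pandas as pd

# Made-up labels; two of them are deliberately inconsistent
iris = pd.DataFrame({
    "class": ["Iris-setosa", "Iris-setosa", "versicolor",
              "Iris-versicolor", "Iris-setossa"],
})

counts = iris["class"].value_counts()
print(counts)  # one row per distinct spelling, so typos show up as extra classes
```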

Data Cleaning
replace()
datasetname.replace(expression, substitution)
datasetname['column'].replace(expression, substitution)
replace by Iris-versicolor
replace by Iris-setosa
Why does CLASS have quotes?
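
A minimal sketch of replace() on one column. The wrong spellings are hypothetical; check value_counts() on your own data for the real ones. The column name is in quotes because it is a string key, not a Python variable.

```python
import pandas as pd

# Made-up labels with two inconsistent spellings
iris = pd.DataFrame({"class": ["Iris-setossa", "versicolor", "Iris-virginica"]})

# replace() returns a new Series, so assign it back to keep the change
iris["class"] = iris["class"].replace("versicolor", "Iris-versicolor")
iris["class"] = iris["class"].replace("Iris-setossa", "Iris-setosa")

print(iris["class"].tolist())
```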

Missing Data
isnull() returns false or true
isnull().sum() returns counts of NA
datasetname.isnull().sum()
Options:
1. Delete all 5 records
2. Replace 5 records with another value
Let’s try both!
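
A sketch of counting missing values, using a made-up sample with one missing entry instead of the five in the workshop data.

```python
import numpy as np
import pandas as pd

# Made-up sample; np.nan marks a missing value
iris = pd.DataFrame({
    "sepal_length_cm": [5.1, 4.9, 6.4],
    "petal_width_cm": [0.2, np.nan, 1.5],
})

print(iris.isnull())            # True where a value is missing

missing = iris.isnull().sum()   # True counts as 1, so the sum is the NA count
print(missing)
```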

Dropping NA
Creating a new dataset
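
One way to drop the incomplete records is pandas dropna(). This is a sketch on made-up data; the name iris_drop follows the name used later in these slides.

```python
import numpy as np
import pandas as pd

# Made-up sample with one incomplete row
iris = pd.DataFrame({
    "sepal_length_cm": [5.1, 4.9, 6.4],
    "petal_width_cm": [0.2, np.nan, 1.5],
})

# dropna() returns a new dataset without the rows that contain NA;
# the original dataset is left unchanged
iris_drop = iris.dropna()

print(len(iris), len(iris_drop))
```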

Replacing NA by MEAN
1. Calculate MEAN for petal_width_cm
2. Replace missing values by MEAN
np.mean()
np.mean(datasetname['columnname'])
Do you remember what NP is?
fillna()
datasetname['columnname'].fillna(value)
Create a new dataset (copy of iris)
Replace values by mean
inplace=True – changes will modify the current dataset
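
The two steps can be sketched as follows on made-up data. Here the filled column is assigned back instead of using inplace=True; both approaches change the copy, not the original iris.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing petal width
iris = pd.DataFrame({"petal_width_cm": [0.2, np.nan, 1.0]})

# Work on a copy so the original dataset stays untouched
iris_fill = iris.copy()

# For a pandas column, the mean is computed over the non-missing values
mean_width = np.mean(iris_fill["petal_width_cm"])

# Replace the missing values by the mean
iris_fill["petal_width_cm"] = iris_fill["petal_width_cm"].fillna(mean_width)

print(iris_fill)
```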

Saving New Files
Let’s save the clean iris_drop and iris_fill as two new CSV files (for future use)
to_csv()
datasetname.to_csv("filename.csv", index=False)
index=False – remove index numbers from the CSV
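
A sketch of saving and re-reading a file. It writes to a temporary folder so that nothing in your workshop folder is overwritten; in the notebook you would pass a plain file name instead.

```python
import os
import tempfile

import pandas as pd

# Made-up clean dataset
iris_drop = pd.DataFrame({
    "sepal_length_cm": [5.1, 6.4],
    "class": ["Iris-setosa", "Iris-versicolor"],
})

# Temporary location used only for this illustration
out_path = os.path.join(tempfile.mkdtemp(), "iris_drop.csv")

# index=False keeps the row numbers out of the file
iris_drop.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
print(list(reloaded.columns))
```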

How do I know if a library is installed by default in Anaconda?
Back to Anaconda Navigator
Look for a library in Installed or Not installed
Use a search window
Practice: Find the library - seaborn

Data Visualization
seaborn (sb) – a library for statistical visualization
pairplot()
sb.pairplot(datasetname, hue='columnname')
Diagonal – the distribution of each variable
Off-diagonal – scatterplots between two variables

Histograms
hist()
datasetname['columnname'].hist()
Something is wrong!!

Boxplot – Do We Have Outliers?
boxplot()
datasetname.boxplot(column='columnname', by='groupcolumn')
What is PLT?
Remember – you can use single or double quotes
What can we tell about the outlier?

Selecting Outlier
(Condition 1) & (Condition 2)
Iris-versicolor and sepal_length_cm < 1
iris_drop[(iris_drop['class'] == 'Iris-versicolor') & (iris_drop['sepal_length_cm'] < 1)]

Removing Outliers
~ tilde = NOT
dataset that is NOT (Condition 1 & Condition 2)
Save the current version of iris_drop:
iris_drop.to_csv("iris_drop.csv", index=False)
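
Selecting and then removing the outliers can be sketched like this. The rows are made up; only the two conditions come from the slides.

```python
import pandas as pd

# Made-up sample; the second row is the suspicious one
iris_drop = pd.DataFrame({
    "sepal_length_cm": [5.5, 0.067, 6.1, 5.0],
    "class": ["Iris-versicolor", "Iris-versicolor",
              "Iris-versicolor", "Iris-setosa"],
})

# (Condition 1) & (Condition 2): each condition needs its own parentheses
is_outlier = (iris_drop["class"] == "Iris-versicolor") & (iris_drop["sepal_length_cm"] < 1)
print(iris_drop[is_outlier])

# ~ inverts the mask: keep every row that is NOT an outlier
iris_drop = iris_drop[~is_outlier]
print(len(iris_drop))
```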

What is Machine Learning?
Extracting knowledge from data
(Google Developers. Machine Learning. 2020)
AI – Machine’s ability to solve problems
ML – Machine’s ability to learn
DL – a technique for ML imitating brain neural network
Sources: Aleksii Kharkovyna. 2019. A Beginner’s Guide to Data Science; Michael Garbade. 2018. Clearing the Confusion: AI vs Machine Learning vs Deep Learning Differences

Types of Machine Learning
Supervised learning
Input and Output provided during training (learning)
Modeling the relationship between measured data (INPUT) and labels (OUTPUT)
The model is applied to new data to predict a LABEL (e.g. spam / no-spam)
- Classification: Discrete Labels
- Regression: Continuous Values
Unsupervised learning
No training output provided. The purpose is to model the underlying structure of data
- No Labels
Source: Jake VanderPlas. 2016. Python Data Science Handbook. O’Reilly Media

Scikit-Learn
scikit-learn – Python modules for Machine Learning and Data Mining
https://scikit-learn.org/stable/

Classification
Labels
Measurements
Can we predict species of flower from 4 measurements (sepal length and
width, petal length and width)?
Sarah Guido and Andreas Müller. 2016. Introduction to Machine Learning with Python

Training and Testing Data
Training Data or Training Set (~80%)
The sample of data used to build/train models
Test Data or Hold-Out Set (~20%)
The sample of data used for model evaluation: the model has never seen this set
Naming Notation:
X - dataset
y - labels
X_train X_test
y_train y_test

Splitting into Training and Test Data
Step 1: Split dataset into FEATURES (X) and LABELS (y)
features – the dataset after removing the column 'class'
labels – the column 'class'
Step 2: Split into 80% training and 20% test data
train_test_split() with 20% test data
Sources: Sarah Guido and Andreas Müller. 2016. Introduction to Machine Learning with Python; Sergii Boiko. 2018. Pandas Axis Explained.
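
The two steps can be sketched as below on a made-up 10-row sample. The random_state value is an assumption added here so the split is repeatable; it is not from the slides.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up sample standing in for the cleaned iris dataset
iris = pd.DataFrame({
    "sepal_length_cm": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1, 6.5],
    "petal_width_cm": [0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1, 2.2],
    "class": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 4,
})

# Step 1: features are everything except 'class' (axis=1 means "drop a column")
X = iris.drop("class", axis=1)
y = iris["class"]

# Step 2: hold out 20% of the rows as test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```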

Decision Tree Classification
A decision tree splits data iteratively and assigns a binary (yes/no) or numeric value label
Step 1: Create the classifier
Step 2: Train (FIT) the classifier
Step 3: Validate the classifier
Remember we must use TEST data (unseen) for validation
Sources: Jake VanderPlas. 2016. Python Data Science Handbook. O’Reilly Media; Jebaseelan Ravi. 2018. Machine Learning Iris Classification.

Prediction Evaluation
Step 4: Make predictions with Test data
Step 5: Evaluate predictions
What is your score?
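
Steps 1 to 5 could look like the sketch below. It assumes the iris copy bundled with scikit-learn (load_iris) rather than the workshop CSV, and a fixed random_state for repeatability, so your own score may differ.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's built-in iris data instead of the workshop file
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(random_state=42)  # Step 1: create the classifier
clf.fit(X_train, y_train)                      # Step 2: train (fit) on training data
score = clf.score(X_test, y_test)              # Step 3: validate on unseen test data

y_pred = clf.predict(X_test)                   # Step 4: predict labels for test data
accuracy = accuracy_score(y_test, y_pred)      # Step 5: share of correct predictions

print(score, accuracy)
```

For a classifier, score() and accuracy_score() report the same quantity, so the two printed numbers match.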

Making Predictions – New Array
We found an Iris with a sepal length = 5 cm, a sepal width = 2.9 cm, a petal length = 1 cm, and a petal width = 0.2 cm.
NumPy (Numerical Python) – mathematical computation on arrays and matrices
Multidimensional arrays with homogeneous elements (usually numbers)
Compare: DataFrame – 2D with heterogeneous elements
A 1D array is a vector: the shape is just the number of components.
A 2D array is a matrix: the shape is (number of rows, number of columns).
A 3D array: the shape is (number of frames, rows in each frame, columns in each frame)
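
A sketch of predicting the species of the new flower. It assumes scikit-learn's built-in iris data and a tree trained on all of it, which is a simplification for illustration. The new measurements go into a 2D array with one row and four columns, because predict() expects one row per flower.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: built-in iris data, features ordered as
# sepal length, sepal width, petal length, petal width
data = load_iris()
clf = DecisionTreeClassifier(random_state=42).fit(data.data, data.target)

# One flower = one row with four measurements (note the double brackets)
new_flower = np.array([[5.0, 2.9, 1.0, 0.2]])
print(new_flower.shape)

prediction = clf.predict(new_flower)
print(data.target_names[prediction[0]])
```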

Arrays
1D Array – One array with 3 elements
2D Array – Two arrays with 3 elements each
3D Array – Three frames with 2 arrays and 2 elements each
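
The three cases can be built directly in NumPy; the numbers are arbitrary.

```python
import numpy as np

# 1D: one array with 3 elements
one_d = np.array([1, 2, 3])

# 2D: two arrays with 3 elements each
two_d = np.array([[1, 2, 3],
                  [4, 5, 6]])

# 3D: three frames, each with 2 arrays of 2 elements
three_d = np.array([[[1, 2], [3, 4]],
                    [[5, 6], [7, 8]],
                    [[9, 10], [11, 12]]])

print(one_d.shape)    # (3,)
print(two_d.shape)    # (2, 3)
print(three_d.shape)  # (3, 2, 2)
```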

Arrays: Practice
Check the dimensions for X_test, X_train, y_test, y_train:
- How many columns are in X_train?
- What is the size of X_train and X_test?
- How many labels are in y_test and y_train?
- Which datasets are 2D and which ones are 1D?

Unsupervised Learning
Grouping observations into a number of groups (k-groups) based on
similar characteristics
- Discovering underlying patterns
- Grouping similar data together
- Finding meaningful structure
“learning by observation”: no label information is known
Source: CS 412 Jiawei Han (2018)

K-Means Cluster
K Means – 1) Choosing a number of
K clusters and 2) Assigning Centroids
(mean)
We are using the entire data (without split)
and we do not need labels.

Predict Clusters
Let’s create a scatterplot with new labels!
Create a new column
Replace integers by species names (you can use your own labels)
Use labels for color coding
Note: While similar to Classification, the labels have no a priori meaning. Label assignment is random.
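
A sketch of K-Means on the full dataset without labels. It assumes scikit-learn's built-in iris measurements; K = 3, n_init and random_state are choices made here, and the group names are hypothetical because the cluster numbers carry no meaning of their own.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: built-in iris measurements only (no species column is used)
iris = load_iris(as_frame=True).data

# 1) choose K clusters, 2) let K-Means place the centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# Create a new column with the predicted cluster number for every row
iris_clustered = iris.copy()
iris_clustered["cluster"] = kmeans.fit_predict(iris)

# Replace integers by names; these names are placeholders
names = {0: "group A", 1: "group B", 2: "group C"}
iris_clustered["cluster_name"] = iris_clustered["cluster"].map(names)

print(iris_clustered["cluster_name"].value_counts())
```

The new column can then be passed as the color variable of a scatterplot.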

Scatterplots
Old Labels | Predicted Labels

Hierarchical Cluster
Partitional Clustering – NO overlapping points (K-Means)
Hierarchical – overlapping points
Dendrogram
based on similarity distance
based on dissimilarity distance
KMeans – we measure a distance
between two pairs of observations
Hierarchical Cluster – we measure a
distance between two clusters

Hierarchical Cluster – Linkage Methods
UC Business Analytics R Programming Guide. Hierarchical Cluster Analysis.
Ward – a linkage method
that minimizes the total
within-cluster variance
We do not need labels for clustering
but we will use them for dendrogram
plotting.
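
A sketch of Ward linkage with SciPy, assuming scikit-learn's built-in iris data. The labels are not used to build the clusters; they only name the leaves of the dendrogram. no_plot=True is used so the example runs without drawing a figure; remove it in the notebook to see the plot.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import load_iris

# Assumption: built-in iris data
data = load_iris()
X = data.data
leaf_labels = [data.target_names[i] for i in data.target]

# Ward linkage merges the pair of clusters that adds the least
# within-cluster variance at each step
Z = linkage(X, method="ward")

# Cut the tree into at most 3 flat clusters
clusters = fcluster(Z, t=3, criterion="maxclust")

# Species names are used only to label the dendrogram leaves
tree = dendrogram(Z, labels=leaf_labels, no_plot=True)

print(Z.shape)
print(len(tree["ivl"]))
```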

Summary
Let’s save your Figure!

Summary
Doaa Taha. 2018. Lecture slides

Regression
Modeling the relationship between one or multiple features (independent variables or predictors) and a target (response or predicted variable)
Source: Vahid Mirjalili and Sebastian Raschka. 2019. Python Machine Learning. Packt Publishing

Regression Types
Independent Variables
- Simple: 1 independent
- Multiple: > 1 independent
- Ridge: highly correlated
- Stepwise: identification of best variables
- Lasso
Regression Line Shape
- Linear
- Quadratic
- Curvilinear
- Logistic
Dependent Variable
- Linear: continuous
- Logistic: binary
- Nominal: > 2 categories
- Poisson: count
- Ordinal: ordered response
- Multivariate: > 1 dependent
Source: Ch.4 Multivariate Data Analysis. Joseph Hair et al. 2010. Pearson

Linear Regression Assumptions
1. Variables Type: Continuous (Interval or Ratio)
2. Linear: The relationship between Y and x is linear
Inspect your Y and X relationship in a scatterplot
3. Outliers: There should be no significant outliers
High leverage, Large residuals, Large Influence (David Dalpiaz, 2019)
4. Independence: You should have independence of observations
5. Equal Variance: The variances along the line of best fit remain similar.
Heteroscedasticity vs. Homoscedasticity
Normal: The errors ϵ are normally distributed
Note: the values of x are fixed. We do not make a distributional assumption about the predictor variable.
Sources: Ch.4 Multivariate Data Analysis. Joseph Hair et al. 2010. Pearson; Ch.13 Applied Statistics in R. David Dalpiaz

Housing Dataset - Boston
https://raw.githubusercontent.com/rasbt/python-machine-learning-book-3rd-edition/master/ch10/housing.data.txt
Vahid Mirjalili and Sebastian Raschka. 2019. Python Machine Learning.
- Use URL to import data
- Dataset has no header
- Separator is spaces (one or more)
header=None,
sep="\s+"
Add column names
Top 5 rows
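
A sketch of the import options. To stay runnable offline it reads two illustrative rows from a string instead of the URL. The column names are an assumption based on the usual names for this dataset; only MEDV is named in these slides.

```python
import io

import pandas as pd

# Assumed column names; MEDV is the target used in the next slides
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# Illustrative rows in the same layout: no header, values separated by spaces
sample = io.StringIO(
    "0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30 396.90   4.98  24.00\n"
    "0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80 396.90   9.14  21.60\n"
)

# header=None: the first line is data; sep=r"\s+": one or more spaces
df = pd.read_csv(sample, header=None, sep=r"\s+")

# Add column names
df.columns = columns

print(df.head())  # head() shows the top 5 rows (only 2 exist here)
```

To load the real file, pass the URL from the slide in place of sample.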

Preparing Dataset
Can we predict housing prices (MEDV) in Boston?
Vahid Mirjalili and Sebastian Raschka. 2019. Python Machine Learning.
STEP 1. Create Features (Predictors) and Target (Response)
sets
STEP 2: Scale between 0 and 1
STEP 3: Split in Training 80% and Test 20% data

Linear Regression Model
STEP 1: Create Model
STEP 2: Fit Model
STEP 3: Evaluate
R²

Prediction
y_pred = model.predict(X_test)
Create a DataFrame
with correct labels
and predicted labels
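
The preparation, model and prediction steps can be sketched end to end as below. The data is synthetic (generated with a fixed seed) rather than the housing file, MinMaxScaler is one possible way to scale between 0 and 1, and the column names actual and predicted are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic features (X) and target (y), not the housing data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=100)

# Scale each feature to the range 0..1
X_scaled = MinMaxScaler().fit_transform(X)

# Split into 80% training and 20% test data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

model = LinearRegression()       # create the model
model.fit(X_train, y_train)      # fit on training data
y_pred = model.predict(X_test)   # predict for unseen test data
r2 = r2_score(y_test, y_pred)    # evaluate with R squared

# Side-by-side table of correct and predicted values
results = pd.DataFrame({"actual": y_test, "predicted": y_pred})

print(round(r2, 3))
print(results.head())
```

The scaling is applied before the split to follow the step order on the slides; fitting the scaler on the training data only is a common alternative.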
