Hands-on ML - CH1

1. The Machine Learning Landscape
Taka Wang
2021-07-19

[Figure: Deep Learning (neural nets) is a subset of Machine Learning (dozens of different ML methods), which is a subset of Artificial Intelligence]

Difference between deep learning and usual ML

Why use machine learning
Traditional Approach vs. Machine Learning Approach
Learning from data
Example: email spam filtering
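The contrast between the two approaches can be sketched with a toy spam filter. This is only an illustration under stated assumptions: the six emails, the keyword rule, and the function names (rule_based_is_spam, train_word_counts, learned_is_spam) are made up for this example and do not come from the slides.

```python
from collections import Counter

# Toy labeled emails (hypothetical data): 1 = spam, 0 = ham.
TRAIN = [
    ("win a free prize now", 1),
    ("free credit card offer", 1),
    ("claim your free prize", 1),
    ("meeting moved to monday", 0),
    ("lunch on monday with the team", 0),
    ("project report attached", 0),
]

def rule_based_is_spam(text):
    """Traditional approach: a hand-written rule that someone must keep updating."""
    words = text.split()
    return "free" in words or "prize" in words

def train_word_counts(examples):
    """ML approach: count how often each word appears in spam and in ham."""
    spam, ham = Counter(), Counter()
    for text, label in examples:
        (spam if label == 1 else ham).update(text.split())
    return spam, ham

def learned_is_spam(text, spam, ham):
    """Flag an email whose words were seen more often in spam than in ham."""
    score = sum(spam[w] - ham[w] for w in text.split())
    return score > 0

spam_counts, ham_counts = train_word_counts(TRAIN)
print(learned_is_spam("free prize inside", spam_counts, ham_counts))
print(learned_is_spam("team meeting on monday", spam_counts, ham_counts))
```

When spammers change their wording, the hand-written rule has to be edited, while the learned filter only needs to be retrained on newer labeled examples.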

Automatically adapting to change
It's easier to automate than traditional methods

Machine Learning can help humans learn

When to use machine learning?
● Some underlying pattern exists to be learned
● So a performance measure can be improved
● But the pattern has no easily programmable definition
● There is data about the pattern
● ...
Ref: Prof. Hsuan-Tien Lin (NTU), Machine Learning Foundations

Example of Applications
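★ Analyze images of products on a production line and classify them automatically
○ CNNs (Classification)
★ Detect tumors in brain scans
○ CNNs (Segmentation)
★ Automatically classify news articles and forum comments
○ NLP (Text Classification)
○ RNNs, CNNs, Transformers
★ Automatically summarize long documents
○ NLP (Text Summarization)
○ RNNs, CNNs, Transformers
★ Build a chatbot or personal assistant
○ NLP, NLU
★ Make an app react to voice commands
○ Speech recognition
○ RNNs, CNNs, Transformers
★ Forecast a company's revenue for next year from many performance metrics
○ Linear/Polynomial Regression
○ SVMs, Random Forests, ANNs
○ RNNs, CNNs, Transformers (when using sequences of past metrics)
★ Detect credit card fraud
○ Anomaly Detection
★ Segment customers by their purchases and design a different marketing strategy for each segment
○ Clustering
★ Represent complex, high-dimensional data in a clear and insightful diagram
○ Data visualization, dimensionality reduction
★ Recommend products a customer may be interested in, based on past purchases
○ Recommender system
★ Build an intelligent bot for a game
○ Reinforcement Learning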

Types of Machine Learning Systems
★ Level of human supervision
○ supervised
○ unsupervised
○ semi-supervised
○ reinforcement learning
★ Whether the system can learn on the fly
○ online learning
○ batch learning
★ Whether it predicts from known instances or from a learned pattern
○ instance-based
○ model-based
These are the criteria mentioned in the book; other ways of categorizing exist.
The criteria are not mutually exclusive, and many systems combine them.

Supervised Learning
Classification
Regression

Unsupervised Learning
Unlabeled training set
Clustering (e.g. grouping customers)
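A minimal sketch of clustering on an unlabeled training set, using a hand-written one-dimensional k-means. The spend values, the starting centers, and the function name kmeans_1d are assumptions made for illustration; a real project would more likely use a library implementation.

```python
def kmeans_1d(values, centers, n_iter=10):
    """Tiny 1-D k-means: alternate between assigning points and moving centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical monthly spend per customer; note that there are no labels.
spend = [12, 15, 14, 80, 85, 90, 13, 88]
centers, groups = kmeans_1d(spend, centers=[0, 100])
print(centers)
print(groups)
```

The algorithm is never told which customers are low or high spenders; the two groups emerge from the data alone.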

Anomaly Detection (e.g. credit card fraud)
Visualization (semantic clusters)
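One simple way to picture anomaly detection is to learn what normal instances look like and flag anything far from them. The sketch below assumes made-up transaction amounts and a plain standard-deviation threshold; the names normal_amounts and is_anomaly are hypothetical, and real fraud detection uses many more features and stronger models.

```python
import statistics

# Hypothetical amounts of past, legitimate transactions.
normal_amounts = [20.0, 25.0, 22.0, 30.0, 27.0, 24.0, 26.0, 23.0]

# "Training": summarize what normal instances look like.
mean = statistics.mean(normal_amounts)
std = statistics.stdev(normal_amounts)

def is_anomaly(amount, threshold=3.0):
    """Flag an amount lying more than `threshold` standard deviations from the mean."""
    return abs(amount - mean) / std > threshold

print(is_anomaly(28.0))   # an ordinary purchase
print(is_anomaly(950.0))  # far outside the usual range
```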

Association Rule Learning

Semi-supervised Learning
There are two classes: triangles and squares.
Circles represent unlabeled samples.
Example: Google Photos

Reinforcement Learning
[Figure: the learning system]


Batch vs Online Learning
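Batch learning
● Also called offline learning
● Old data + new data → retrain the model
● Retrain at regular intervals
Online learning
● Learns on the fly, one instance or one mini-batch at a time
● Learning rate: how fast the system adapts to new data (and forgets old data)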
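Learning on the fly can be sketched as a linear model that is updated one instance at a time with stochastic gradient descent. The data stream (generated from y = 2x + 1), the learning rate of 0.05, and the function name sgd_step are assumptions chosen for illustration.

```python
def sgd_step(theta0, theta1, x, y, learning_rate):
    """One online update of the model y ≈ theta0 + theta1 * x from a single instance."""
    error = (theta0 + theta1 * x) - y
    theta0 -= learning_rate * error
    theta1 -= learning_rate * error * x
    return theta0, theta1

# Hypothetical stream of instances; each one is seen, used, and then discarded.
stream = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]] * 500

theta0, theta1 = 0.0, 0.0
for x, y in stream:
    theta0, theta1 = sgd_step(theta0, theta1, x, y, learning_rate=0.05)

print(theta0, theta1)  # approaches the generating values 1.0 and 2.0
```

A higher learning rate makes the model follow new data faster but also forget older data sooner; a lower one makes it more stable and slower to adapt.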

Instance-based vs Model-based Learning
Model-based learning
Instance-based learning (also called memory-based learning)

Model-based Learning Example
Country         GDP per capita (USD)   Life satisfaction
Hungary         12,240                 4.9
Korea           27,195                 5.8
France          37,675                 6.5
Australia       50,962                 7.3
United States   55,805                 7.2
Vague correlation between GDP per capita and life satisfaction

Model-based Learning: Simple Linear Model
life_satisfaction = θ0 + θ1 × GDP_per_capita
1. Model selection
2. Parameters: θ0 and θ1
3. Performance measure: a cost function or a utility function
4. Training
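As a rough sketch of the training step, the parameters θ0 and θ1 can be found in closed form by minimizing the mean squared error, one common cost function for linear models. The five data points are the rows of the table shown earlier in the deck; the function name fit_simple_linear is hypothetical, and the fitted values only describe these five countries.

```python
# (GDP per capita in USD, life satisfaction) for the five countries in the deck's table.
data = [
    (12240, 4.9),   # Hungary
    (27195, 5.8),   # Korea
    (37675, 6.5),   # France
    (50962, 7.3),   # Australia
    (55805, 7.2),   # United States
]

def fit_simple_linear(points):
    """Return (theta0, theta1) that minimize the mean squared error."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    theta1 = s_xy / s_xx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

theta0, theta1 = fit_simple_linear(data)
prediction = theta0 + theta1 * 22587  # a GDP per capita not present in the table
print(theta0, theta1, prediction)
```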

Data vs. Algorithms: the Surprising Impact of Data
Should the money be spent on algorithms or on the corpus?

Main Challenges of Machine Learning
Traditional Approach vs. Machine Learning Approach

Bad Data
1. Insufficient Quantity of Training Data
2. Nonrepresentative Training Data
   a. Sample too small → sampling noise
   b. Sample very large but drawn with a flawed method → sampling bias
3. Poor-Quality Data
   a. Missing values
   b. Outliers
4. Irrelevant Features
   a. Feature extraction
   b. Feature selection
Model Bias

1936 Literary Digest Poll
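The Literary Digest ran a poll before the election. It mailed 10 million questionnaires to its readers and received 2.3 million back.
The Literary Digest had correctly predicted the previous five presidential elections. On October 31, 1936, it announced its prediction that the Republican candidate Alf Landon would win with 370 of the 531 electoral votes (he actually received only 8).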

Bad Algorithms
Regression
Classification

Bad Algorithms - Overfitting the Training Data
★ Low training error
★ High generalization error
★ Spurious pattern: countries with a w in their name all have high life satisfaction?!
Possible solutions
➢ Simplify the model
➢ Regularization
➢ Gather more data
➢ Reduce the noise in the training data
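Regularization can be sketched on a one-feature linear model by adding an L2 penalty on the slope, which pulls the fitted line toward a flatter, simpler one. The three data points, the penalty strength alpha, and the function name fit_line are assumptions made for this example.

```python
def fit_line(xs, ys, alpha=0.0):
    """Least-squares line with an L2 penalty `alpha` on the slope.

    alpha=0 gives plain linear regression; a larger alpha constrains the slope more.
    The intercept is not penalized.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / (s_xx + alpha)
    intercept = mean_y - slope * mean_x
    return intercept, slope

# A tiny, noisy training set (hypothetical values).
xs = [1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0]

plain = fit_line(xs, ys)                   # unconstrained fit
regularized = fit_line(xs, ys, alpha=2.0)  # constrained fit
print(plain, regularized)
```

The amount of regularization (alpha here) is a hyperparameter: it is set before training rather than learned from the training data.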


Bad Algorithms - Underfitting the Training Data
Possible solutions
★ Select a more powerful model, with more parameters
★ Feed better features to the learning algorithm (feature engineering)
★ Reduce the constraints on the model (e.g. lower the regularization hyperparameter)

Stepping Back
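★ Machine learning is about making computers get better at a task by learning from data, instead of explicitly coding rules.
★ There are many types of ML systems: supervised or not, batch or online, instance-based or model-based.
★ A learning algorithm is either model-based (it tunes parameters on the training set) or instance-based.
★ The system will not perform well if the training set is too small, noisy, or not representative.
★ Nor will it perform well if the model is too simple or too complex.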

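Key questions
1. How do we evaluate a model? By testing it directly in production?
2. How do we choose a model?
3. Once a model is chosen, how do we tune its hyperparameters?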

Model Selection
Compare the models' test error (an estimate of the generalization error)

Hyperparameter Tuning
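Try many hyperparameter combinations and keep the one with the best generalization error on the test set.
If the model still performs poorly in practice, the hyperparameters have probably been optimized for the test set.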

Holdout Validation
Training Set | Validation Set | Test Set
The validation set is also known as the development set or dev set.
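Holdout validation amounts to a three-way split of the data, sketched below. The 60/20/20 proportions, the seed, and the function name three_way_split are assumptions for illustration, not values taken from the slides.

```python
import random

def three_way_split(data, val_ratio=0.2, test_ratio=0.2, seed=42):
    """Shuffle the data once, then cut it into training, validation, and test sets."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_ratio)
    n_val = int(len(items) * val_ratio)
    test_set = items[:n_test]
    val_set = items[n_test:n_test + n_val]
    train_set = items[n_test + n_val:]
    return train_set, val_set, test_set

train_set, val_set, test_set = three_way_split(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Candidate models and hyperparameters are compared on the validation set, and the test set is touched only once at the end to estimate the generalization error.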

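Problems with the Validation Set
● Validation set too large: too little is left for training, so the candidate models do not reflect how a model trained on the training and validation sets combined would perform.
● Validation set too small: the evaluation is imprecise.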

n-fold cross-validation
[Figure: training set, validation set, and test set under n-fold cross-validation]
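The index bookkeeping behind n-fold cross-validation can be sketched as follows. The function name k_fold_splits is hypothetical, and the sketch assumes the test set has already been set aside, so only the remaining samples are split.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, val_indices); each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so that sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

splits = list(k_fold_splits(10, 3))
for train_idx, val_idx in splits:
    print(train_idx, val_idx)
```

Each candidate model is trained once per fold and its validation scores are averaged, which gives a steadier estimate than a single small validation set at the cost of more training time.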

Common Supervised Learning Algorithms
★ K-Nearest Neighbors
★ Linear regression
★ Logistic regression
★ Support Vector Machines (SVMs)
★ Decision Tree
★ Random Forest
★ Neural Networks*

Common Unsupervised Learning Algorithms
★ Clustering
○ K-Means
○ DBSCAN
○ Hierarchical Cluster Analysis (HCA)
★ Anomaly detection and novelty detection
○ One-class SVM
○ Isolation Forest
★ Visualization and dimensionality reduction
○ Principal Component Analysis (PCA)
○ Kernel PCA
○ Locally-Linear Embedding (LLE)
○ t-distributed Stochastic Neighbor Embedding (t-SNE)
★ Association rule learning
○ Apriori
○ Eclat

import numpy as np
import pandas as pd
import sklearn.linear_model  # for the k-NN variant: import sklearn.neighbors

# Note: datapath and prepare_country_stats() are not defined on this slide;
# they are assumed to be defined elsewhere (e.g. in the book's companion notebook).

# Load the data
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv(datapath + "gdp_per_capita.csv", thousands=',',
                             delimiter='\t', encoding='latin1', na_values="n/a")

# Prepare the data: one column of features (X) and one column of targets (y)
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Select a linear model
model = sklearn.linear_model.LinearRegression()
# k-NN variant: model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new))  # outputs [[ 5.96242338]]
Train a linear model using Scikit-Learn
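The commented-out alternative in the code above swaps the linear model for k-nearest neighbors, an instance-based method. A dependency-free sketch of that idea follows. It uses only the five countries from the table earlier in the deck, so its result is not comparable to the output printed above; the function name knn_predict is hypothetical.

```python
# (GDP per capita in USD, life satisfaction) for the five countries in the deck's table.
known = [
    (12240, 4.9),   # Hungary
    (27195, 5.8),   # Korea
    (37675, 6.5),   # France
    (50962, 7.3),   # Australia
    (55805, 7.2),   # United States
]

def knn_predict(gdp, instances, k=3):
    """Average the life satisfaction of the k countries whose GDP per capita is closest."""
    nearest = sorted(instances, key=lambda row: abs(row[0] - gdp))[:k]
    return sum(satisfaction for _, satisfaction in nearest) / k

prediction = knn_predict(22587, known)  # Cyprus' GDP per capita, as in the code above
print(prediction)
```

No parameters are fitted here: the model is the stored instances themselves, and prediction is a lookup plus an average.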

Training set, train-dev set, validation set, test set
Data sources: pictures crawled from the web vs. pictures taken with a phone