Machine Learning Landscape
Taka Wang
20210719
[Figure: nested circles, from outermost to innermost: Artificial Intelligence, Machine Learning, Neural Nets, Deep Learning. Machine Learning covers dozens of different ML methods.]
Difference between deep learning and usual ML
Why use machine learning
Traditional Approach vs. Machine Learning Approach
Learning from data
Email Spam
Automatically adapting to change
It's easier to automate than traditional methods
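A minimal sketch of the machine learning approach to the spam example: instead of hand-written rules, a classifier learns from labeled emails which words signal spam. The six emails and their labels are made up for illustration, and the bag-of-words plus naive Bayes pipeline is just one possible choice, not something the slides prescribe.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy emails, invented for illustration only
emails = [
    "win a free prize now",
    "free credit card offer, click now",
    "claim your free prize today",
    "meeting moved to monday morning",
    "lunch on monday with the project team",
    "project report attached for the meeting",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

# Word counts as features, naive Bayes as the learner
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

predictions = spam_filter.predict(["free prize now", "project meeting on monday"])
print(predictions)
```

Adapting to new spam wording then means retraining on fresh labeled emails rather than editing rules by hand.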
Machine Learning can help humans learn
When to use machine learning?
● Exists some underlying pattern to be learned
● Performance measure can be improved
● But no programmable definition
● There is data about the pattern
● ...
Ref: Prof. Hsuan-Tien Lin (NTU), Machine Learning Foundations
Example of Applications
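★ Analyze product images from a production line and classify them automatically
○ CNNs (Classification)
★ Detect tumors in brain scans
○ CNNs (Segmentation)
★ Automatically classify news articles / forum comments
○ NLP (Text Classification)
○ RNNs, CNNs, Transformers
★ Automatically summarize long documents
○ NLP (Text Summarization)
○ RNNs, CNNs, Transformers
★ Build a chatbot or personal assistant
○ NLP, NLU
★ Make an app respond to voice commands
○ Speech recognition
○ RNNs, CNNs, Transformers
★ Forecast a company's revenue for next year from many performance metrics
○ Linear/Polynomial Regression
○ SVMs, Random Forests, ANNs
○ RNNs, CNNs, Transformers (for sequences of past metrics)
★ Detect credit card fraud
○ Anomaly Detection
★ Segment customers by their purchases so that a different marketing strategy can be designed for each segment
○ Clustering
★ Represent complex, high-dimensional data in a clear and insightful diagram
○ Data visualization, dimensionality reduction
★ Recommend products a customer may be interested in, based on past purchases
○ Recommender system
★ Build an intelligent bot for a game
○ Reinforcement Learning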
Types of Machine Learning Systems
★ levels of human supervision
○ supervised
○ unsupervised
○ semi-supervised
○ reinforcement learning
★ learn on-the-fly
○ online learning
○ batch learning
★ predict via known instances or learned pattern
○ instance-based
○ model-based
These are the criteria mentioned in the book; other ways of classifying ML systems exist.
These criteria are not mutually exclusive; many systems combine several of them.
Supervised Learning
Classification
Regression
Unsupervised Learning
Clustering
Unlabeled training set
Customer
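As a rough illustration of clustering an unlabeled training set, the sketch below groups customers with K-Means. The customer features (annual spend, visits per month) and their values are hypothetical, and K-Means with two clusters is an assumption made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [annual spend, visits per month]; no labels are given
customers = np.array([
    [200, 1], [220, 2], [250, 1],       # occasional shoppers
    [900, 8], [950, 9], [1000, 10],     # frequent shoppers
], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
segments = kmeans.fit_predict(customers)

# The cluster ids themselves are arbitrary; only the grouping matters
print(segments)
```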
Unsupervised Learning
Anomaly Detection
(Visualization) Semantic Clusters
Credit card fraud
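A small sketch of anomaly detection in the spirit of the credit card fraud example. The transactions are synthetic (two made-up standardized features), the two "fraud" points are planted by hand, and Isolation Forest with `contamination=0.01` is an assumption for this illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic transactions: 200 ordinary ones plus 2 planted extreme ones
normal = rng.normal(0, 1, size=(200, 2))
fraud = np.array([[10.0, 10.0], [-10.0, 9.0]])
transactions = np.vstack([normal, fraud])

# contamination is the assumed share of anomalies in the data
detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(transactions)

flags = detector.predict(transactions)  # 1 = normal, -1 = anomaly
print(np.where(flags == -1)[0])         # indices flagged as anomalies
```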
Unsupervised Learning
Association Rule Learning
Semi-supervised Learning
Two classes: triangles and squares
Circles represent unlabeled samples
Google Photos
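The idea of a few labeled samples plus many unlabeled ones can be sketched as follows. The points are hypothetical, class 0 and class 1 stand in for the two shapes, and `-1` marks an unlabeled sample (the convention used by scikit-learn's semi-supervised estimators). `LabelSpreading` is one possible algorithm, chosen here only for illustration.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Two hypothetical groups of points; only one point per group is labeled
X = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2],
    [4.0, 4.0], [4.1, 3.9], [3.9, 4.2], [4.2, 4.1],
])
y = np.array([0, -1, -1, -1, 1, -1, -1, -1])  # -1 = unlabeled

model = LabelSpreading()  # default RBF kernel
model.fit(X, y)

# Labels inferred for every sample, including the unlabeled ones
print(model.transduction_)
```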
Reinforcement Learning
Learning System
Batch vs Online Learning
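● Batch learning is also called offline learning
● Old data + new data → retrain
● Retrain at regular intervals
learning on-the-fly
Batch learning
mini-batch
Learning Rate (how fast old data is forgotten)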
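Online learning with mini-batches might look like the sketch below: the model is updated incrementally with `partial_fit` as each mini-batch arrives, and the learning rate controls how strongly each batch moves the parameters. The data stream (y = 3x + 2 plus noise) and the constant learning rate of 0.01 are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# Constant learning rate: a higher eta0 adapts faster but forgets old data sooner
model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)

# Simulated stream of mini-batches drawn from y = 3x + 2 + noise
for _ in range(500):
    X_batch = rng.uniform(-1, 1, size=(20, 1))
    y_batch = 3 * X_batch[:, 0] + 2 + rng.normal(0, 0.1, size=20)
    model.partial_fit(X_batch, y_batch)  # incremental update, no full retraining

print(model.coef_, model.intercept_)
```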
Instance-based vs Model-based Learning
model-based learning
Instance-based learning
memory-based
Model-based Learning Example
Country          GDP per capita ($)    Life Satisfaction
Hungary          12,240                4.9
Korea            27,195                5.8
France           37,675                6.5
Australia        50,962                7.3
United States    55,805                7.2
Vague Correlation
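To make the instance-based vs. model-based contrast concrete, the sketch below fits both kinds of learner to the five rows of the table above and predicts for a GDP per capita of 22,587 (the Cyprus figure used in the deck's Scikit-Learn example). Using only five rows is a simplification for illustration, so the numbers will differ from results on the full dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# The five rows from the table
gdp = np.array([[12240], [27195], [37675], [50962], [55805]], dtype=float)
life_sat = np.array([4.9, 5.8, 6.5, 7.3, 7.2])

# Instance-based: keep the examples, average the 3 most similar countries
knn = KNeighborsRegressor(n_neighbors=3).fit(gdp, life_sat)

# Model-based: learn the parameters of a line, then discard the examples
linear = LinearRegression().fit(gdp, life_sat)

X_new = [[22587]]
knn_pred = knn.predict(X_new)[0]
linear_pred = linear.predict(X_new)[0]
print(f"k-NN: {knn_pred:.2f}, linear model: {linear_pred:.2f}")
```

The k-NN prediction is simply the mean life satisfaction of Hungary, Korea and France, the three rows closest in GDP per capita.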
Model-based Learning: Simple Linear Model
life_satisfaction = θ0 + θ1 × GDP
Model* Selection
Parameter
Measurement: cost function, utility function
Training
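The steps above (pick a model, measure it with a cost function, train to find the parameters) can be sketched with plain NumPy. Mean squared error as the cost function and the closed-form least-squares solution are assumptions chosen for this illustration, and only the five table rows from this deck are used.

```python
import numpy as np

# The five (GDP per capita, life satisfaction) rows from the deck's table
gdp = np.array([12240, 27195, 37675, 50962, 55805], dtype=float)
life_sat = np.array([4.9, 5.8, 6.5, 7.3, 7.2])

def mse_cost(theta0, theta1):
    """Cost function: mean squared error of the linear model on the data."""
    predictions = theta0 + theta1 * gdp
    return np.mean((predictions - life_sat) ** 2)

# Training: closed-form least squares for a one-feature linear model
gdp_centered = gdp - gdp.mean()
theta1 = np.sum(gdp_centered * (life_sat - life_sat.mean())) / np.sum(gdp_centered ** 2)
theta0 = life_sat.mean() - theta1 * gdp.mean()

print(f"theta0 = {theta0:.2f}, theta1 = {theta1:.2e}, cost = {mse_cost(theta0, theta1):.4f}")
```

Any other choice of θ0 and θ1 gives a cost at least as high on these five rows, which is what "training" means for this model.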
Data vs. Algorithms: The Unreasonable Effectiveness of Data
Should the money be spent on algorithms or on the corpus?
Break
Main Challenges of Machine Learning
Traditional Approach vs. Machine Learning Approach
Bad Data
1. Insufficient Quantity of Training Data
2. Nonrepresentative Training Data
a. sample too small → sampling noise
b. very large sample but flawed sampling method → sampling bias
3. Poor-Quality Data
a. Missing value
b. Outliers
4. Irrelevant Features
a. Feature Extraction
b. Feature Selection
Model Bias
1936 Literary Digest Poll
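The Literary Digest ran a poll ahead of the election. It mailed out 10 million
questionnaires, including to its own readers, and got 2.3 million back.
The Literary Digest had correctly predicted the results of the previous
5 presidential elections. On October 31, 1936, it announced its prediction that
the Republican candidate Alf Landon would win with 370 of the 531 electoral votes
(in fact he won only 8 electoral votes).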
Bad Algorithms
Regression
Classification
Bad Algorithms - Overfitting the Training Data
★ Low training error
★ High generalization error
★ Do all countries with a "w" in their name have high life satisfaction?!
Possible solutions
➢ Simplify the model
➢ Regularization
➢ Gather more data
➢ Reduce the noise
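A rough sketch of overfitting and two of the remedies listed above (a simpler model, regularization). The data is synthetic (a linear trend plus noise), and the degree-10 polynomial and `Ridge(alpha=1.0)` are arbitrary choices for the illustration. The degree-10 model is expected to reach the lowest training error of the three; on most runs its error on unseen data is noticeably worse, but the exact numbers depend on the random sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

def make_data(n_samples):
    # Hypothetical data: a linear trend plus Gaussian noise
    X = rng.uniform(-1, 1, size=(n_samples, 1))
    y = 2 * X[:, 0] + 1 + rng.normal(0, 0.3, size=n_samples)
    return X, y

X_train, y_train = make_data(15)    # small training set
X_test, y_test = make_data(200)     # stands in for unseen data

models = {
    "linear": LinearRegression(),
    "degree-10": make_pipeline(PolynomialFeatures(degree=10), LinearRegression()),
    "degree-10 + ridge": make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=1.0)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[name] = (train_mse, test_mse)
    print(f"{name:18s} train MSE = {train_mse:.3f}   test MSE = {test_mse:.3f}")
```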
Bad Algorithms - Underfitting the Training Data
★ Use a more powerful model, with more parameters
★ Feed better features to the algorithm (feature engineering)
★ Reduce the constraints on the model (e.g. lower the regularization hyperparameter)
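Underfitting and the "better features" remedy can be sketched as follows. The data is synthetic (a quadratic curve plus noise, an assumption for the example): a straight line is too simple for it, while adding a squared feature lets the same linear algorithm fit well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical data following a quadratic curve
X = rng.uniform(-3, 3, size=(100, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=100)

# Too simple: a straight line cannot follow the curve
linear = LinearRegression().fit(X, y)

# Better features: add x^2, keep the same learning algorithm
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

linear_mse = mean_squared_error(y, linear.predict(X))
quadratic_mse = mean_squared_error(y, quadratic.predict(X))
print(f"linear: {linear_mse:.3f}, with squared feature: {quadratic_mse:.3f}")
```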
Stepping Back
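★ Machine learning is about making computers get better at a task by learning from data, instead of explicitly coding rules.
★ There are many types of ML systems: supervised or not, batch or online, instance-based or model-based.
★ An algorithm is either model-based (parameters) or instance-based, and is tuned on the training set.
★ Data problems: training set too small, noisy, or not representative.
★ Model problems: model too simple (underfitting) or too complex (overfitting).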
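How to Evaluate a Model
Approach
1. How do we evaluate a model's performance? Test it directly in production?
2. How do we select a model?
3. Once a model is selected, how do we tune its hyperparameters?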
Dataset
Training Set and Test Set
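A minimal sketch of splitting a dataset into a training set and a test set, then using the test set to estimate the generalization error. The dataset is synthetic (`make_regression`), and the 80/20 split is a common convention assumed here, not a figure taken from the slides.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic dataset standing in for real data
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# Hold out 20% of the data; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"test MSE (estimate of the generalization error): {test_mse:.1f}")
```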
Model Selection
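Compare the models' test error (generalization error)
Hyperparameter Tuning
Try many hyperparameter combinations to find the one with the best generalization error
But performance in production is still poor → the model may have been tuned to the test set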
Holdout Validation
[Figure labels: Training Set, Validation Set (aka development set or dev set), Test Set]
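Holdout validation might be sketched as below: hyperparameters are compared on the validation set, the chosen model is retrained on the training and validation sets together, and the test set is used only once at the end. The synthetic data, the 60/20/20 split and the k-NN candidates are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=200, n_features=2, noise=15.0, random_state=0)

# First set the test set aside, then carve a validation set out of the rest
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42
)

# Compare hyperparameter candidates on the validation set only
candidates = [1, 3, 5, 9, 15]
val_errors = {}
for k in candidates:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    val_errors[k] = mean_squared_error(y_val, model.predict(X_val))
best_k = min(val_errors, key=val_errors.get)

# Retrain the winner on training + validation data, then test once
final_model = KNeighborsRegressor(n_neighbors=best_k).fit(X_train_full, y_train_full)
test_error = mean_squared_error(y_test, final_model.predict(X_test))
print(f"best k = {best_k}, test MSE = {test_error:.1f}")
```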
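Problems with the Validation Set
● Validation set too large → too little training data remains, so the candidate models do not reflect what training on the training and validation sets combined would achieve
● Validation set too small → model evaluations are imprecise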
n-fold cross-validation
[Figure labels: Training Set, Validation Set, Test Set]
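A short sketch of n-fold cross-validation with n = 5: the training data is split into five folds, and each fold serves once as the validation set while the model trains on the other four. The synthetic dataset and the choice of linear regression are assumptions for the illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# scikit-learn scores are "higher is better", hence the negated MSE
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)
fold_mse = -scores

print(fold_mse)          # one validation error per fold
print(fold_mse.mean())   # averaged estimate of the model's performance
```

Averaging over folds gives a more stable estimate than a single small validation set, at the cost of training the model several times.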
Topics Left Out
Common Supervised Learning Algorithms
★ K-Nearest Neighbors
★ Linear regression
★ Logistic regression
★ Support Vector Machines (SVMs)
★ Decision Tree
★ Random Forest
★ Neural Networks*
Common Unsupervised Learning Algorithms
★ Clustering
○ K-Means
○ DBSCAN
○ Hierarchical Cluster Analysis (HCA)
★ Anomaly detection and novelty detection
○ One-class SVM
○ Isolation Forest
★ Visualization and dimensionality reduction
○ Principal Component Analysis (PCA)
○ Kernel PCA
○ Locally Linear Embedding (LLE)
○ t-distributed Stochastic Neighbor Embedding (t-SNE)
★ Association rule learning
○ Apriori
○ Eclat
Train a linear model using Scikit-Learn
import numpy as np
import pandas as pd
import sklearn.linear_model
# import sklearn.neighbors  # only needed for the k-nearest neighbors alternative below

# Assumption: folder that holds the two CSV files; adjust to your setup
datapath = "datasets/lifesat/"

# Load the data
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv(datapath + "gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")

# Prepare the data
# prepare_country_stats() is a helper from the book's notebook (not shown on this slide)
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Select a linear model
model = sklearn.linear_model.LinearRegression()
# Instance-based alternative:
# model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new))  # outputs [[ 5.96242338]]
Data Mismatch
Training set and train-dev set: scraped from the web
Validation set and test set: taken with a phone
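The role of the train-dev set can be sketched as a simple decision rule: compare the error on the training set, the train-dev set (held out from the same source as the training data) and the validation set (from the target source). The function name `diagnose`, the `gap` threshold and the error values are all hypothetical, chosen only to illustrate the reasoning.

```python
def diagnose(train_error, train_dev_error, val_error, gap=0.05):
    """Rule-of-thumb diagnosis; `gap` is an arbitrary illustrative threshold."""
    # Same data source, yet much worse than on the training set:
    # the model does not generalize even to similar data
    if train_dev_error - train_error > gap:
        return "overfitting"
    # Fine on held-out data from the training source, poor on the target source:
    # the two sources differ
    if val_error - train_dev_error > gap:
        return "data mismatch"
    return "ok"

print(diagnose(0.02, 0.03, 0.20))  # good on train-dev, poor on validation
print(diagnose(0.02, 0.20, 0.22))  # already poor on train-dev
```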
No Free Lunch Theorem
Thank you

Hands-on ML - CH1