Techniques in
Deep Learning
Sourya Dey
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
The Big Picture
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
Activation Functions
Recall:
Linear
Non-Linearity
Non-linearity is required to approximate any arbitrary function
Squashing activations
Sigmoid
Hyperbolic Tangent
(a rescaled and shifted sigmoid: tanh(x) = 2σ(2x) − 1)
Vanishing gradients
Recall from BP (take example L=3)
Each sigmoid derivative factor is <= 0.25, so multiplying several of them in backprop makes the
gradients very small!!
ReLU family of activations
Activation: value for x >= 0 | value for x < 0
Rectified Linear Unit (ReLU): x | 0
Exponential Linear Unit (ELU): x | α(e^x − 1)
Leaky ReLU: x | αx
Biologically inspired - neurons firing vs not firing
Solves vanishing gradient problem
Non-differentiable at 0; in practice the derivative there is set to any value in [0,1]
ReLU can die if x<0
Leaky ReLU solves this, but inconsistent results
ELU saturates for x<0, which the ELU paper argues makes it more robust to noise
Clevert, Djork-Arné; Unterthiner, Thomas; Hochreiter, Sepp (2015-11-23). "Fast and
Accurate Deep Network Learning by Exponential Linear Units (ELUs)". arXiv:1511.07289
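The piecewise definitions above fit in a few lines of NumPy. This is a minimal sketch of my own (the function names and default α values are not from the slides):

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):             # alpha is a small fixed slope for x < 0
    return np.where(x >= 0, x, alpha * x)

def elu(x, alpha=1.0):                     # saturates to -alpha for very negative x
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1))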
Maxout networks - Generalization of ReLU
Normally:
For maxout:
Learns the activation function itself
Better approximation power
Takes more computation
Goodfellow, Ian J.; Warde-Farley, David; Mirza, Mehdi; Courville, Aaron; Bengio, Yoshua (2013). "Maxout Networks". JMLR Workshop and Conference Proceedings. 28 (3): 1319–1327.
Example of maxout
k = 2, N(l) = 5
ReLU is a special case of maxout with k=2 and 1 set of (W, b) = all 0s
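A minimal NumPy sketch of one maxout layer with k affine pieces (my own illustration, not the lecture's code); W has shape (k, N_out, N_in) and b has shape (k, N_out):

import numpy as np

def maxout(x, W, b):
    # z[i] = W[i] @ x + b[i] for each of the k affine pieces, then take the elementwise max
    z = np.einsum('kij,j->ki', W, x) + b     # shape (k, N_out)
    return z.max(axis=0)                     # shape (N_out,)

# Example: k = 2, N_in = 5, N_out = 3
rng = np.random.default_rng(0)
a = maxout(rng.normal(size=5), rng.normal(size=(2, 3, 5)), rng.normal(size=(2, 3)))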
Which activation to use?
Don’t use sigmoid
Use ReLU
If too many units are dead, try other activations
And watch out for new activations (or maybe invent a new one) - deep learning moves fast!
Output layer activation - Softmax
Network output is a
probability distribution!
Extending logistic regression to multiple classes
Compare to ideal
output probability
distribution
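A minimal NumPy sketch of the softmax output (my own illustration); subtracting the max is the usual numerical-stability trick and does not change the result:

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())    # subtract max for numerical stability
    return e / e.sum()         # outputs are positive and sum to 1: a probability distribution

print(softmax(np.array([2.0, 1.0, 0.1])))   # approx [0.66, 0.24, 0.10]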
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
Cross-entropy Cost
For binary labels, this reduces to:
(y: ground truth labels, a: network outputs)
Minimizing cross-entropy is the same as minimizing the KL-divergence between the probability distributions y and a
The one-hot case
The correct class r is 1, everything else is 0
Class 0 incorrect
Class 1 incorrect
Class N(L)-1 incorrect
Class r correct
Softmax and cross-entropy with one-hot labels
This makes beautiful sense as the error vector!
Recall:
Combining:
Example
But remember: We are interested
in cost as a function of W, b
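A small numerical sketch along the lines of the example above (the values are my own): softmax output, cross-entropy with a one-hot label, and the resulting output-layer error vector a − y:

import numpy as np

y = np.array([0.0, 1.0, 0.0])            # one-hot ground truth, correct class r = 1
s = np.array([2.0, 1.0, 0.1])            # output-layer pre-activations
a = np.exp(s - s.max()); a /= a.sum()    # softmax output
cost = -np.sum(y * np.log(a))            # cross-entropy; equals -log(a[r]) for one-hot y
delta = a - y                            # error vector at the output layer
print(cost, delta)                       # cost ~ 1.42, delta ~ [0.66, -0.76, 0.10]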
Quadratic Cost (MSE)
Mainly used for regression, or cases where
y and a are not probability distributions
Which cost function to use?
Use cross-entropy for classification tasks
It corresponds to Maximum Likelihood of a
categorical distribution
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
Level sets
Given
A level set is a set of points in the domain
at which function value is constant
Example: level sets for k = 0, 1, 4, 9
Level sets and gradient
The gradient (and its negative) are
always perpendicular to the level set
Recall: Gradient at any point is the
direction of maximum increase of
the function at that point.
Gradient cannot have a component
along the level set
Conditioning
Informally, conditioning measures
how non-circular the level sets are
Ill-conditioned: Much more sensitive to w2
Formally…
Hessian: Matrix of double derivatives
Given
If the second derivatives are continuous, the Hessian is symmetric
Conditioning in terms of Hessian
Conditioning is measured as the condition
number of the Hessian of the cost function
Examples:
𝜅 = 1 => Well-conditioned
𝜅 = 9 => Ill-conditioned
By the way, don’t expect actual cost Hessians to be so simple…
I’m using d for
eigenvalues
The Big Picture
What is an Optimizer?
A strategy or algorithm to reduce the cost
function and make the network learn
Eg: Gradient descent
Gradient descent: Well-conditioned
Well-conditioned level sets
lead to fast convergence
Gradient descent: Ill-conditioned
Ill-conditioned level sets
lead to slow convergence
Momentum
Equivalent to smoothing the update by a low pass filter
Normal update: Momentum update:
How much of past history should be applied, typically ~0.9
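A minimal sketch of the momentum update in NumPy (my own function names); v is the velocity that low-pass filters the gradients, and α ≈ 0.9 controls how much past history is kept:

import numpy as np

def momentum_step(w, grad, v, eta=0.01, alpha=0.9):
    v = alpha * v - eta * grad    # accumulate a decaying history of past gradients
    w = w + v                     # move along the smoothed direction
    return w, v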
Why Momentum?
Gradient descent with momentum
converges quickly even when ill-conditioned
Alternate way to apply momentum
Usual momentum update decouples α and η
Alternative update couples them:
Nesterov Momentum
Normal update: Momentum update: Nesterov Momentum update:
Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27, 372–376.
Correction factor applied to momentum
May give faster convergence
Pic courtesy: Geoffrey Hinton’s slides: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Adaptive Optimizers
Learning rate should be reduced with time. We don’t want to overshoot the
minimum point and oscillate.
Learning rate should be small for sensitive parameters, and vice-versa.
Momentum factor should be increased with time to get smoother updates.
Adaptive optimizers:
• Adagrad
• Adadelta
• RMSprop
• Adam
• Adamax
• Nadam
• Other ‘Ada’-sounding words you can think of…
https://www.youtube.com/watch?v=2lUFM8yTtUc
RMSprop
Scale each gradient by an exponentially
weighted moving average of its past history
Default ρ = 0.9
ϵ is just a small number to prevent division by
0. Can be machine epsilon
Hinton, G. Neural networks for machine learning. Coursera, video lectures, 2012.
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Sensitive parameters get low η
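A minimal sketch of one RMSprop step (my own function names), following the update described above with the default ρ = 0.9:

import numpy as np

def rmsprop_step(w, grad, r, eta=0.001, rho=0.9, eps=1e-8):
    r = rho * r + (1 - rho) * grad**2          # exponentially weighted average of squared gradients
    w = w - eta * grad / (np.sqrt(r) + eps)    # sensitive (large-gradient) parameters get a low effective eta
    return w, r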
Adam
RMSprop with momentum and bias correction
Momentum
Exponentially weighted
moving average of past history
At time step (t+1):
Bias corrections to make the moment estimates unbiased
Defaults:
η=0.001, ρ1 = 0.9, ρ2 = 0.999
D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learning Representations (ICLR), 2014.
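A minimal sketch of one Adam step (my own function names) using the defaults on the slide; t is the time step, starting at 1, used for the bias corrections:

import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, rho1=0.9, rho2=0.999, eps=1e-8):
    m = rho1 * m + (1 - rho1) * grad       # momentum: 1st moment estimate
    v = rho2 * v + (1 - rho2) * grad**2    # RMSprop-style 2nd moment estimate
    m_hat = m / (1 - rho1**t)              # bias corrections (important for small t)
    v_hat = v / (1 - rho2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v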
Which optimizer to use?
opt = SGD(lr=0.01, momentum=0.95, decay=1e-5, nesterov=False)
opt = Adam()
Machine learning vs Simple Optimization
Simple Optimization vs Machine Learning:
• Goal: Minimize f(x) | Minimize C(w) on test data
• Typical problem size: A few variables | A few million variables
• Approach: Gradient descent, 2nd order methods | Gradient descent on training data (2nd order methods not feasible)
• Stopping criterion: x* = argmin f(x) | ?? No access to test data
Machine learning is about generalizing well on test data
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
Regularization
Regularization is any modification
intended to reduce generalization
(test) cost, but not training cost
The problem of overfitting
Ctrain = 0.385
Ctest = 0.49
Ctrain = 0.012
Ctest = 0.019
Ctrain = 0
Ctest = 145475
Underfitting: Order = 1 Perfect: Order = 2 Overfitting: Order = 9
Bias-Variance Tradeoff
Say the (unknown) optimum value of some parameter is w*, but the value we get from training is w
For a given MSE, bias
trades off with variance
Reduce Bias: Make expected value of estimate close to true value
Reduce Variance: Estimate bounds should be tight
Bias-Variance Tradeoff
Pic courtesy: I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
Estimator not
good enough
Estimator too
sensitive
Minibatches
Estimate true gradient using
data average of sample
gradients of training data
Batch gradient descent
Best possible approximation to true gradient
Minibatch gradient descent
Noisy approximation to true gradient
A Note on Terminology
Batch size : Same as minibatch size M
Batch GD : Use all samples
Minibatch GD : 1 < M < Ntrain
Online learning : M = 1
Stochastic GD : 1 <= M < Ntrain
Basically the same as minibatch GD.
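A minimal sketch of how one epoch of minibatch GD draws its batches (my own illustration); the gradient for each batch is the average of the sample gradients in it:

import numpy as np

def minibatch_indices(n_train, M, rng=np.random.default_rng(0)):
    idx = rng.permutation(n_train)            # shuffle once per epoch
    for start in range(0, n_train, M):
        yield idx[start:start + M]            # indices of one minibatch

# for batch in minibatch_indices(50000, M=128):
#     compute the average gradient over the batch, then apply the optimizer update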
Optimizers - Why minibatches?
Batch GD gets to
local minimum
• Noise in minibatch GD may 'jerk' it out of local minima and get to the global minimum
• Regularizing effect
• Easier to handle M (~200) samples instead of Ntrain (~10^5)
Choosing a batch size
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo
Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He. "Accurate, Large Minibatch SGD:
Training ImageNet in 1 Hour". arXiv:1706.02677
M <= 8000 is fine
Dominic Masters, Carlo Luschi. "Revisiting Small Batch Training for Deep Neural Networks". arXiv:1804.07612
Recommend M = 16, 32, 64
But…
Both use the same dataset: Imagenet, with Ntrain > 1 million
So what batch size do I choose?
• Depends on the number of workers/threads in the CPU/GPU
• Powers of 2 work well
• Keras default is 32: Good for small datasets (Ntrain < 10000)
• For bigger datasets, I would recommend trying 128, 256, even 1024
• Your computer might slow down with M > 1000
No consensus
Concept of Penalty Function
Penalty measures dislike
Eg: Penalty = Yellow dress
Big penalty
I absolutely
hate this guy
Small penalty
I don’t like this guy
No penalty
These people are fine
Pic courtesy: https://www.123rf.com
L2 weight penalty (Ridge)
Regularized cost
Original unregularized
cost (like cross-entropy)
Regularization term
Update rule:
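A minimal sketch of the L2-regularized update (my own function names), assuming the penalty λ‖w‖² as on the slides, whose gradient is 2λw:

import numpy as np

def ridge_step(w, grad_C0, eta=0.01, lam=1e-4):
    # w <- w - eta * (dC0/dw + 2*lam*w): every step also shrinks w towards 0 ("weight decay")
    return w - eta * (grad_C0 + 2 * lam * w)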
Regularization hyperparameter effect
λ too small: reduce training cost only => Overfitting
λ too large: reduce weight norm only => Underfitting
Typically
Effect on Conditioning
Consider a 2nd order Taylor approximation of the unregularized
cost function C0 around its optimum point w*, i.e. ∇C0(w*) = 0
Adding the gradient of the
regularization term to get gradient
of the regularized cost function C
Effect on Conditioning
Set the gradient to 0 to find optimum w of the regularized cost
Eigendecomposition of H
D is the diagonal matrix
of eigenvalues
The component of w* along eigenvector i gets scaled by di / (di + 2λ), and the regularized Hessian H + 2λI has a smaller condition number => Improves conditioning
Effect on Weights
Level sets of C0
Level sets of norm penalty
Less important weights like w1 are
driven towards 0 by regularization
Pic courtesy: I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
L1 weight penalty (LASSO)
Regularized cost
Original unregularized
cost (like cross-entropy)
Regularization term
Update rule:
Sparsity
L1 promotes sparsity => Drives weights to 0
Pic courtesy: Found at https://datascience.stackexchange.com/questions/30237/does-l1-regularization-always-generate-a-sparse-solution
Eg: Some weight w = 0.1
L2: function value w² = 0.01 (small dislike); update ηλw = 0.1(ηλ) => leave it
L1: function value |w| = 0.1 (large dislike); update ηλ·sign(w) = ηλ => make it smaller
Other weight penalties
L0 : Count number of nonzero weights.
This is a great way to get sparsity, but not
differentiable and hard to analyze
Elastic Net - Combine L1 and L2 :
Reduce some weights, zero out some
Also see Group lasso: W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Proc. Advances in Neural Information Processing Systems 29 (NIPS), 2016, pp. 2074–2082.
Weight penalty as MAP
Original cost C0
Weight penalty
Weight penalty is our prior belief about the weights
Weight penalty as MAP
Early stopping
Validation metrics are used as a proxy for test metrics
Stop training if validation
metrics don’t get better
Pic courtesy: I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
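A minimal patience-based sketch of early stopping (my own illustration); train_one_epoch and validate are placeholders for whatever training and validation-metric functions you use:

import numpy as np

def train_with_early_stopping(train_one_epoch, validate, max_epochs=100, patience=5):
    best_val, best_epoch = np.inf, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_cost = validate()                        # validation metrics as a proxy for test metrics
        if val_cost < best_val:
            best_val, best_epoch = val_cost, epoch   # in practice, also checkpoint the weights here
        elif epoch - best_epoch >= patience:         # no improvement for `patience` epochs: stop
            break
    return best_val, best_epoch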
Dropout
For every minibatch during training, delete a
random selection of input and hidden nodes
Keep probability pk : Fraction of
units to keep (i.e. not drop)
But why dropout??
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014
Pic courtesy: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015
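A minimal sketch of the training-time dropout mask (my own illustration) with keep probability pk; as noted later in these slides, at inference all units are kept and the weights are multiplied by pk instead:

import numpy as np

def dropout(a, pk, rng=np.random.default_rng(0)):
    mask = rng.random(a.shape) < pk    # keep each unit with probability pk
    return a * mask                    # dropped units output 0 for this minibatch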
Ensemble Methods
Averaging the results of multiple
networks reduces errors
Let’s train an ensemble of k similar networks on the same problem and average their test results
(Random differences: Initialization, dataset shuffling)
Error of a single network:
Error of the ensemble:
But this process is expensive!
Return to dropout
Dropout is an ensemble method
using the same shared weights
=> Computationally cheaper
Improves robustness, since
individual nodes have to work
without depending on other
deleted nodes => Regularization
How much dropout?
Original paper recommendations: Input pk = 0.8, Hidden pk = 0.5, Conv layer pk = 0.7, Output pk = 1
0.5 gives maximum regularization effect: P. Baldi, P. J. Sadowski, “Understanding Dropout”, Proc. NIPS 2013
But you should try other values as well!
Drop input: Feature gone forever! Some people use input pk = 1
Drop hidden: Features still propagate through other hidden nodes
Drop output: Cannot classify!
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014
More on Dropout
• All nodes and weights are present in inference. Multiply weights by pk
during inference to compensate
• This assumes linearity, but still works!
• Dropout is like multiplicative 0-1 noise for each node
• Dropout is best for big networks
• Increase capacity, increase generalization power, but prevent overfitting
My Research: Pre-defined sparsity
Remove weights from
the very beginning
Train and test on a low
complexity network
Fully-connected network Pre-defined sparse network
S. Dey, K.-W. Huang, P. A. Beerel, and K. M. Chugg, “Pre-Defined Sparse Neural Networks with Hardware Acceleration,” submitted to the IEEE Journal on Emerging and Selected Topics in Circuits and Systems, Special Issue: Customized sub-systems and circuits for deep learning, 2018. Available online at arXiv:1812.01164.
How to design a good
connection pattern?
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
The Big Picture
Parameter Initialization
What is the initial value of p?
How about…. Same initialization, like all 0s
BAD
We want each weight to be useful on its own, not mirror other weights
Think of a company: do 2 people have exactly the same job?
Instead, initialize weights randomly
Glorot (Xavier) Normal Initialization
Imagine a linear function:
Say all w, x are IID:
For forward prop : Var(W(l)) = 1/N(l-1)
For backprop : Var(W(l)) = 1/N(l)
Compromise :
Despite a lot of assumptions, it works well!!
Xavier Glorot, Yoshua Bengio. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:249-256, 2010.
Why…
… should variances be equal?
… should we take harmonic
mean as compromise?
The original paper doesn’t
explicitly explain these
I think of it as similar to
vanishing and exploding
gradient issues.
Xavier Glorot, Yoshua Bengio. Proceedings of the Thirteenth International
Conference on Artificial Intelligence and Statistics, PMLR 9:249-256, 2010.
Glorot (Xavier) Uniform Initialization
Let
He Initialization
Glorot normal assumes linear functions
Actual activations like ReLU are non-linear
So…
See for details: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Proceedings of ICCV ’15, pp 1026-1034.
Bias Initialization and Choosing an Initializer
He initialization works slightly better than Glorot
Nothing to choose between Normal and Uniform
Biases not as important as weights
All zeros work OK
For ReLU, give a small positive value like 0.1
to biases to prevent neurons dying
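A minimal NumPy sketch of the Glorot and He normal initializers (my own function names) for a weight matrix of shape (N_out, N_in), plus the small positive ReLU bias suggested above:

import numpy as np

def glorot_normal(n_in, n_out, rng=np.random.default_rng(0)):
    # Var(W) = 2/(n_in + n_out): harmonic-mean compromise between 1/n_in and 1/n_out
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_normal(n_in, n_out, rng=np.random.default_rng(0)):
    # Var(W) = 2/n_in: accounts for ReLU zeroing out roughly half its inputs
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W = he_normal(784, 200)     # e.g. first junction of the MNIST [784,200,10] network
b = np.full(200, 0.1)       # small positive bias helps keep ReLU units alive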
Comparison of Initializers
Weights
initialized
with all 0s
Weights
stay as
all 0s !!
Histograms of a few weights in 2nd
junction after training for 10 epochs
MNIST [784,200,10]
Regularization: None
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
Normalization
Imagine trying to predict how good a footballer is…
Feature Units Range
Height Meters 1.5 to 2
Weight Kilograms 50 to 100
Shot speed Kmph 120 to 180
Shot curve Degrees 0 to 10
Age Years 20 to 35
Minutes played Minutes 5,000 to 20,000
Fake diving? -- Yes / No
Different features have very different scales
Input normalization (Preprocessing)
Say there are n total training samples, each with f features
Eg for MNIST: n=50000, f=784
Compute stats for each feature
across all training samples
μi : Mean
σi : Standard deviation
Mi : Maximum value
mi : Minimum value
for all i from 1 to f
Essential preprocessing
Gaussian normalization:
Each feature is a unit Gaussian
Minmax normalization:
Each feature is in [0,1]
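A minimal sketch of the two normalizations (my own function names); the per-feature stats μ, σ, m, M are computed on the training set only and then reused on validation and test data:

import numpy as np

def gaussian_normalize(X, mu, sigma):
    return (X - mu) / (sigma + 1e-8)       # each feature becomes roughly zero-mean, unit-variance

def minmax_normalize(X, m, M):
    return (X - m) / (M - m + 1e-8)        # each feature lands in [0,1] on the training set

# X_train has shape (n, f); compute stats on training data only:
# mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
# m, M = X_train.min(axis=0), X_train.max(axis=0)
# then apply the same stats to validation and test data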
Feature dependency and scaling
We want to make features independent (why? think dropout)
We want to make the feature covariance matrix diagonal --> Whitening
In fact, we want to make it the identity, so that all features have the same scale
Zero Components Analysis (ZCA) Whitening
Then…
Before and after ZCA whitening
Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009
Dimensionality Reduction
Inputs can have a LOT of features
Eg: MNIST has 784 features, but we
probably don’t need the corner pixels
Simplify NN architecture by keeping only the essential input features
Principal Component Analysis (PCA)
Keep features with maximum variance
=> features which change the most
Keep the top f’ < f components (eigenvalues assumed sorted in decreasing order)
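A minimal NumPy sketch of PCA (my own illustration): eigendecompose the training covariance matrix and keep the f’ directions of largest variance:

import numpy as np

def pca(X_train, f_reduced):
    Xc = X_train - X_train.mean(axis=0)          # center the features
    cov = Xc.T @ Xc / (len(Xc) - 1)              # (f, f) covariance matrix
    d, U = np.linalg.eigh(cov)                   # eigenvalues come out in ascending order
    top = np.argsort(d)[::-1][:f_reduced]        # indices of the f' largest eigenvalues
    return Xc @ U[:, top], U[:, top]             # reduced features and the projection matrix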
Linear Discriminant Analysis (LDA)
Decrease variance within a class,
increase variance between classes
LDA keeps features which can
discriminate well between classes
Linear Discriminant Analysis (LDA)
C: Number of classes
Nc: Number of samples in class c
μc: Feature mean of class c
Intra (within)
class scatter
Minimize
Inter (between)
class scatter
Maximize
PCA vs LDA
PCA is unsupervised - just looks at data
LDA is supervised, looks at classes
Remember…
All statistics should be computed on training data only
Then apply them to validation and test data
Example:
Global Contrast Normalization (GCN)
Normalize the contrast (standard deviation of pixels) of each image, one at a time
Different from other normalizations which
operate on all training images together
Batch normalization
Input normalization scales the input features. But what about internal
activations? Those are also input features to the layers after them
C. Szegedy, W. Liu, Y. Jia et al., “Going deeper with convolutions,” in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
Internal activations get shifted to crazy values in deep networks
Batch normalization
Re-normalize internal values using a minibatch,
then apply some trainable (μ,σ) = (β, 𝛾) to them
x can be any value inside any layer of the network
Normalize
ϵ is just a small number to prevent division by 0. Can be machine epsilon
Scale the normalized values and substitute
Sergey Ioffe, Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. Proc. 32nd ICML, 37, pp 448-456, 2015.
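A minimal sketch of the batch-norm forward pass for one minibatch (my own function names); γ and β are the trainable scale and shift:

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-8):
    # x: minibatch of values at some point in the network, shape (M, features)
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize over the minibatch
    return gamma * x_hat + beta              # scale and shift with trainable parameters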
Where to apply Batch Normalization?
Option 1: Before activation (at s)
Option 2: After activation (at a)
(pre-activation s --> h(.) --> activation a)
Original paper recommends before activation.
This means scaling the input to ReLU, then ReLU can
do its job by discarding anything <=0.
Scaling after ReLU doesn’t make sense for β since it
can be absorbed by the parameters of the next layer.
But… People still argue over this. I have done
both and saw almost no difference.
More notes on batch normalization
• BN prevents internal covariate shift, i.e. each layer sees normalized inputs in a desired range instead of some crazy shifted range due to previous layers
• BN also has a slight regularizing effect since each layer depends less on previous layers, so its weights get smaller (kinda like what happens in dropout)
• Computing μ and σ per minibatch may not be applicable during validation or testing, since there is no proper notion of batch size there. Instead, keep exponentially weighted moving averages of μ and σ as new samples come in during training, and use those at inference.
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
Data handling
• Collection
• Contamination
• Public datasets
• Synthetic data
• Augmentation
Collection and labeling
• Collected from real world sources - internet databases, surveys,
paying people, public records, hospitals, etc…
• Labeled manually, usually crowd-sourced (Mechanical Turk)
Examples from Imagenet
http://www.image-net.org
Contamination
HW1: Human vs Computer, Binary
Did anyone cheat by using numpy? ;-)
Ground truth
Predicted
Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015
Sometimes ground truth
labels can be unbelievable!
Neural Networks are Lazy
Train
For the NN, green background = cat
Test
Network output: Cat
Missing Entries
Example: Adult Dataset
https://archive.ics.uci.edu/ml/datasets/adult
Features:
• Age
• Working class
• Education
• Marital status
• Occupation
• Race
• Sex
• Capital gain
• Hours per week
• Native country
Label: Income level
Missing at Random: Remove the sample
Missing not at Random: Removing sample may produce bias
Example: People with a low education level may not reveal it
Other ways:
Fill missing numerical values with mean or median. Eg: Age = 40
Fill missing categorical values with mode. Eg: Education = College
Small Dataset Size
Example: Wisconsin Breast Cancer Dataset
https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
Features (image
of breast cells):
• Radius
• Perimeter
• Smoothness
• Compactness
• Symmetry
Labels:
• Malignant
• Benign
Original dataset: 699 samples with missing values
Modified dataset: 569 samples
Such small datasets are generally not suited for neural networks due to overfitting
Commonly used Image Datasets
• MNIST (MLP / CNN):
• 28x28 images, 10 classes
• Initial benchmark
• If you’re writing papers, don’t just do MNIST!
• SOTA testacc: >99%
• Harder Variation: Fashion MNIST
• CIFAR-10, -100 (CNN):
• 32x32x3 images (RGB), 10 or 100 classes
• Widely used benchmark
• SOTA testacc: ~97%, ~84%
Commonly used Image Datasets
• Imagenet (CNN):
• Overall database has >14M images
• ILSVRC challenge has >1M images, 224x224x3, 1000 classes
• Widely used benchmark, don’t attempt without enough computing power!
• Alexnet top-1, top-5 testacc: 62.5%, 83% (not SOTA, but very famous!)
• Simpler variation: Tiny Imagenet
• Microsoft Coco (CNN / Encoder-Decoder):
• 330k images, 80 categories
• Segmentation, object detection
• Street View House Numbers (CNN / MLP):
• Think MNIST, but more real-world, harder, bigger
• SOTA testacc > 98%
Other Commonly used Datasets
• TIMIT (Preprocessing + MLP + Decoder):
• Automatic Speech recognition (Guest lecture coming up on 2/27)
• Speaker and phoneme classification
• Reuters (MLP):
• Classify articles into news categories
• Multiple / tree structured labels
• IMDB Reviews:
• Natural language processing
• YouTube-8M:
• Annotate videos
Source for Datasets
• Kaggle: https://www.kaggle.com/datasets
• UCI ML repository: https://archive.ics.uci.edu/ml/datasets.html
• Google: https://toolbox.google.com/datasetsearch
• Amazon Web Services: https://registry.opendata.aws/
Synthetic Datasets
Data that is artificially manufactured (using algorithms),
rather than generated by real-world events
• Tunable algorithms to mimic real-world as required
• Cheap to produce large amounts of data
• We created data on Morse code symbols with tunable classification difficulty:
https://github.com/usc-hal/morse-dataset
• Can be used to augment real-world data
S. Dey, K. M. Chugg, and P. A. Beerel, “Morse code datasets for machine learning,” in Proc. 9th Int. Conf.
Computing, Communication and Networking Technologies (ICCCNT), Jul 2018, pp. 1–7. Won Best Paper Award.
Data Augmentation
Training more on the same limited data results in overfitting
The solution: Create more data!
Eg: More MNIST images can simulate more
different ways of writing digits
If the network sees more, it can learn more
Data Augmentation Examples for Images
Original
Cropped
Flip Top-Bottom
Do NOT do for digits
Flip Left-Right
Transpose
Rotate 30°
Elastic
Transform
Simard, Steinkraus and Platt, "Best Practices
for Convolutional Neural Networks applied to
Visual Document Analysis", in Proc. Int. Conf.
on Document Analysis and Recognition, 2003.
Data Augmentation in Python
>> pip install Pillow
from PIL import Image
# x is a flattened 28x28 MNIST image with pixel values in [0,1]
im = Image.fromarray((x.reshape(28,28)*255).astype('uint8'), mode='L')
im.show()
im.transpose(Image.FLIP_LEFT_RIGHT)   # mirror left-right
im.transpose(Image.FLIP_TOP_BOTTOM)   # mirror top-bottom (do NOT do for digits)
im.transpose(Image.TRANSPOSE)         # swap rows and columns
im.rotate(30)                         # rotate 30 degrees
im.crop(box=(5,5,23,23))              # (left, top, right, bottom)
# each call returns a new Image object; assign the result to keep the augmented copy
I use the Pillow Imaging library
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
What is a Hyperparameter?
Anything which the network does NOT learn,
i.e. its value is adjusted by the user
Continuous Examples:
• Learning rate η
• Momentum α
• Regularization λ
Discrete Examples:
• What optimizer: SGD, Adam, RMSprop, …?
• What regularization: L2, L1, both, …?
• What initialization: Glorot, He, normal, uniform, …?
• Batch size M: 32, 64, 128, …?
• Number of MLP hidden layers: 1,2,3,…?
• How many neurons in each layer?
• What kinds of data augmentation?
Using data properly
Training Data
Feedforward +
Backprop + Update
Validation Data
Feedforward
Compute Metrics
Set parameters
Set hyperparameters
Test Data
Feedforward
Compute and report
Final Metrics
Do NOT use test data
until all parameters
and hyperparameters
are finalized
Complete training data
Cross-Validation
• Divide the dataset into k parts.
• Use all parts except i for training, then test on part i.
• Repeat for all i from 1 to k.
• Average all k test metrics to get final test metric.
Useful when dataset is small
Generally not needed for NN problems
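A minimal sketch of k-fold cross-validation (my own illustration); train_and_eval is a placeholder that trains on the given training indices and returns the metric on the held-out fold:

import numpy as np

def k_fold_metric(n_samples, k, train_and_eval, rng=np.random.default_rng(0)):
    folds = np.array_split(rng.permutation(n_samples), k)
    metrics = []
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        metrics.append(train_and_eval(train_idx, folds[i]))   # train on the other k-1 parts, test on part i
    return np.mean(metrics)                                   # final metric = average over folds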
Hyperparameter Search Strategies
Pic courtesy: J. Bergstra, Y. Bengio, “Random Search for Hyperparameter
Optimization,” Journal of Machine Learning Research, vol. 13, pp. 281–305, 2012
Grid search for (η,α):
(0.1,0.5), (0.1,0.9), (0.1,0.99),
(0.01,0.5), (0.01,0.9), (0.01,0.99),
(0.001,0.5), (0.001,0.9), (0.001,0.99)
Random search for (η,α):
(0.07,0.68), (0.002,0.94), (0.008,0.99),
(0.08,0.88), (0.005,0.62), (0.09,0.75),
(0.14,0.81), (0.006,0.76), (0.01,0.8)
Random search samples more values
of the important hyperparameter
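A minimal sketch of random search over (η, α) (the sampling ranges are my own): draw each trial independently, with η log-uniform and α concentrated near 1:

import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparams():
    eta = 10 ** rng.uniform(-3, -1)            # learning rate, log-uniform in [0.001, 0.1]
    alpha = 1 - 10 ** rng.uniform(-2, -0.3)    # momentum roughly in [0.5, 0.99]
    return eta, alpha

trials = [sample_hyperparams() for _ in range(9)]
# train a model for each (eta, alpha) and keep the one with the best validation metric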
Hyperparameter Optimization Algorithm: Tree-Structured Parzen Estimator (TPE)
1. Decide initial probability distribution for each hyperparameter
• Example: η ~ logUniform(10^-4, 1); λ ~ logUniform(10^-6, 10^-3)
2. Set some performance threshold
• Example: MNIST validation accuracy after 10 epochs = 96%
3. Sample multiple hyp_sets from the distributions
4. Train, and compute validation metrics for each hyp_set
5. Update distribution and threshold:
• If performance of a hyp_set is worse than threshold: Discard
• If performance of a hyp_set is better than threshold: Change distribution accordingly
6. Repeat 3-5
J. Bergstra, R. Bardenet, Y. Bengio, B. Kegl, “Algorithms for Hyper-Parameter Optimization”, Proc. NIPS, pp 2546-2554, 2011.
Potential project topic
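One third-party implementation of TPE is the hyperopt package. This is a minimal sketch assuming hyperopt is installed and that train_and_validate is a hypothetical helper returning a quantity to minimize (e.g. 1 − validation accuracy):

import numpy as np
from hyperopt import fmin, tpe, hp

space = {'eta': hp.loguniform('eta', np.log(1e-4), np.log(1.0)),
         'lam': hp.loguniform('lam', np.log(1e-6), np.log(1e-3))}

def objective(hyp):
    return train_and_validate(eta=hyp['eta'], lam=hyp['lam'])   # hypothetical training helper

best = fmin(objective, space, algo=tpe.suggest, max_evals=50)   # TPE proposes the next hyp_sets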
Learning rate schedules
When using simple SGD, start with a big learning
rate, then decrease it for smoother convergence
Pic courtesy: http://prog3.com/sbdm/blog/google19890102/article/details/50276775
Learning rate schedules
Exponential decay
Step decay
Fractional decay (used in Keras)
(Plot: initial η decaying over epochs, shown on linear and log scales)
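Minimal sketches of the three decay schedules (my own function names); the fractional form is the 1/(1 + decay·t) rule implemented by Keras' SGD decay argument:

import numpy as np

def exponential_decay(eta0, k, epoch):
    return eta0 * np.exp(-k * epoch)

def step_decay(eta0, drop, epochs_per_drop, epoch):
    return eta0 * drop ** (epoch // epochs_per_drop)     # e.g. halve eta every few epochs

def fractional_decay(eta0, decay, iteration):
    return eta0 / (1 + decay * iteration)                # the form used by Keras' `decay` argument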
Other Learning rate strategies (Potential project topic)
Warmup: Increase η for a few epochs at the
beginning. Works well for large batch sizes.
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz
Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He. "Accurate,
Large Minibatch SGD: Training ImageNet in 1 Hour". arXiv:1706.02677
Triangular Scheduling
L. N. Smith, “Cyclical Learning Rates for Training Neural
Networks”, arXiv:1506.01186
Pic courtesy https://www.jeremyjordan.me/nn-learning-rate/
Cosine Scheduling
I. Loshchilov, F. Hutter, “SGDR: Stochastic Gradient Descent
with Warm Restarts”, Proc. ICLR 2017.
That’s all
Thanks!
