Techniques in
Deep Learning
Sourya Dey
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
The Big Picture
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
Activation Functions
Recall:
Linear
Non-Linearity
Non-linearity is required to approximate any arbitrary function
Squashing activations
Sigmoid
Hyperbolic Tangent
(a rescaled and shifted sigmoid: tanh(x) = 2σ(2x) − 1)
Vanishing gradients
Recall from BP (take example L=3)
Each sigmoid derivative factor is <= 0.25, so multiplying several of them in backprop makes the
gradients very small!!
ReLU family of activations
Activation: value for x >= 0 | value for x < 0
Rectified Linear Unit (ReLU): x | 0
Exponential Linear Unit (ELU): x | α(e^x − 1)
Leaky ReLU: x | αx
Biologically inspired - neurons firing vs not firing
Solves vanishing gradient problem
Non-differentiable at 0; in practice the derivative there is set to any value in [0,1]
ReLU can die if x<0
Leaky ReLU solves this, but inconsistent results
ELU saturates for x<0, which the ELU paper argues makes it more robust to noise
Clevert, Djork-Arné; Unterthiner, Thomas; Hochreiter, Sepp (2015-11-23). "Fast and
Accurate Deep Network Learning by Exponential Linear Units (ELUs)". arXiv:1511.07289
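The piecewise definitions above fit in a few lines of NumPy. This is a minimal sketch of my own (the function names and default α values are not from the slides):

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):             # alpha is a small fixed slope for x < 0
    return np.where(x >= 0, x, alpha * x)

def elu(x, alpha=1.0):                     # saturates to -alpha for very negative x
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1))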
Maxout networks - Generalization of ReLU
Normally:
For maxout:
Learns the activation function itself
Better approximation power
Takes more computation
Goodfellow, Ian J.; Warde-Farley, David; Mirza, Mehdi; Courville, Aaron; Bengio, Yoshua (2013). "Maxout Networks". JMLR Workshop and Conference Proceedings. 28 (3): 1319–1327.
Example of maxout
k = 2, N(l) = 5
ReLU is a special case of maxout with k=2 and 1 set of (W, b) = all 0s
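A minimal NumPy sketch of one maxout layer with k affine pieces (my own illustration, not the lecture's code); W has shape (k, N_out, N_in) and b has shape (k, N_out):

import numpy as np

def maxout(x, W, b):
    # z[i] = W[i] @ x + b[i] for each of the k affine pieces, then take the elementwise max
    z = np.einsum('kij,j->ki', W, x) + b     # shape (k, N_out)
    return z.max(axis=0)                     # shape (N_out,)

# Example: k = 2, N_in = 5, N_out = 3
rng = np.random.default_rng(0)
a = maxout(rng.normal(size=5), rng.normal(size=(2, 3, 5)), rng.normal(size=(2, 3)))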
Which activation to use?
Don’t use sigmoid
Use ReLU
If too many units are dead, try other activations
And watch out for new activations (or maybe invent a new one) - deep learning moves fast!
Output layer activation - Softmax
Network output is a
probability distribution!
Extending logistic regression to multiple classes
Compare to ideal
output probability
distribution
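A minimal NumPy sketch of the softmax output (my own illustration); subtracting the max is the usual numerical-stability trick and does not change the result:

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())    # subtract max for numerical stability
    return e / e.sum()         # outputs are positive and sum to 1: a probability distribution

print(softmax(np.array([2.0, 1.0, 0.1])))   # approx [0.66, 0.24, 0.10]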
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
Cross-entropy Cost
For binary labels, this reduces to:
(y: ground truth labels, a: network outputs)
Minimizing cross-entropy is the same as minimizing the KL-divergence between the probability distributions y and a
The one-hot case
The correct class r is 1, everything else is 0
Class 0 incorrect
Class 1 incorrect
Class N(L)-1 incorrect
Class r correct
Softmax and cross-entropy with one-hot labels
This makes beautiful sense as the error vector!
Recall:
Combining:
Example
But remember: We are interested
in cost as a function of W, b
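A small numerical sketch along the lines of the example above (the values are my own): softmax output, cross-entropy with a one-hot label, and the resulting output-layer error vector a − y:

import numpy as np

y = np.array([0.0, 1.0, 0.0])            # one-hot ground truth, correct class r = 1
s = np.array([2.0, 1.0, 0.1])            # output-layer pre-activations
a = np.exp(s - s.max()); a /= a.sum()    # softmax output
cost = -np.sum(y * np.log(a))            # cross-entropy; equals -log(a[r]) for one-hot y
delta = a - y                            # error vector at the output layer
print(cost, delta)                       # cost ~ 1.42, delta ~ [0.66, -0.76, 0.10]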
Quadratic Cost (MSE)
Mainly used for regression, or cases where
y and a are not probability distributions
Which cost function to use?
Use cross-entropy for classification tasks
It corresponds to Maximum Likelihood of a
categorical distribution
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
Level sets
Given
A level set is a set of points in the domain
at which function value is constant
Example: level sets for k = 0, 1, 4, 9
Level sets and gradient
The gradient (and its negative) are
always perpendicular to the level set
Recall: Gradient at any point is the
direction of maximum increase of
the function at that point.
Gradient cannot have a component
along the level set
Conditioning
Informally, conditioning measures
how non-circular the level sets are
Ill-conditioned: Much more sensitive to w2
Formally…
Hessian: Matrix of double derivatives
Given
If the second derivatives are continuous, the Hessian is symmetric
Conditioning in terms of Hessian
Conditioning is measured as the condition
number of the Hessian of the cost function
Examples:
𝜅 = 1 => Well-conditioned
𝜅 = 9 => Ill-conditioned
By the way, don’t expect actual cost Hessians to be so simple…
I’m using d for
eigenvalues
The Big Picture
What is an Optimizer?
A strategy or algorithm to reduce the cost
function and make the network learn
Eg: Gradient descent
Gradient descent: Well-conditioned
Well-conditioned level sets
lead to fast convergence
Gradient descent: Ill-conditioned
Ill-conditioned level sets
lead to slow convergence
Momentum
Equivalent to smoothing the update by a low pass filter
Normal update: Momentum update:
How much of past history should be applied, typically ~0.9
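A minimal sketch of the momentum update in NumPy (my own function names); v is the velocity that low-pass filters the gradients, and α ≈ 0.9 controls how much past history is kept:

import numpy as np

def momentum_step(w, grad, v, eta=0.01, alpha=0.9):
    v = alpha * v - eta * grad    # accumulate a decaying history of past gradients
    w = w + v                     # move along the smoothed direction
    return w, v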
Why Momentum?
Gradient descent with momentum
converges quickly even when ill-conditioned
Alternate way to apply momentum
Usual momentum update decouples α and η
Alternative update couples them:
Nesterov Momentum
Normal update: Momentum update: Nesterov Momentum update:
Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27, 372–376.
Correction factor applied to momentum
May give faster convergence
Pic courtesy: Geoffrey Hinton’s slides: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Adaptive Optimizers
Learning rate should be reduced with time. We don’t want to overshoot the
minimum point and oscillate.
Learning rate should be small for sensitive parameters, and vice-versa.
Momentum factor should be increased with time to get smoother updates.
Adaptive optimizers:
• Adagrad
• Adadelta
• RMSprop
• Adam
• Adamax
• Nadam
• Other ‘Ada’-sounding words you can think of…
https://www.youtube.com/watch?v=2lUFM8yTtUc
RMSprop
Scale each gradient by an exponentially
weighted moving average of its past history
Default ρ = 0.9
ϵ is just a small number to prevent division by
0. Can be machine epsilon
Hinton, G. Neural networks for machine learning. Coursera, video lectures, 2012.
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Sensitive parameters get low η
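A minimal sketch of one RMSprop step (my own function names), following the update described above with the default ρ = 0.9:

import numpy as np

def rmsprop_step(w, grad, r, eta=0.001, rho=0.9, eps=1e-8):
    r = rho * r + (1 - rho) * grad**2          # exponentially weighted average of squared gradients
    w = w - eta * grad / (np.sqrt(r) + eps)    # sensitive (large-gradient) parameters get a low effective eta
    return w, r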
Adam
RMSprop with momentum and bias correction
Momentum
Exponentially weighted
moving average of past history
At time step (t+1):
Bias corrections to make the moment estimates unbiased
Defaults:
η=0.001, ρ1 = 0.9, ρ2 = 0.999
D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learning Representations (ICLR), 2014.
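A minimal sketch of one Adam step (my own function names) using the defaults on the slide; t is the time step, starting at 1, used for the bias corrections:

import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, rho1=0.9, rho2=0.999, eps=1e-8):
    m = rho1 * m + (1 - rho1) * grad       # momentum: 1st moment estimate
    v = rho2 * v + (1 - rho2) * grad**2    # RMSprop-style 2nd moment estimate
    m_hat = m / (1 - rho1**t)              # bias corrections (important for small t)
    v_hat = v / (1 - rho2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v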
Which optimizer to use?
opt = SGD(lr=0.01, momentum=0.95, decay=1e-5, nesterov=False)
opt = Adam()
Machine learning vs Simple Optimization
Simple Optimization vs Machine Learning:
• Goal: Minimize f(x) | Minimize C(w) on test data
• Typical problem size: A few variables | A few million variables
• Approach: Gradient descent, 2nd order methods | Gradient descent on training data (2nd order methods not feasible)
• Stopping criterion: x* = argmin f(x) | ?? No access to test data
Machine learning is about generalizing well on test data
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
Regularization
Regularization is any modification
intended to reduce generalization
(test) cost, but not training cost
The problem of overfitting
Ctrain = 0.385
Ctest = 0.49
Ctrain = 0.012
Ctest = 0.019
Ctrain = 0
Ctest = 145475
Underfitting: Order = 1 Perfect: Order = 2 Overfitting: Order = 9
Bias-Variance Tradeoff
Say the (unknown) optimum value of some parameter is w*, but the value we get from training is w
For a given MSE, bias
trades off with variance
Reduce Bias: Make expected value of estimate close to true value
Reduce Variance: Estimate bounds should be tight
Bias-Variance Tradeoff
Pic courtesy: I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
Estimator not
good enough
Estimator too
sensitive
Minibatches
Estimate true gradient using
data average of sample
gradients of training data
Batch gradient descent
Best possible approximation to true gradient
Minibatch gradient descent
Noisy approximation to true gradient
A Note on Terminology
Batch size : Same as minibatch size M
Batch GD : Use all samples
Minibatch GD : 1 < M < Ntrain
Online learning : M = 1
Stochastic GD : 1 <= M < Ntrain
Basically the same as minibatch GD.
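A minimal sketch of how one epoch of minibatch GD draws its batches (my own illustration); the gradient for each batch is the average of the sample gradients in it:

import numpy as np

def minibatch_indices(n_train, M, rng=np.random.default_rng(0)):
    idx = rng.permutation(n_train)            # shuffle once per epoch
    for start in range(0, n_train, M):
        yield idx[start:start + M]            # indices of one minibatch

# for batch in minibatch_indices(50000, M=128):
#     compute the average gradient over the batch, then apply the optimizer update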
Optimizers - Why minibatches?
Batch GD gets to
local minimum
• Noise in minibatch GD may 'jerk' it out of local minima and get to the global minimum
• Regularizing effect
• Easier to handle M (~200) samples instead of Ntrain (~10^5)
Choosing a batch size
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo
Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He. "Accurate, Large Minibatch SGD:
Training ImageNet in 1 Hour". arXiv:1706.02677
M <= 8000 is fine
Dominic Masters, Carlo Luschi. "Revisiting Small Batch Training for Deep Neural Networks". arXiv:1804.07612
Recommend M = 16, 32, 64
But…
Both use the same dataset: Imagenet, with Ntrain > 1 million
So what batch size do I choose?
• Depends on the number of workers/threads in the CPU/GPU
• Powers of 2 work well
• Keras default is 32: Good for small datasets (Ntrain < 10000)
• For bigger datasets, I would recommend trying 128, 256, even 1024
• Your computer might slow down with M > 1000
No consensus
Concept of Penalty Function
Penalty measures dislike
Eg: Penalty = Yellow dress
Big penalty
I absolutely
hate this guy
Small penalty
I don’t like this guy
No penalty
These people are fine
Pic courtesy: https://www.123rf.com
L2 weight penalty (Ridge)
Regularized cost
Original unregularized
cost (like cross-entropy)
Regularization term
Update rule:
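A minimal sketch of the L2-regularized update (my own function names), assuming the penalty λ‖w‖² as on the slides, whose gradient is 2λw:

import numpy as np

def ridge_step(w, grad_C0, eta=0.01, lam=1e-4):
    # w <- w - eta * (dC0/dw + 2*lam*w): every step also shrinks w towards 0 ("weight decay")
    return w - eta * (grad_C0 + 2 * lam * w)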
Regularization hyperparameter effect
λ too small: reduce training cost only => Overfitting
λ too large: reduce weight norm only => Underfitting
Typically
Effect on Conditioning
Consider a 2nd order Taylor approximation of the unregularized
cost function C0 around its optimum point w*, i.e. ∇C0(w*) = 0
Adding the gradient of the
regularization term to get gradient
of the regularized cost function C
Effect on Conditioning
Set the gradient to 0 to find optimum w of the regularized cost
Eigendecomposition of H
D is the diagonal matrix
of eigenvalues
The component of w* along eigenvector i gets scaled by di / (di + 2λ), and the regularized Hessian H + 2λI has a smaller condition number => Improves conditioning
Effect on Weights
Level sets of C0
Level sets of norm penalty
Less important weights like w1 are
driven towards 0 by regularization
Pic courtesy: I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
L1 weight penalty (LASSO)
Regularized cost
Original unregularized
cost (like cross-entropy)
Regularization term
Update rule:
Sparsity
L1 promotes sparsity => Drives weights to 0
Pic courtesy: Found at https://datascience.stackexchange.com/questions/30237/does-l1-regularization-always-generate-a-sparse-solution
Eg: Some weight w = 0.1
L2: function value w² = 0.01 (small dislike); update ηλw = 0.1(ηλ) => leave it
L1: function value |w| = 0.1 (large dislike); update ηλ·sign(w) = ηλ => make it smaller
Other weight penalties
L0 : Count number of nonzero weights.
This is a great way to get sparsity, but not
differentiable and hard to analyze
Elastic Net - Combine L1 and L2 :
Reduce some weights, zero out some
Also see Group lasso: W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Proc. Advances in Neural Information Processing Systems 29 (NIPS), 2016, pp. 2074–2082.
Weight penalty as MAP
Original cost C0
Weight penalty
Weight penalty is our prior belief about the weights
Weight penalty as MAP
Early stopping
Validation metrics are used as a proxy for test metrics
Stop training if validation
metrics don’t get better
Pic courtesy: I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
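A minimal patience-based sketch of early stopping (my own illustration); train_one_epoch and validate are placeholders for whatever training and validation-metric functions you use:

import numpy as np

def train_with_early_stopping(train_one_epoch, validate, max_epochs=100, patience=5):
    best_val, best_epoch = np.inf, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_cost = validate()                        # validation metrics as a proxy for test metrics
        if val_cost < best_val:
            best_val, best_epoch = val_cost, epoch   # in practice, also checkpoint the weights here
        elif epoch - best_epoch >= patience:         # no improvement for `patience` epochs: stop
            break
    return best_val, best_epoch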
Dropout
For every minibatch during training, delete a
random selection of input and hidden nodes
Keep probability pk : Fraction of
units to keep (i.e. not drop)
But why dropout??
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014
Pic courtesy: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015
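A minimal sketch of the training-time dropout mask (my own illustration) with keep probability pk; as noted later in these slides, at inference all units are kept and the weights are multiplied by pk instead:

import numpy as np

def dropout(a, pk, rng=np.random.default_rng(0)):
    mask = rng.random(a.shape) < pk    # keep each unit with probability pk
    return a * mask                    # dropped units output 0 for this minibatch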
Ensemble Methods
Averaging the results of multiple
networks reduces errors
Let’s train an ensemble of k similar networks on the same problem and average their test results
(Random differences: Initialization, dataset shuffling)
Error of a single network:
Error of the ensemble:
But this process is expensive!
Return to dropout
Dropout is an ensemble method
using the same shared weights
=> Computationally cheaper
Improves robustness, since
individual nodes have to work
without depending on other
deleted nodes => Regularization
How much dropout?
Original paper recommendations: Input pk = 0.8, Hidden pk = 0.5, Conv layer pk = 0.7, Output pk = 1
0.5 gives maximum regularization effect: P. Baldi, P. J. Sadowski, “Understanding Dropout”, Proc. NIPS 2013
But you should try other values as well!
Drop input: Feature gone forever! Some people use input pk = 1
Drop hidden: Features still propagate through other hidden nodes
Drop output: Cannot classify!
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014
More on Dropout
• All nodes and weights are present in inference. Multiply weights by pk
during inference to compensate
• This assumes linearity, but still works!
• Dropout is like multiplicative 0-1 noise for each node
• Dropout is best for big networks
• Increase capacity, increase generalization power, but prevent overfitting
My Research: Pre-defined sparsity
Remove weights from
the very beginning
Train and test on a low
complexity network
Fully-connected network Pre-defined sparse network
S. Dey, K.-W. Huang, P. A. Beerel, and K. M. Chugg, “Pre-Defined Sparse Neural Networks with Hardware Acceleration,” submitted to the IEEE Journal on Emerging and Selected Topics in Circuits and Systems, Special Issue: Customized sub-systems and circuits for deep learning, 2018. Available online at arXiv:1812.01164.
How to design a good
connection pattern?
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
The Big Picture
Parameter Initialization
What is the initial value of p?
How about…. Same initialization, like all 0s
BAD
We want each weight to be useful on its own, not mirror other weights
Think of a company: do 2 people have exactly the same job?
Instead, initialize weights randomly
Glorot (Xavier) Normal Initialization
Imagine a linear function:
Say all w, x are IID:
For forward prop : Var(W(l)) = 1/N(l-1)
For backprop : Var(W(l)) = 1/N(l)
Compromise :
Despite a lot of assumptions, it works well!!
Xavier Glorot, Yoshua Bengio. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:249-256, 2010.
Why…
… should variances be equal?
… should we take harmonic
mean as compromise?
The original paper doesn’t
explicitly explain these
I think of it as similar to
vanishing and exploding
gradient issues.
Xavier Glorot, Yoshua Bengio. Proceedings of the Thirteenth International
Conference on Artificial Intelligence and Statistics, PMLR 9:249-256, 2010.
Glorot (Xavier) Uniform Initialization
Let
He Initialization
Glorot normal assumes linear functions
Actual activations like ReLU are non-linear
So…
See for details: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Proceedings of ICCV ’15, pp 1026-1034.
Bias Initialization and Choosing an Initializer
He initialization works slightly better than Glorot
Nothing to choose between Normal and Uniform
Biases not as important as weights
All zeros work OK
For ReLU, give a small positive value like 0.1
to biases to prevent neurons dying
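A minimal NumPy sketch of the Glorot and He normal initializers (my own function names) for a weight matrix of shape (N_out, N_in), plus the small positive ReLU bias suggested above:

import numpy as np

def glorot_normal(n_in, n_out, rng=np.random.default_rng(0)):
    # Var(W) = 2/(n_in + n_out): harmonic-mean compromise between 1/n_in and 1/n_out
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_normal(n_in, n_out, rng=np.random.default_rng(0)):
    # Var(W) = 2/n_in: accounts for ReLU zeroing out roughly half its inputs
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W = he_normal(784, 200)     # e.g. first junction of the MNIST [784,200,10] network
b = np.full(200, 0.1)       # small positive bias helps keep ReLU units alive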
Comparison of Initializers
Weights
initialized
with all 0s
Weights
stay as
all 0s !!
Histograms of a few weights in 2nd
junction after training for 10 epochs
MNIST [784,200,10]
Regularization: None
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
Normalization
Imagine trying to predict how good a footballer is…
Feature Units Range
Height Meters 1.5 to 2
Weight Kilograms 50 to 100
Shot speed Kmph 120 to 180
Shot curve Degrees 0 to 10
Age Years 20 to 35
Minutes played Minutes 5,000 to 20,000
Fake diving? -- Yes / No
Different features have very different scales
Input normalization (Preprocessing)
Say there are n total training samples, each with f features
Eg for MNIST: n=50000, f=784
Compute stats for each feature
across all training samples
μi : Mean
σi : Standard deviation
Mi : Maximum value
mi : Minimum value
for all i from 1 to f
Essential preprocessing
Gaussian normalization:
Each feature is a unit Gaussian
Minmax normalization:
Each feature is in [0,1]
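A minimal sketch of the two normalizations (my own function names); the per-feature stats μ, σ, m, M are computed on the training set only and then reused on validation and test data:

import numpy as np

def gaussian_normalize(X, mu, sigma):
    return (X - mu) / (sigma + 1e-8)       # each feature becomes roughly zero-mean, unit-variance

def minmax_normalize(X, m, M):
    return (X - m) / (M - m + 1e-8)        # each feature lands in [0,1] on the training set

# X_train has shape (n, f); compute stats on training data only:
# mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
# m, M = X_train.min(axis=0), X_train.max(axis=0)
# then apply the same stats to validation and test data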
Feature dependency and scaling
We want to make features independent (why? think dropout)
We want to make the feature covariance matrix diagonal --> Whitening
In fact, we want to make it the identity, so that all features have the same scale
Zero Components Analysis (ZCA) Whitening
Then…
Before and after ZCA whitening
Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009
Dimensionality Reduction
Inputs can have a LOT of features
Eg: MNIST has 784 features, but we
probably don’t need the corner pixels
Simplify NN architecture by keeping only the essential input features
Principal Component Analysis (PCA)
Keep features with maximum variance
=> features which change the most
Keep the top f’ < f components (eigenvalues assumed sorted in decreasing order)
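A minimal NumPy sketch of PCA (my own illustration): eigendecompose the training covariance matrix and keep the f’ directions of largest variance:

import numpy as np

def pca(X_train, f_reduced):
    Xc = X_train - X_train.mean(axis=0)          # center the features
    cov = Xc.T @ Xc / (len(Xc) - 1)              # (f, f) covariance matrix
    d, U = np.linalg.eigh(cov)                   # eigenvalues come out in ascending order
    top = np.argsort(d)[::-1][:f_reduced]        # indices of the f' largest eigenvalues
    return Xc @ U[:, top], U[:, top]             # reduced features and the projection matrix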
Linear Discriminant Analysis (LDA)
Decrease variance within a class,
increase variance between classes
LDA keeps features which can
discriminate well between classes
Linear Discriminant Analysis (LDA)
C: Number of classes
Nc: Number of samples in class c
μc: Feature mean of class c
Intra (within)
class scatter
Minimize
Inter (between)
class scatter
Maximize
PCA vs LDA
PCA is unsupervised - just looks at data
LDA is supervised, looks at classes
Remember…
All statistics should be computed on training data only
Then apply them to validation and test data
Example:
Global Contrast Normalization (GCN)
Normalize the contrast (standard deviation of pixels) of each image, one at a time
Different from other normalizations which
operate on all training images together
Batch normalization
Input normalization scales the input features. But what about internal
activations? Those are also input features to the layers after them
C. Szegedy, W. Liu, Y. Jia et al., “Going deeper with convolutions,” in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
Internal activations get shifted to crazy values in deep networks
Batch normalization
Re-normalize internal values using a minibatch,
then apply some trainable (μ,σ) = (β, 𝛾) to them
x can be any value inside any layer of the network
Normalize
ϵ is just a small number to prevent division by 0. Can be machine epsilon
Scale the normalized values and substitute
Sergey Ioffe, Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. Proc. 32nd ICML, 37, pp 448-456, 2015.
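A minimal sketch of the batch-norm forward pass for one minibatch (my own function names); γ and β are the trainable scale and shift:

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-8):
    # x: minibatch of values at some point in the network, shape (M, features)
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize over the minibatch
    return gamma * x_hat + beta              # scale and shift with trainable parameters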
Where to apply Batch Normalization?
Option 1: Before activation (at s)
Option 2: After activation (at a)
(pre-activation s --> h(.) --> activation a)
Original paper recommends before activation.
This means scaling the input to ReLU, then ReLU can
do its job by discarding anything <=0.
Scaling after ReLU doesn’t make sense for β since it
can be absorbed by the parameters of the next layer.
But… People still argue over this. I have done
both and saw almost no difference.
More notes on batch normalization
• BN prevents internal covariate shift, i.e. each layer sees normalized inputs in a desired range instead of some crazy shifted range due to previous layers
• BN also has a slight regularizing effect since each layer depends less on previous layers, so its weights get smaller (kinda like what happens in dropout)
• Computing μ and σ per minibatch may not be applicable during validation or testing, since there is no proper notion of batch size there. Instead, keep exponentially weighted moving averages of μ and σ as new samples come in during training, and use those at inference.
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
Data handling
• Collection
• Contamination
• Public datasets
• Synthetic data
• Augmentation
Collection and labeling
• Collected from real world sources - internet databases, surveys,
paying people, public records, hospitals, etc…
• Labeled manually, usually crowd-sourced (Mechanical Turk)
Examples from Imagenet
http://www.image-net.org
Contamination
HW1: Human vs Computer, Binary
Did anyone cheat by using numpy? ;-)
Ground truth
Predicted
Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015
Sometimes ground truth
labels can be unbelievable!
Neural Networks are Lazy
Train
For the NN, green background = cat
Test
Network output: Cat
Missing Entries
Example: Adult Dataset
https://archive.ics.uci.edu/ml/datasets/adult
Features:
• Age
• Working class
• Education
• Marital status
• Occupation
• Race
• Sex
• Capital gain
• Hours per week
• Native country
Label: Income level
Missing at Random: Remove the sample
Missing not at Random: Removing sample may produce bias
Example: People with a low education level may not reveal it
Other ways:
Fill missing numerical values with mean or median. Eg: Age = 40
Fill missing categorical values with mode. Eg: Education = College
Small Dataset Size
Example: Wisconsin Breast Cancer Dataset
https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
Features (image
of breast cells):
• Radius
• Perimeter
• Smoothness
• Compactness
• Symmetry
Labels:
• Malignant
• Benign
Original dataset: 699 samples with missing values
Modified dataset: 569 samples
Such small datasets are generally not suited for neural networks due to overfitting
Commonly used Image Datasets
• MNIST (MLP / CNN):
• 28x28 images, 10 classes
• Initial benchmark
• If you’re writing papers, don’t just do MNIST!
• SOTA testacc: >99%
• Harder Variation: Fashion MNIST
• CIFAR-10, -100 (CNN):
• 32x32x3 images (RGB), 10 or 100 classes
• Widely used benchmark
• SOTA testacc: ~97%, ~84%
Commonly used Image Datasets
• Imagenet (CNN):
• Overall database has >14M images
• ILSVRC challenge has >1M images, 224x224x3, 1000 classes
• Widely used benchmark, don’t attempt without enough computing power!
• Alexnet top-1, top-5 testacc: 62.5%, 83% (not SOTA, but very famous!)
• Simpler variation: Tiny Imagenet
• Microsoft Coco (CNN / Encoder-Decoder):
• 330k images, 80 categories
• Segmentation, object detection
• Street View House Numbers (CNN / MLP):
• Think MNIST, but more real-world, harder, bigger
• SOTA testacc > 98%
Other Commonly used Datasets
• TIMIT (Preprocessing + MLP + Decoder):
• Automatic Speech recognition (Guest lecture coming up on 2/27)
• Speaker and phoneme classification
• Reuters (MLP):
• Classify articles into news categories
• Multiple / tree structured labels
• IMDB Reviews:
• Natural language processing
• YouTube-8M:
• Annotate videos
Source for Datasets
• Kaggle: https://www.kaggle.com/datasets
• UCI ML repository: https://archive.ics.uci.edu/ml/datasets.html
• Google: https://toolbox.google.com/datasetsearch
• Amazon Web Services: https://registry.opendata.aws/
Synthetic Datasets
Data that is artificially manufactured (using algorithms),
rather than generated by real-world events
• Tunable algorithms to mimic real-world as required
• Cheap to produce large amounts of data
• We created data on Morse code symbols with tunable classification difficulty:
https://github.com/usc-hal/morse-dataset
• Can be used to augment real-world data
S. Dey, K. M. Chugg, and P. A. Beerel, “Morse code datasets for machine learning,” in Proc. 9th Int. Conf.
Computing, Communication and Networking Technologies (ICCCNT), Jul 2018, pp. 1–7. Won Best Paper Award.
Data Augmentation
Training more on the same limited data results in overfitting
The solution: Create more data!
Eg: More MNIST images can simulate more
different ways of writing digits
If the network sees more, it can learn more
Data Augmentation Examples for Images
Original
Cropped
Flip Top-Bottom
Do NOT do for digits
Flip Left-Right
Transpose
Rotate 30°
Elastic
Transform
Simard, Steinkraus and Platt, "Best Practices
for Convolutional Neural Networks applied to
Visual Document Analysis", in Proc. Int. Conf.
on Document Analysis and Recognition, 2003.
Data Augmentation in Python
>> pip install Pillow
from PIL import Image
# x is a flattened 28x28 MNIST image with pixel values in [0,1]
im = Image.fromarray((x.reshape(28,28)*255).astype('uint8'), mode='L')
im.show()
im.transpose(Image.FLIP_LEFT_RIGHT)   # mirror left-right
im.transpose(Image.FLIP_TOP_BOTTOM)   # mirror top-bottom (do NOT do for digits)
im.transpose(Image.TRANSPOSE)         # swap rows and columns
im.rotate(30)                         # rotate 30 degrees
im.crop(box=(5,5,23,23))              # (left, top, right, bottom)
# each call returns a new Image object; assign the result to keep the augmented copy
I use the Pillow Imaging library
Outline
• Activation functions
• Cost function
• Optimizers
• Regularization
• Parameter initialization
• Normalization
• Data handling
• Hyperparameter selection
What is a Hyperparameter?
Anything which the network does NOT learn,
i.e. its value is adjusted by the user
Continuous Examples:
• Learning rate η
• Momentum α
• Regularization λ
Discrete Examples:
• What optimizer: SGD, Adam, RMSprop, …?
• What regularization: L2, L1, both, …?
• What initialization: Glorot, He, normal, uniform, …?
• Batch size M: 32, 64, 128, …?
• Number of MLP hidden layers: 1,2,3,…?
• How many neurons in each layer?
• What kinds of data augmentation?
Using data properly
Training Data
Feedforward +
Backprop + Update
Validation Data
Feedforward
Compute Metrics
Set parameters
Set hyperparameters
Test Data
Feedforward
Compute and report
Final Metrics
Do NOT use test data
until all parameters
and hyperparameters
are finalized
Complete training data
Cross-Validation
• Divide the dataset into k parts.
• Use all parts except i for training, then test on part i.
• Repeat for all i from 1 to k.
• Average all k test metrics to get final test metric.
Useful when dataset is small
Generally not needed for NN problems
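A minimal sketch of k-fold cross-validation (my own illustration); train_and_eval is a placeholder that trains on the given training indices and returns the metric on the held-out fold:

import numpy as np

def k_fold_metric(n_samples, k, train_and_eval, rng=np.random.default_rng(0)):
    folds = np.array_split(rng.permutation(n_samples), k)
    metrics = []
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        metrics.append(train_and_eval(train_idx, folds[i]))   # train on the other k-1 parts, test on part i
    return np.mean(metrics)                                   # final metric = average over folds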
Hyperparameter Search Strategies
Pic courtesy: J. Bergstra, Y. Bengio, “Random Search for Hyperparameter
Optimization,” Journal of Machine Learning Research, vol. 13, pp. 281–305, 2012
Grid search for (η,α):
(0.1,0.5), (0.1,0.9), (0.1,0.99),
(0.01,0.5), (0.01,0.9), (0.01,0.99),
(0.001,0.5), (0.001,0.9), (0.001,0.99)
Random search for (η,α):
(0.07,0.68), (0.002,0.94), (0.008,0.99),
(0.08,0.88), (0.005,0.62), (0.09,0.75),
(0.14,0.81), (0.006,0.76), (0.01,0.8)
Random search samples more values
of the important hyperparameter
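A minimal sketch of random search over (η, α) (the sampling ranges are my own): draw each trial independently, with η log-uniform and α concentrated near 1:

import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparams():
    eta = 10 ** rng.uniform(-3, -1)            # learning rate, log-uniform in [0.001, 0.1]
    alpha = 1 - 10 ** rng.uniform(-2, -0.3)    # momentum roughly in [0.5, 0.99]
    return eta, alpha

trials = [sample_hyperparams() for _ in range(9)]
# train a model for each (eta, alpha) and keep the one with the best validation metric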
Hyperparameter Optimization Algorithm: Tree-Structured Parzen Estimator (TPE)
1. Decide initial probability distribution for each hyperparameter
• Example: η ~ logUniform(10^-4, 1); λ ~ logUniform(10^-6, 10^-3)
2. Set some performance threshold
• Example: MNIST validation accuracy after 10 epochs = 96%
3. Sample multiple hyp_sets from the distributions
4. Train, and compute validation metrics for each hyp_set
5. Update distribution and threshold:
• If performance of a hyp_set is worse than threshold: Discard
• If performance of a hyp_set is better than threshold: Change distribution accordingly
6. Repeat 3-5
J. Bergstra, R. Bardenet, Y. Bengio, B. Kegl, “Algorithms for Hyper-Parameter Optimization”, Proc. NIPS, pp 2546-2554, 2011.
Potential project topic
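One third-party implementation of TPE is the hyperopt package. This is a minimal sketch assuming hyperopt is installed and that train_and_validate is a hypothetical helper returning a quantity to minimize (e.g. 1 − validation accuracy):

import numpy as np
from hyperopt import fmin, tpe, hp

space = {'eta': hp.loguniform('eta', np.log(1e-4), np.log(1.0)),
         'lam': hp.loguniform('lam', np.log(1e-6), np.log(1e-3))}

def objective(hyp):
    return train_and_validate(eta=hyp['eta'], lam=hyp['lam'])   # hypothetical training helper

best = fmin(objective, space, algo=tpe.suggest, max_evals=50)   # TPE proposes the next hyp_sets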
Learning rate schedules
When using simple SGD, start with a big learning
rate, then decrease it for smoother convergence
Pic courtesy: http://prog3.com/sbdm/blog/google19890102/article/details/50276775
Learning rate schedules
Exponential decay
Step decay
Fractional decay (used in Keras)
(Plot: initial η decaying over epochs, shown on linear and log scales)
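Minimal sketches of the three decay schedules (my own function names); the fractional form is the 1/(1 + decay·t) rule implemented by Keras' SGD decay argument:

import numpy as np

def exponential_decay(eta0, k, epoch):
    return eta0 * np.exp(-k * epoch)

def step_decay(eta0, drop, epochs_per_drop, epoch):
    return eta0 * drop ** (epoch // epochs_per_drop)     # e.g. halve eta every few epochs

def fractional_decay(eta0, decay, iteration):
    return eta0 / (1 + decay * iteration)                # the form used by Keras' `decay` argument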
Other Learning rate strategies (Potential project topic)
Warmup: Increase η for a few epochs at the
beginning. Works well for large batch sizes.
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz
Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He. "Accurate,
Large Minibatch SGD: Training ImageNet in 1 Hour". arXiv:1706.02677
Triangular Scheduling
L. N. Smith, “Cyclical Learning Rates for Training Neural
Networks”, arXiv:1506.01186
Pic courtesy https://www.jeremyjordan.me/nn-learning-rate/
Cosine Scheduling
I. Loshchilov, F. Hutter, “SGDR: Stochastic Gradient Descent
with Warm Restarts”, Proc. ICLR 2017.
That’s all
Thanks!
