Deep Learning
In the name of God
Mehrnaz Faraz
Faculty of Electrical Engineering
K. N. Toosi University of Technology
Milad Abbasi
Faculty of Electrical Engineering
Sharif University of Technology
AI, ML and DL
Why Deep Learning?
• Deep learning often performs better on classification tasks
than other traditional machine learning methods, because:
– Deep learning methods use multi-layer processing, which
can achieve better accuracy, often in less time.
Deep Learning Applications
• Image recognition
• Speech recognition
• Robots and self-driving cars
• Healthcare
• Portfolio management
• Weather forecasting
Strengths and Challenges
• Strengths:
– No need for feature engineering
– Best results with unstructured data
– No need for labeling of data
– Efficient at delivering high-quality results
– Hardware and software support
• Challenges:
– The need for lots of data
– Neural networks at the core of deep learning are black
boxes
– Overfitting the model
– Lack of flexibility
Deep Learning Algorithms
• Unsupervised Learning
– Auto Encoders (AE)
– Generative Adversarial Networks (GAN)
• Supervised Learning
– Recurrent Neural Networks (RNN)
– Convolutional Neural Networks (CNN)
• Semi-Supervised Learning
• Reinforcement Learning
Unsupervised Learning
• Find structure or patterns in the unlabeled data
– Clustering
– Compression
– Feature & Representation learning
– Dimensionality reduction
– Generative models
Supervised Learning
• Learn a mapping function f where: y = f(x)
– Classification
– Regression
– Data: pairs (x, y) of inputs and labels

Deep Neural Network
[Figure: a deep neural network]
Overfitting
• The model fits the training set too well, but does not
perform as well on the test set
• Occurs in complex networks trained on small data sets
Overfitting
• Steps for reducing overfitting:
– Add more data
– Use data augmentation
– Use architectures that generalize well
– Add regularization (most commonly dropout; L1/L2
regularization is also possible)
– Reduce architecture complexity
Training Neural Network
• Data preparation
• Choosing a network architecture
• Training algorithm and optimization
• Improving training algorithm
– Improve convergence rate
Data preparation
• Data must be made suitable for a given method
• Data in the real world is dirty
– Incomplete: Lacking attribute values, lacking certain
attributes of interest
– Noisy: Containing errors or outliers
– More data = better training
– Remove incomplete and corrupted data
Data preparation
• Data pre-processing:
– Normalization: Helps prevent attributes with large
ranges from outweighing attributes with small ranges
• Min-max normalization:
$x_{new} = \frac{x_{old} - \min(x_{old})}{\max(x_{old}) - \min(x_{old})}\,(new\_max - new\_min) + new\_min$
• Z-score normalization:
$x_{new} = \frac{x_{old} - \mathrm{mean}(x_{old})}{\mathrm{std}(x_{old})}$
– Doesn't eliminate outliers
– Resulting mean: 0, std: 1
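A minimal Python/NumPy sketch of these two normalizations (illustrative only):

```python
import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    """Rescale x linearly so its values lie in [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score_normalize(x):
    """Shift and scale x to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

x = np.array([2.0, 4.0, 6.0, 100.0])   # note the outlier
print(min_max_normalize(x))            # values in [0, 1]; the outlier still dominates
print(z_score_normalize(x))            # mean 0, std 1; the outlier is not removed
```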
Data preparation
• Histogram equalization:
– A technique for adjusting image intensities to enhance
contrast
[Figure: an image before and after histogram equalization]
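A minimal NumPy sketch of the idea for an 8-bit grayscale image (in practice one would typically use a library routine such as OpenCV's cv2.equalizeHist):

```python
import numpy as np

def equalize_histogram(img):
    """Histogram-equalize an 8-bit grayscale image (2-D uint8 array)."""
    hist = np.bincount(img.ravel(), minlength=256)      # intensity histogram
    cdf = hist.cumsum()                                  # cumulative distribution
    cdf_min = cdf[cdf > 0].min()
    # Map each intensity level through the normalized CDF to spread out values.
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[img]

# Example: a low-contrast image with values squeezed into [100, 140]
img = np.random.randint(100, 141, size=(64, 64), dtype=np.uint8)
eq = equalize_histogram(img)
print(img.min(), img.max())   # ~100 140
print(eq.min(), eq.max())     # ~0 255 after equalization
```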
Data preparation
• Data augmentation
– Increasing the number of data points; for images, this
means increasing the number of images in the dataset
(see the sketch after this list).
– Popular augmentation techniques (in terms of image):
• Flip
• Rotation
• Scale
• Crop
• Translation
• Gaussian noise
• Conditional GANs
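A minimal NumPy sketch of a few of these transforms, assuming images are stored as H×W×C arrays (libraries such as torchvision.transforms provide full implementations):

```python
import numpy as np

def horizontal_flip(img):
    """Mirror the image left-right."""
    return img[:, ::-1, :]

def random_crop(img, crop_h, crop_w, rng):
    """Cut out a random crop of size crop_h x crop_w."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w, :]

def add_gaussian_noise(img, std, rng):
    """Add zero-mean Gaussian noise and clip back to the valid pixel range."""
    noisy = img + rng.normal(0.0, std, img.shape)
    return np.clip(noisy, 0, 255).astype(img.dtype)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = [horizontal_flip(img),
             random_crop(img, 24, 24, rng),
             add_gaussian_noise(img, 10.0, rng)]
print([a.shape for a in augmented])
```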
Choosing a network architecture
• Layer type
– 1-dimensional data (signal, vector): FC, AE
– N-dimensional data (image, tensor): CNN
– Time-series data (speech, video, text): RNN
• Number of parameters
– Number of layers
– Number of neurons
• Start with a minimal number of hidden layers and nodes,
then increase them until you get good performance
(trial and error).
Training algorithm and optimization
• Training:
• Error calculation:
– Cost/ Loss function:
• Mean squared error
• Cross-entropy
• Softmax
Training loop: Input → Feed forward → Error calculation →
Back propagation → Updating parameters

Mean squared error: $L(y, \hat{y}) = \frac{1}{N} \sum_{i} (y_i - \hat{y}_i)^2$
Training algorithm and optimization
• Optimization:
– Modify and update the weights step by step to reach the
lowest possible error
– Tracks the change in the cost function value at each step
– Goal: reach the global minimum
[Figure: loss surface]
Training algorithm and optimization
– Gradient descent variants:
• Stochastic gradient descent
• Batch gradient descent
• Mini-batch gradient descent
– Gradient descent optimization algorithms:
• Momentum
• Adagrad
• Adadelta
• Adam
• RMSprop
• …
Stochastic Gradient Descent
• Updates the parameters for each training example $x^{(i)}$ and
label $y^{(i)}$
• Performs redundant computations for large datasets
• Performs frequent updates with high variance
(fluctuation)
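Written in the same form as the batch and mini-batch updates below, the single-example SGD update is:

$\theta_{new} = \theta_{old} - \eta \, \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})$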
Batch Gradient Descent
• Calculate the gradients for the whole dataset to perform
just one update
• Can be very slow
• The learning rate (η) determines how big an update we
perform

$\theta_{new} = \theta_{old} - \eta \, \nabla_{\theta} J(\theta)$
Mini-batch Gradient Descent
• Performs an update for every mini-batch of n training
examples
– Reduces the variance of the parameter updates
– Can make use of highly optimized matrix operations
– Common mini-batch sizes range between 50 and 256

$\theta_{new} = \theta_{old} - \eta \, \nabla_{\theta} J(\theta; x^{(i:i+n)}, y^{(i:i+n)})$
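A minimal NumPy sketch of the mini-batch loop for a toy linear model (batch size and learning rate are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(512, 3))
y = X @ np.array([1.0, -2.0, 0.5])

theta = np.zeros(3)
eta, batch_size = 0.05, 64

for epoch in range(50):
    perm = rng.permutation(len(y))           # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ theta - yb) / len(yb)   # gradient on the mini-batch
        theta -= eta * grad                  # parameter update

print(theta.round(2))                        # close to [1.0, -2.0, 0.5]
```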
Improving Training Algorithms
• Batch normalization
• Regularization
• Dropout
• Transfer learning
Batch Normalization
• A normalization method/layer for neural networks
• Helps prevent overfitting
• Reduces the dependence of the network on weight initialization
• Improves the gradient flow through the network
• Improves the speed, performance, and stability of the network
• Allows higher learning rates
Batch Normalization
• How does batch normalization help prevent overfitting?
– It has a slight regularization effect
– Similar to dropout, it adds some noise to each hidden
layer's activations
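A minimal NumPy sketch of the batch-norm forward pass at training time, where γ and β are the learnable scale and shift (frameworks such as PyTorch's nn.BatchNorm1d also track running statistics for use at inference time):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x: (batch_size, num_features) activations
    gamma, beta: (num_features,) learnable parameters
    """
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 and ~1 per feature
```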
Regularization
• The process of introducing additional information into the
loss function: a regularization term
• Adds a penalty for exploring certain regions of the function
space
• V is the loss function
• λ is the importance factor of the regularization term

$\min_{w} \sum_{i=1}^{n} V(h(x_i), y_i) + \lambda R(w)$
Regularization
• How does regularization prevent overfitting?
• L1 & L2 regularization:
– L1: Lasso Regression adds “absolute value of magnitude”
of coefficient as penalty term to the loss function.
– L2: Ridge regression adds “squared magnitude” of
coefficient as penalty term to the loss function.
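A minimal NumPy sketch of adding an L1 or L2 penalty to a mean-squared-error loss, with lam playing the role of λ:

```python
import numpy as np

def regularized_mse(w, X, y, lam, kind="l2"):
    """MSE loss plus an L1 (lasso) or L2 (ridge) penalty on the weights."""
    residual = X @ w - y
    data_loss = np.mean(residual ** 2)
    if kind == "l1":
        penalty = lam * np.sum(np.abs(w))        # absolute value of magnitude
    else:
        penalty = lam * np.sum(w ** 2)           # squared magnitude
    return data_loss + penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 0.0, -1.0])
w = np.array([0.9, 0.4, -1.1])
print(regularized_mse(w, X, y, lam=0.1, kind="l1"))
print(regularized_mse(w, X, y, lam=0.1, kind="l2"))
```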
Regularization
• Model: $f = w^{T} x + b$
• Objective: $\min_{w} \sum_{i=1}^{n} V(h(x_i), y_i) + \lambda R(w)$
• Increasing λ → decreasing R(w) → decreasing w
[Figure: under-fitting, appropriate fitting, and over-fitting]
Dropout
• Randomly chosen neurons are ignored during training
• These neurons are not considered during a particular forward or
backward pass
• A dropout probability of 0.5 usually works well
• Not used on the output layer
[Figure: a standard neural network before and after applying dropout]
Dropout
• How does dropout prevent overfitting?
– Dropout is a fast regularization method
– It prevents a layer from over-relying on a few of its inputs
– The network becomes less sensitive to the specific weights
of individual neurons
– Better generalization
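A minimal NumPy sketch of (inverted) dropout applied to one layer's activations during training; at test time the activations are used unchanged:

```python
import numpy as np

def dropout(activations, p_drop=0.5, rng=None, training=True):
    """Inverted dropout: zero out units with probability p_drop during training."""
    if not training or p_drop == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    keep_prob = 1.0 - p_drop
    mask = rng.random(activations.shape) < keep_prob   # which units survive
    return activations * mask / keep_prob              # rescale to preserve the expected value

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))            # a batch of hidden-layer activations
print(dropout(h, p_drop=0.5, rng=rng)) # roughly half the entries are zero
```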
Transfer Learning
• A model trained on one task is re-purposed on a second
related task.
• We first train a base network on a base (big) dataset and
task
• Repurpose the learned features, or transfer them, to a
second target network to be trained on a target dataset
and task
Transfer Learning
[Figure: a convolutional network (Conv, Pool, FC, Softmax layers) is trained on ImageNet (big data); its convolutional weights are transferred to a second network for our (low-data) task, the transferred layers are frozen, and only the last layers are trained]
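A minimal PyTorch-style sketch of this freeze-and-retrain recipe, assuming torchvision's pretrained ResNet-18 as the base network (the weights argument differs between torchvision versions):

```python
import torch.nn as nn
import torchvision.models as models

# Base network trained on ImageNet (the "big data" task).
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the transferred convolutional features.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for our (low-data) target task,
# e.g. a 10-class problem; only this new layer will be trained.
model.fc = nn.Linear(model.fc.in_features, 10)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)   # only the new fc layer's weight and bias
```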
Hyperparameter Optimization
• Hyperparameter:
– A parameter whose value is set before the learning
process begins
– Initial weights
– Number of layers
– Number of neurons
– Learning rate
– Convolution kernel width,…
Hyperparameter Optimization
• Manual tuning
– Monitor and visualize the loss curve / accuracy
• Automatic optimization
– Random search
– Grid search
– Bayesian hyperparameter optimization
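A minimal sketch of random search over two hyperparameters; train_and_evaluate is a hypothetical stand-in for training a model and returning its validation accuracy:

```python
import numpy as np

def train_and_evaluate(learning_rate, num_layers):
    """Hypothetical stand-in: train a model and return validation accuracy."""
    # Here we fake a score that peaks near lr = 1e-2 and 3 layers.
    return 1.0 - abs(np.log10(learning_rate) + 2) * 0.1 - abs(num_layers - 3) * 0.05

rng = np.random.default_rng(0)
best_score, best_config = -np.inf, None

for _ in range(20):                                  # 20 random trials
    config = {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # sample lr on a log scale
        "num_layers": int(rng.integers(1, 6)),
    }
    score = train_and_evaluate(**config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config, round(best_score, 3))
```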
Hyperparameter Optimization
• Bayesian hyperparameter optimization:
– Build a probability model of the objective function and
use it to select the most promising hyperparameters to
evaluate on the true objective function
Cross Validation
• K-fold cross validation:
– A model is trained using (k-1) of the folds as training data
– The resulting model is validated on the remaining part of
the data
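A minimal sketch of k-fold splitting with scikit-learn's KFold; the score function is a hypothetical stand-in for training and validating a model on one fold:

```python
import numpy as np
from sklearn.model_selection import KFold

def score(train_idx, val_idx):
    """Hypothetical stand-in: train on train_idx and return validation accuracy."""
    return 0.90 + 0.001 * (val_idx[0] % 5)   # fake score, varies a little per fold

X = np.arange(100).reshape(-1, 1)            # 100 samples, 1 feature
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Each iteration trains on (k-1) folds and validates on the remaining fold.
fold_scores = [score(tr, va) for tr, va in kf.split(X)]
print(np.mean(fold_scores), fold_scores)
```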
Weight initialization
• Zero initialization:
– In a fully connected layer, there is no asymmetry between
neurons
– In a layer, every neuron has the same output
– In a layer, all the weights update in the same way
[Figure: a fully connected network with all weights initialized to 0]
Weight initialization
• Small random numbers:
– Breaks the symmetry
– Can cause vanishing gradients flowing backward through
the network: with small weights and activation derivatives,
each factor in the chain rule is small, so the gradient
shrinks layer by layer
• A concern for deep networks

$\frac{\partial E_1}{\partial w^1_{11}} = \frac{\partial E_1}{\partial e_1} \frac{\partial e_1}{\partial o^2_1} \frac{\partial o^2_1}{\partial net^2_1} \frac{\partial net^2_1}{\partial o^1_1} \frac{\partial o^1_1}{\partial net^1_1} \frac{\partial net^1_1}{\partial w^1_{11}} = e_1 \, f'^{\,2} \, w^2_{11} \, f'^{\,1} \, x_1$
Weight initialization
• Large random numbers:
– Breaks the symmetry
– Causes saturation of the activations $f(w^{T} x + b)$
– Can cause exploding gradients flowing backward through
the network

$\frac{\partial E_1}{\partial w^1_{11}} = \frac{\partial E_1}{\partial e_1} \frac{\partial e_1}{\partial o^2_1} \frac{\partial o^2_1}{\partial net^2_1} \frac{\partial net^2_1}{\partial o^1_1} \frac{\partial o^1_1}{\partial net^1_1} \frac{\partial net^1_1}{\partial w^1_{11}} = e_1 \, f'^{\,2} \, w^2_{11} \, f'^{\,1} \, x_1$
Weight initialization
• Train with multiple small random numbers
• Measure the errors
• Select the initial weights that produce the smallest
errors
[Figure: several candidate initializations $w_1, w_2, \dots$ are each trained briefly, producing errors $E_1, \dots, E_n$; the weights with the smallest error are selected]
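A minimal NumPy sketch of this idea on a toy linear model: try several small random initializations, train each briefly, and keep the one with the smallest error:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def short_training_run(w0, steps=20, eta=0.05):
    """Run a few gradient steps from initial weights w0 and return the final MSE."""
    w = w0.copy()
    for _ in range(steps):
        w -= eta * 2 * X.T @ (X @ w - y) / len(y)
    return np.mean((X @ w - y) ** 2)

candidates = [rng.normal(scale=0.01, size=3) for _ in range(5)]   # small random inits
errors = [short_training_run(w0) for w0 in candidates]

best = candidates[int(np.argmin(errors))]   # keep the initialization with smallest error
print(np.round(errors, 4))
print(best.round(3))
```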
Feature Selection
• Automatic
• Manual (by hand)
– Forward selection
– Backward elimination
Forward Selection
• Begins with an empty model and adds in variables one by
one
• Adds the one variable that gives the single best
improvement to our model
– Candidate features $x_1$, $x_2$, $x_3$ are each tried alone,
giving errors $e_1$, $e_2$, $e_3$
– $\min(e_1, e_2, e_3)$: the feature with the minimum error is
selected
Forward Selection
• Suppose that $x_1$ is selected
• $x_1$ and $x_2$ → $e_{12}$
• $x_1$ and $x_3$ → $e_{13}$
• $\min(e_{12}, e_{13})$: the feature set with the minimum error is
selected
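A minimal sketch of greedy forward selection; the evaluate function, which fits a model on a chosen feature subset and returns its error, is a hypothetical stand-in:

```python
import numpy as np

def evaluate(X, y, features):
    """Hypothetical stand-in: least-squares fit on the chosen columns, return MSE."""
    A = X[:, features]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((A @ w - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 2 * X[:, 0] - 1 * X[:, 2] + 0.1 * rng.normal(size=100)   # only features 0 and 2 matter

selected, remaining = [], list(range(X.shape[1]))
for _ in range(2):                                   # add two features greedily
    errors = {f: evaluate(X, y, selected + [f]) for f in remaining}
    best = min(errors, key=errors.get)               # feature with minimum error
    selected.append(best)
    remaining.remove(best)

print(selected)   # expected: [0, 2]
```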
Backward Elimination
• Removes the least significant feature at each iteration
• Steps:
– Train the model with all the independent variables
– Eliminate independent variables that do not improve
performance
– Repeat until no improvement is observed from removing
features
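A matching sketch of backward elimination under the same assumptions (a hypothetical evaluate function based on a validation split):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] - 1 * X[:, 2] + 0.1 * rng.normal(size=200)
X_train, X_val = X[:100], X[100:]
y_train, y_val = y[:100], y[100:]

def evaluate(features):
    """Hypothetical stand-in: least-squares fit on the chosen columns, validation MSE."""
    w, *_ = np.linalg.lstsq(X_train[:, features], y_train, rcond=None)
    return np.mean((X_val[:, features] @ w - y_val) ** 2)

# Start from all features; repeatedly drop the feature whose removal hurts least,
# as long as validation error does not get noticeably worse (5% tolerance here).
features = list(range(X.shape[1]))
current = evaluate(features)
improved = True
while improved and len(features) > 1:
    improved = False
    trials = {f: evaluate([g for g in features if g != f]) for f in features}
    candidate = min(trials, key=trials.get)          # least significant feature
    if trials[candidate] <= current * 1.05:
        features.remove(candidate)
        current = trials[candidate]
        improved = True

print(features)   # expected to keep the informative features [0, 2]
```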