deep learning
Algorithms and Applications
Bernardete Ribeiro, bribeiro@dei.uc.pt
University of Coimbra, Portugal
INIT/AERFAI Summer School on Machine Learning, Benicassim 22-26 June 2015
III - Deep Learning Algorithms
1
elements 3: deep neural networks
outline
∙ Learning in Deep Neural Networks
∙ Deep Learning: Evolution Timeline
∙ Deep Architectures
∙ Restricted Boltzmann Machines (RBMs)
∙ Deep Belief Networks (DBNs)
∙ Deep Models Overall Characteristics
3
learning in deep neural networks
learning in deep neural networks
1. No general learning algorithm (no-free-lunch theorem,
Wolpert 1996)
2. Learning algorithms for specific tasks - perception, control,
prediction, planning, reasoning, language understanding
3. Limitations of backpropagation (BP) - local minima, optimization
challenges for non-convex objective functions
4. Hinton’s deep belief networks (DBNs) as a stack of RBMs
5. LeCun’s energy-based learning for DBNs
5
deep learning: evolution timeline
1. Perceptron [Frank Rosenblatt, 1959]
2. Neocognitron [K Fukushima, 1980]
3. Convolutional Neural Network (CNN) [LeCun, 1989]
4. Multi-level Hierarchy Networks [Jürgen Schmidhuber, 1992]
5. Deep Belief Networks (DBNs) as a stack of RBMs [Geoffrey
Hinton, 2006]
6
deep architectures
from brain-like computing to deep learning
∙ New empirical and theoretical results have brought deep
architectures into the focus of Machine Learning (ML)
researchers [Larochelle et al., 2007].
∙ Theoretical results suggest that deep architectures are
fundamental for learning the kind of complicated, brain-like
functions that can represent high-level abstractions (e.g.
vision, speech, language) [Bengio, 2009]
8
deep concepts main idea
9
deep neural networks
∙ Convolutional Neural Networks (CNNs) [LeCun et al., 1989]
∙ Deep Belief Networks (DBNs) [Hinton et al., 2006]
∙ AutoEncoders (AEs) [Bengio et al., NIPS 2006]
∙ Sparse Autoencoders [Ranzato et al., NIPS 2006]
10
convolutional neural networks (cnns)
∙ A Convolutional Neural Network consists of two basic
operations:
∙ convolution
∙ pooling
∙ Convolutional and pooling layers
are arranged alternately until
high-level features are obtained
∙ Several feature maps in each
convolutional layer
∙ Weights in the same map are
shared
[Figure: CNN architecture — input, convolutional layers C1, C3 alternating with subsampling layers S2, S4, followed by an NN stage; from I. Arel, D. Rose & T. Karnowski, Deep Machine Learning — A New Frontier in Artificial Intelligence Research, IEEE CIM, 2010]
11
convolutional neural networks (cnns)
∙ Convolution: suppose the size of the layer is d × d
and the size of the receptive fields is r × r, and let γ and x
denote respectively the values of the convolutional
layer and the previous layer:
$$\gamma_{ij} = g\Big(\sum_{m=1}^{r}\sum_{n=1}^{r} x_{i+m-1,\,j+n-1}\, w_{m,n} + b\Big), \qquad i, j = 1, \dots, (d - r + 1)$$
where g is a nonlinear function.
∙ Pooling follows convolution to reduce the
dimensionality of the features and to introduce
translational invariance into the CNN.
12
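To make the two operations concrete, here is a minimal NumPy sketch (not from the slides) of a single feature map: a valid convolution implementing the γ_ij formula above, with a logistic nonlinearity as one possible choice of g, followed by 2 × 2 max pooling. The function names and the 5 × 5 receptive field are illustrative assumptions.

```python
import numpy as np

def conv2d_valid(x, w, b):
    """One feature map: gamma_ij = g(sum_{m,n} x[i+m-1, j+n-1] * w[m, n] + b)."""
    d, r = x.shape[0], w.shape[0]
    out = np.zeros((d - r + 1, d - r + 1))
    for i in range(d - r + 1):
        for j in range(d - r + 1):
            out[i, j] = np.sum(x[i:i + r, j:j + r] * w) + b
    return 1.0 / (1.0 + np.exp(-out))        # g: logistic nonlinearity (one common choice)

def max_pool2x2(fm):
    """2 x 2 max pooling: reduces feature dimensionality, adds translational invariance."""
    h, w = fm.shape[0] // 2, fm.shape[1] // 2
    return fm[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))

x = np.random.rand(28, 28)               # e.g. a 28 x 28 input image
w = np.random.randn(5, 5) * 0.1          # a 5 x 5 receptive field (weights shared across the map)
feature_map = conv2d_valid(x, w, b=0.0)  # 24 x 24
pooled = max_pool2x2(feature_map)        # 12 x 12
```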
deep belief networks (dbns)
∙ Probabilistic generative models, in
contrast with the discriminative
nature of other NNs
∙ Generative models provide a joint
probability distribution over data
and labels
∙ Unsupervised greedy layer-wise
pre-training followed by final
fine-tuning
[Figure: DBN on a 28 × 28 pixel image — a stack of RBM layers (visible/hidden pairs) topped by a detection layer whose top-level units connect the labels with the hidden units; based on I. Arel, D. Rose & T. Karnowski, Deep Machine Learning — A New Frontier in Artificial Intelligence Research, IEEE CIM, 2010]
13
autoencoders (aes)
∙ The auto-encoder has two
components:
∙ the encoder f (mapping x to h) and
∙ the decoder g (mapping h to r)
∙ An auto-encoder is a neural
network that tries to reconstruct
its input at its output
[Figure: auto-encoder — input x → encoder f → code h → decoder g → reconstruction r; based on Y. Bengio, I. Goodfellow and A. Courville, Deep Learning, an MIT Press book (in preparation), www.iro.umontreal.ca/~bengioy/dlbook]
14
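As a small illustration of the encoder/decoder pair, the sketch below builds a one-hidden-layer auto-encoder with logistic units and tied weights (the tied-weight choice and the layer sizes are assumptions, not from the slide); training would adjust W, b_h and b_v, e.g. by backpropagation, to reduce the reconstruction error.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden = 784, 100
W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))  # encoder weights; decoder reuses W.T (tied)
b_h = np.zeros(n_hidden)                                # code (hidden) bias
b_v = np.zeros(n_visible)                               # reconstruction (visible) bias

x = rng.random(n_visible)                      # input x
h = sigmoid(x @ W + b_h)                       # encoder f: x -> code h
r = sigmoid(h @ W.T + b_v)                     # decoder g: h -> reconstruction r
reconstruction_error = np.mean((x - r) ** 2)   # quantity a training procedure would minimize
```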
deep architectures versus shallow architectures
∙ Deep architectures can be exponentially more efficient
than shallow architectures [Roux and Bengio, 2010].
∙ Functions that can be compactly represented with a Neural
Network (NN) of depth d, may require an exponential number
of computational elements for a network with depth d − 1
[Bengio, 2009].
∙ Since the number of computational elements depends on
the number of training samples available, using shallow
architectures may result in models with poor
generalization [Bengio, 2009].
∙ As a result, deep architecture models tend to outperform
shallow models such as SVMs [Larochelle et al., 2007].
15
Restricted Boltzmann Machines
Deep Belief Networks
16
restricted boltzmann machines
restricted boltzmann machines (rbms)
[Figure: RBM — visible units v1 … vI and hidden units h1 … hJ, each layer with a bias unit; the hidden→visible direction acts as a decoder and the visible→hidden direction as an encoder]
18
restricted boltzmann machines (rbms)
∙ Unsupervised
∙ Find complex regularities in
training data
∙ Bipartite Graph
∙ visible, hidden layer
∙ Binary stochastic units
∙ On/Off with probability
∙ 1 Iteration
∙ Update Hidden Units
∙ Reconstruct Visible Units
∙ Maximum Likelihood of
training data
[Figure: RBM bipartite graph — visible units v1 … vI with bias, hidden units h1 … hJ with bias (encoder direction)]
19
restricted boltzmann machines (rbms)
∙ Training goal: the most probable
reproduction of the data
∙ unsupervised data
∙ find the latent factors of the
data set
∙ Adjust the weights to maximize
the probability of the input data
[Figure: RBM bipartite graph — visible units v1 … vI with bias, hidden units h1 … hJ with bias (encoder direction)]
20
restricted boltzmann machines (rbms)
Given an observed state, the energy of the joint configuration
of the visible units and hidden units (v, h) is given by:
$$E(v, h) = -\sum_{i=1}^{I} c_i v_i - \sum_{j=1}^{J} b_j h_j - \sum_{j=1}^{J}\sum_{i=1}^{I} W_{ji} v_i h_j \,, \qquad (1)$$
where W is the matrix of weights, and b and c are the biases of
the hidden and visible layers, respectively.
[Figure: RBM bipartite graph — visible units v1 … vI with bias, hidden units h1 … hJ with bias (encoder/decoder directions)]
21
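Equation (1) maps directly onto a few lines of NumPy; the sketch below assumes binary vectors v (length I) and h (length J) and a J × I weight matrix indexed as W[j, i], matching the W_ji indexing above.

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """E(v, h) = - sum_i c_i v_i - sum_j b_j h_j - sum_{j,i} W_ji v_i h_j   (Eq. 1)."""
    return -np.dot(c, v) - np.dot(b, h) - np.dot(h, W @ v)

# toy configuration just to exercise the function
rng = np.random.default_rng(1)
I, J = 6, 4
W = rng.normal(0.0, 0.01, size=(J, I))    # weights W[j, i]
b, c = np.zeros(J), np.zeros(I)           # hidden and visible biases
v = rng.integers(0, 2, I).astype(float)
h = rng.integers(0, 2, J).astype(float)
print(rbm_energy(v, h, W, b, c))
```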
restricted boltzmann machines (rbms)
The Restricted Boltzmann Machine (RBM) assigns a
probability to each configuration (v, h), using:
$$p(v, h) = \frac{e^{-E(v,h)}}{Z}\,, \qquad (2)$$
where Z is a normalization constant called the partition function,
obtained by summing $e^{-E(v,h)}$ over all possible (v, h)
configurations [Bengio, 2009, Hinton, 2010,
Carreira-Perpiñán and Hinton, 2005]:
$$Z = \sum_{v,h} e^{-E(v,h)}\,. \qquad (3)$$
22
restricted boltzmann machines (rbms)
Since there are no connections between any two units within
the same layer, given a particular random input
configuration, v, all the hidden units are independent of each
other and the probability of h given v becomes:
$$p(h \mid v) = \prod_{j} p(h_j = 1 \mid v)\,, \qquad (4)$$
where
$$p(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_{i=1}^{I} v_i W_{ji}\Big)\,. \qquad (5)$$
23
restricted boltzmann machines (rbms)
Similarly given a specific hidden state, h, the probability of v
given h is obtained by (6):
$$p(v \mid h) = \prod_{i} p(v_i = 1 \mid h)\,, \qquad (6)$$
where:
$$p(v_i = 1 \mid h) = \sigma\Big(c_i + \sum_{j=1}^{J} h_j W_{ji}\Big)\,. \qquad (7)$$
24
restricted boltzmann machines (rbms)
Given a random training vector v, the state of a given hidden
unit j is set to 1 with probability:
$$p(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i W_{ij}\Big)$$
Similarly:
$$p(v_i = 1 \mid h) = \sigma\Big(c_i + \sum_j h_j W_{ij}\Big)$$
where σ(x) is the sigmoid squashing function $\frac{1}{1+e^{-x}}$.
25
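These two conditionals are all an implementation needs in order to sample the binary stochastic units; a minimal sketch follows, assuming W is stored as a J × I matrix so that p(h_j = 1 | v) becomes `sigmoid(b + W @ v)` and p(v_i = 1 | h) becomes `sigmoid(c + W.T @ h)`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, W, b, rng):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i W_ji); returns probabilities and a binary sample."""
    p_h = sigmoid(b + W @ v)
    return p_h, (rng.random(p_h.shape) < p_h).astype(float)

def sample_v_given_h(h, W, c, rng):
    """p(v_i = 1 | h) = sigmoid(c_i + sum_j h_j W_ji); returns probabilities and a binary sample."""
    p_v = sigmoid(c + W.T @ h)
    return p_v, (rng.random(p_v.shape) < p_v).astype(float)
```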
restricted boltzmann machines (rbms)
The marginal probability assigned to a visible vector, v, is
given by (8):
$$p(v) = \sum_{h} p(v, h) = \frac{1}{Z}\sum_{h} e^{-E(v,h)}\,. \qquad (8)$$
Hence, given a specific training vector v, its probability can be
raised by adjusting the weights and the biases so as to
lower the energy of that particular vector while raising the
energy of all the others.
26
restricted boltzmann machines (rbms)
To this end, we can perform stochastic gradient ascent on the
log-likelihood of the training data vectors, using (9):
$$\frac{\partial \log p(v)}{\partial \theta} = \underbrace{-\sum_{h} p(h \mid v)\,\frac{\partial E(v,h)}{\partial \theta}}_{\text{positive phase}} + \underbrace{\sum_{v,h} p(v,h)\,\frac{\partial E(v,h)}{\partial \theta}}_{\text{negative phase}} \qquad (9)$$
27
training an rbm
training an rbm
The learning rule for performing stochastic steepest ascent in
the log probability of the training data:
$$\frac{\partial \log p(v)}{\partial W_{ji}} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty \qquad (10)$$
where $\langle\cdot\rangle_0$ denotes expectations under the data distribution
($p_0 = p(h \mid v)$) and $\langle\cdot\rangle_\infty$ denotes expectations under the
model distribution
$p_\infty(v, h) = p(v, h)$ [Roux and Bengio, 2008].
[Figure: RBM bipartite graph — visible units v1 … vI with bias, hidden units h1 … hJ with bias (encoder/decoder directions)]
29
mcmc using alternating gibbs sampling
[Figure: Gibbs chain, step 0 — clamp the data on the visible units, v(0) = x, and sample h(0) from p(h_j = 1 | v) = σ(b_j + Σ_{i=1}^{I} v_i W_{ji}); this pair gives ⟨v_i h_j⟩_0]
30
mcmc using alternating gibbs sampling
[Figure: Gibbs chain — from h(0), reconstruct v(1) using p(v_i = 1 | h) = σ(c_i + Σ_{j=1}^{J} h_j W_{ji})]
31
mcmc using alternating gibbs sampling
[Figure: Gibbs chain — from v(1), sample h(1) using p(h_j = 1 | v) = σ(b_j + Σ_{i=1}^{I} v_i W_{ji})]
32
mcmc using alternating gibbs sampling
[Figure: Gibbs chain — from h(1), reconstruct the next visible state using p(v_i = 1 | h) = σ(c_i + Σ_{j=1}^{J} h_j W_{ji})]
33
mcmc using alternating gibbs sampling
[Figure: Gibbs chain run to equilibrium — v(0) = x → h(0) → v(1) → h(1) → v(2) → h(2) → … → v(∞), h(∞); ⟨v_i h_j⟩_0 is measured at the start of the chain and ⟨v_i h_j⟩_∞ at the end]
34
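Alternating Gibbs sampling is just a loop over the two samplers sketched earlier (`sample_h_given_v`, `sample_v_given_h`); run long enough it would, in principle, yield samples for ⟨v_i h_j⟩_∞, which is exactly the expectation that CD-k (next slides) truncates after k steps.

```python
def alternating_gibbs(v0, W, b, c, rng, steps=1000):
    """Start from a data vector v0 and alternate h ~ p(h|v), v ~ p(v|h)."""
    v = v0
    _, h = sample_h_given_v(v, W, b, rng)       # h(0): all hidden units updated in parallel
    for _ in range(steps):
        _, v = sample_v_given_h(h, W, c, rng)   # reconstruct the visible units
        _, h = sample_h_given_v(v, W, b, rng)   # resample the hidden units
    return v, h
```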
contrastive divergence algorithm
contrastive divergence (cd–k)
∙ Running the Gibbs chain to equilibrium to estimate $\langle\cdot\rangle_\infty$ is
impractical; to solve this problem, Hinton proposed the Contrastive
Divergence algorithm.
∙ CD–k replaces $\langle\cdot\rangle_\infty$ by $\langle\cdot\rangle_k$ for small values of k:
$$\Delta W_{ji} = \eta\big(\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k\big) \qquad (11)$$
36
contrastive divergence (cd–k)
∙ v(0) ← x
∙ Compute the binary (feature) states of the hidden units, h(0), using v(0)
∙ for n ← 1 to k
 ∙ Compute the “reconstruction” states of the visible units, v(n), using h(n−1)
 ∙ Compute the “reconstruction” states of the hidden units, h(n), using v(n)
∙ end for
∙ Update the weights and biases, according to:
$$\Delta W_{ji} = \eta\big(\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k\big) \qquad (12)$$
$$\Delta b_j = \eta\big(\langle h_j \rangle_0 - \langle h_j \rangle_k\big) \qquad (13)$$
$$\Delta c_i = \eta\big(\langle v_i \rangle_0 - \langle v_i \rangle_k\big) \qquad (14)$$
37
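Putting equations (12)–(14) together, below is a hedged NumPy sketch of a single CD-k update for one training vector; the learning rate η, the value of k, and the use of probabilities rather than binary states in the update statistics (a common noise-reduction choice) are assumptions, not prescriptions from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(x, W, b, c, eta=0.1, k=1, rng=None):
    """One CD-k step for a single training vector x. W is J x I; b, c are hidden/visible biases."""
    rng = rng if rng is not None else np.random.default_rng()
    # positive phase: clamp the data on the visible units, v(0) = x
    v0 = x
    ph0 = sigmoid(b + W @ v0)                        # p(h_j = 1 | v(0))
    h = (rng.random(ph0.shape) < ph0).astype(float)  # binary h(0)
    # k steps of alternating Gibbs sampling ("reconstructions")
    for _ in range(k):
        pv = sigmoid(c + W.T @ h)                    # p(v_i = 1 | h)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(b + W @ v)                      # p(h_j = 1 | v(n))
        h = (rng.random(ph.shape) < ph).astype(float)
    # parameter updates, Eqs. (12)-(14): <.>_0 minus <.>_k
    W += eta * (np.outer(ph0, v0) - np.outer(ph, pv))
    b += eta * (ph0 - ph)
    c += eta * (v0 - pv)
    return W, b, c
```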
deep belief networks (dbns)
deep belief networks (dbns)
[Figure: building a DBN by stacking RBMs — first an RBM over x and h1 (p(h1|x), p(x|h1)); then an RBM over h1 and h2 (p(h2|h1), p(h1|h2)); then an RBM over h2 and h3 (p(h3|h2), p(h2|h3))]
39
deep belief networks (dbns)
∙ Start with a training vector
on the visible units
∙ Update all the hidden units
in parallel
∙ Update all the visible
units in parallel to get a
“reconstruction”
∙ Update the hidden units
again
[Figure: DBN layer stack x, h1, h2, h3 with the conditional distributions p(h1|x), p(x|h1), p(h2|h1), p(h1|h2), p(h3|h2), p(h2|h3) (as above)]
40
pre-training and fine tuning
[Figure: DBN training — greedy RBM pre-training: data → RBM with 500 hidden units → RBM with 300 hidden units → RBM with 100 hidden units → RBM with 10 hidden units; the stacked weights form the DBN model, which is then fine-tuned with BP (weights updated on the data until error < 0.001)]
41
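A sketch of the greedy layer-wise pre-training in the figure, assuming an RBM trainer like the `cd_k_update` sketched earlier and the 500–300–100–10 layer sizes shown; each RBM is trained on the hidden activations produced by the previous one, and the stacked weights would then initialize a feed-forward network that BP fine-tunes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_dbn(data, layer_sizes=(500, 300, 100, 10), epochs=10, eta=0.1, seed=0):
    """Greedy layer-wise pre-training: one RBM per layer, each fed the previous layer's activations."""
    rng = np.random.default_rng(seed)
    x = data                                            # N x I matrix of training vectors
    stack = []                                          # (W, b, c) for each RBM layer
    for n_hidden in layer_sizes:
        n_visible = x.shape[1]
        W = rng.normal(0.0, 0.01, size=(n_hidden, n_visible))
        b, c = np.zeros(n_hidden), np.zeros(n_visible)
        for _ in range(epochs):
            for v in x:                                 # plain CD-1, one vector at a time
                W, b, c = cd_k_update(v, W, b, c, eta=eta, k=1, rng=rng)
        stack.append((W, b, c))
        x = sigmoid(x @ W.T + b)                        # activations become the next RBM's "data"
    return stack                                        # weights to be fine-tuned with BP
```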
deep belief networks (dbns)
42
practical considerations
weights initialization
44
deep belief networks (dbns) - adaptive learning rate size
$$\eta_{ji} = \begin{cases} u\,\eta_{ji}^{(\mathrm{old})} & \text{if } \big(\langle v_i h_j\rangle_0 - \langle v_i h_j\rangle_k\big)\big(\langle v_i h_j\rangle_0^{(\mathrm{old})} - \langle v_i h_j\rangle_k^{(\mathrm{old})}\big) > 0 \\ d\,\eta_{ji}^{(\mathrm{old})} & \text{if } \big(\langle v_i h_j\rangle_0 - \langle v_i h_j\rangle_k\big)\big(\langle v_i h_j\rangle_0^{(\mathrm{old})} - \langle v_i h_j\rangle_k^{(\mathrm{old})}\big) < 0 \end{cases}$$
Lopes et al., Towards adaptive learning with improved convergence of DBNs on GPUs, Pattern Recognition, 2014.
45
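The rule above increases a weight's individual step size when the current and previous CD statistics ⟨v_i h_j⟩_0 − ⟨v_i h_j⟩_k agree in sign and decreases it when they disagree; a minimal sketch follows, where the factors u = 1.2 and d = 0.8 are illustrative values rather than the ones used in [Lopes et al., 2014].

```python
import numpy as np

def adapt_step_sizes(eta, cd_stat, cd_stat_old, u=1.2, d=0.8):
    """Per-weight adaptive learning rate.
    cd_stat[j, i] = <v_i h_j>_0 - <v_i h_j>_k for the current update,
    cd_stat_old for the previous one; eta has the same J x I shape."""
    agreement = cd_stat * cd_stat_old
    eta = np.where(agreement > 0, eta * u, eta)   # same sign: grow eta_ji by u
    eta = np.where(agreement < 0, eta * d, eta)   # opposite sign: shrink eta_ji by d
    return eta                                    # left unchanged where the product is zero
```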
adaptive step size
[Figure: three panels (α = 0.1, 0.4, 0.7) plotting reconstruction RMSE against training epoch (0–1000), comparing the adaptive step size with fixed γ = 0.1, 0.4, 0.7. Caption: Average reconstruction error (RMSE).]
46
convergence results (α = 0.1)
[Figure: training images and their reconstructions after 50, 100, 250, 500, 750 and 1000 epochs, comparing the adaptive step size with a fixed (optimized) learning rate η = 0.4]
47
deep models characteristics
deep models characteristics
∙ Biological Plausibility
∙ DBNs are effective in a wide range of ML problems.
∙ Creating a Deep Belief Network (DBN) model is a
time-consuming and computationally expensive task that
involves training several Restricted Boltzmann Machines
(RBMs), which requires considerable effort.
∙ An adaptive step-size procedure for tuning the learning
rate has been incorporated into the learning model, with
excellent results.
∙ Graphics Processing Units (GPUs) can significantly reduce
the convergence time of the data-intensive tasks in DBNs.
49
Bengio, Y. (2009).
Learning deep architectures for AI.
Foundations and Trends in Machine Learning, 2(1):1–127.
Carreira-Perpiñán, M. A. and Hinton, G. E. (2005).
On contrastive divergence learning.
In Proceedings of the 10th International Workshop on
Artificial Intelligence and Statistics (AISTATS 2005), pages
33–40.
Hinton, G. E. (2010).
A practical guide to training restricted Boltzmann
machines.
Technical report, Department of Computer Science,
University of Toronto.
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and
Bengio, Y. (2007).
49
An empirical evaluation of deep architectures on
problems with many factors of variation.
In Proceedings of the 24th international conference on
Machine learning (ICML 2007), pages 473–480. ACM.
Roux, N. L. and Bengio, Y. (2008).
Representational power of restricted Boltzmann
machines and deep belief networks.
Neural Computation, 20(6):1631–1649.
Roux, N. L. and Bengio, Y. (2010).
Deep belief networks are compact universal
approximators.
Neural Computation, 22(8):2192–2207.
50
Questions?
50
deep learning
Algorithms and Applications
Bernardete Ribeiro, bribeiro@dei.uc.pt
June 24, 2015
University of Coimbra, Portugal
INIT/AERFAI Summer School on Machine Learning, Benicassim 22-26 June 2015