deep learning
Algorithms and Applications
Bernardete Ribeiro, bribeiro@dei.uc.pt
University of Coimbra, Portugal
INIT/AERFAI Summer School on Machine Learning, Benicassim 22-26 June 2015
III - Deep Learning Algorithms
1
elements 3: deep neural networks
outline
∙ Learning in Deep Neural Networks
∙ Deep Learning: Evolution Timeline
∙ Deep Architectures
∙ Restricted Boltzmann Machines (RBMs)
∙ Deep Belief Networks (DBNs)
∙ Deep Models Overall Characteristics
3
learning in deep neural networks
learning in deep neural networks
1. No general learning algorithm (no-free-lunch theorem,
Wolpert 1996)
2. Learning algorithms for specific tasks - perception, control,
prediction, planning, reasoning, language understanding
3. Limitations of backpropagation (BP) - local minima, optimization
challenges for non-convex objective functions
4. Hinton’s deep belief networks (DBNs) as a stack of RBMs
5. LeCun’s energy-based learning for DBNs
5
deep learning: evolution timeline
1. Perceptron [Frank Rosenblatt, 1959]
2. Neocognitron [K Fukushima, 1980]
3. Convolutional Neural Network (CNN) [LeCun, 1989]
4. Multi-level Hierarchy Networks [Jürgen Schmidhuber, 1992]
5. Deep Belief Networks (DBNs) as a stack of RBMs [Geoffrey
Hinton, 2006]
6
deep architectures
from brain-like computing to deep learning
∙ New empirical and theoretical results have brought deep
architectures into the focus of Machine Learning (ML)
researchers [Larochelle et al., 2007].
∙ Theoretical results suggest that deep architectures are
fundamental for learning the kind of complicated, brain-like
functions that can represent high-level abstractions (e.g.
vision, speech, language) [Bengio, 2009]
8
deep concepts main idea
9
deep neural networks
∙ Convolutional Neural Networks (CNNs) [LeCun et al., 1989]
∙ Deep Belief Networks (DBNs) [Hinton et al., 2006]
∙ AutoEncoders (AEs) [Bengio et al., NIPS 2006]
∙ Sparse Autoencoders [Ranzato et al., NIPS 2006]
10
convolutional neural networks (cnns)
∙ A Convolutional Neural Network consists of two basic
operations:
∙ convolution
∙ pooling
∙ Convolutional and pooling layers
are arranged alternately until
high-level features are obtained
∙ Several feature maps in each
convolutional layer
∙ Weights in the same map are
shared
[Figure: CNN architecture — input, convolutional layers C1, C3 alternating with subsampling layers S2, S4, followed by an NN stage; from I. Arel, D. Rose & T. Karnowski, Deep Machine Learning — A New Frontier in Artificial Intelligence Research, IEEE CIM, 2010]
11
convolutional neural networks (cnns)
∙ Convolution: suppose the size of the layer is d × d
and the size of the receptive fields is r × r, and let γ and x
denote respectively the values of the convolutional
layer and the previous layer:
$$\gamma_{ij} = g\Big(\sum_{m=1}^{r}\sum_{n=1}^{r} x_{i+m-1,\,j+n-1}\, w_{m,n} + b\Big), \qquad i, j = 1, \dots, (d - r + 1)$$
where g is a nonlinear function.
∙ Pooling follows convolution to reduce the
dimensionality of the features and to introduce
translational invariance into the CNN.
12
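To make the two operations concrete, here is a minimal NumPy sketch (not from the slides) of a single feature map: a valid convolution implementing the γ_ij formula above, with a logistic nonlinearity as one possible choice of g, followed by 2 × 2 max pooling. The function names and the 5 × 5 receptive field are illustrative assumptions.

```python
import numpy as np

def conv2d_valid(x, w, b):
    """One feature map: gamma_ij = g(sum_{m,n} x[i+m-1, j+n-1] * w[m, n] + b)."""
    d, r = x.shape[0], w.shape[0]
    out = np.zeros((d - r + 1, d - r + 1))
    for i in range(d - r + 1):
        for j in range(d - r + 1):
            out[i, j] = np.sum(x[i:i + r, j:j + r] * w) + b
    return 1.0 / (1.0 + np.exp(-out))        # g: logistic nonlinearity (one common choice)

def max_pool2x2(fm):
    """2 x 2 max pooling: reduces feature dimensionality, adds translational invariance."""
    h, w = fm.shape[0] // 2, fm.shape[1] // 2
    return fm[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))

x = np.random.rand(28, 28)               # e.g. a 28 x 28 input image
w = np.random.randn(5, 5) * 0.1          # a 5 x 5 receptive field (weights shared across the map)
feature_map = conv2d_valid(x, w, b=0.0)  # 24 x 24
pooled = max_pool2x2(feature_map)        # 12 x 12
```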
deep belief networks (dbns)
∙ Probabilistic generative models, in
contrast with the discriminative
nature of other NNs
∙ Generative models provide a joint
probability distribution over data
and labels
∙ Unsupervised greedy layer-wise
pre-training followed by final
fine-tuning
[Figure: DBN on a 28 × 28 pixel image — a stack of RBM layers (visible/hidden pairs) topped by a detection layer whose top-level units connect the labels with the hidden units; based on I. Arel, D. Rose & T. Karnowski, Deep Machine Learning — A New Frontier in Artificial Intelligence Research, IEEE CIM, 2010]
13
autoencoders (aes)
∙ The auto-encoder has two
components:
∙ the encoder f (mapping x to h) and
∙ the decoder g (mapping h to r)
∙ An auto-encoder is a neural
network that tries to reconstruct
its input at its output
[Figure: auto-encoder — input x → encoder f → code h → decoder g → reconstruction r; based on Y. Bengio, I. Goodfellow and A. Courville, Deep Learning, an MIT Press book (in preparation), www.iro.umontreal.ca/~bengioy/dlbook]
14
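As a small illustration of the encoder/decoder pair, the sketch below builds a one-hidden-layer auto-encoder with logistic units and tied weights (the tied-weight choice and the layer sizes are assumptions, not from the slide); training would adjust W, b_h and b_v, e.g. by backpropagation, to reduce the reconstruction error.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden = 784, 100
W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))  # encoder weights; decoder reuses W.T (tied)
b_h = np.zeros(n_hidden)                                # code (hidden) bias
b_v = np.zeros(n_visible)                               # reconstruction (visible) bias

x = rng.random(n_visible)                      # input x
h = sigmoid(x @ W + b_h)                       # encoder f: x -> code h
r = sigmoid(h @ W.T + b_v)                     # decoder g: h -> reconstruction r
reconstruction_error = np.mean((x - r) ** 2)   # quantity a training procedure would minimize
```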
deep architectures versus shallow architectures
∙ Deep architectures can be exponentially more efficient
than shallow architectures [Roux and Bengio, 2010].
∙ Functions that can be compactly represented with a Neural
Network (NN) of depth d, may require an exponential number
of computational elements for a network with depth d − 1
[Bengio, 2009].
∙ Since the number of computational elements depends on
the number of training samples available, using shallow
architectures may result in models with poor
generalization [Bengio, 2009].
∙ As a result, deep architecture models tend to outperform
shallow models such as SVMs [Larochelle et al., 2007].
15
Restricted Boltzmann Machines
Deep Belief Networks
16
restricted boltzmann machines
restricted boltzmann machines (rbms)
[Figure: RBM — visible units v1 … vI and hidden units h1 … hJ, each layer with a bias unit; the hidden→visible direction acts as a decoder and the visible→hidden direction as an encoder]
18
restricted boltzmann machines (rbms)
∙ Unsupervised
∙ Find complex regularities in
training data
∙ Bipartite Graph
∙ visible, hidden layer
∙ Binary stochastic units
∙ On/Off with probability
∙ 1 Iteration
∙ Update Hidden Units
∙ Reconstruct Visible Units
∙ Maximum Likelihood of
training data
[Figure: RBM bipartite graph — visible units v1 … vI with bias, hidden units h1 … hJ with bias (encoder direction)]
19
restricted boltzmann machines (rbms)
∙ Training goal: the most probable
reproduction of the data
∙ unsupervised data
∙ find the latent factors of the
data set
∙ Adjust the weights to maximize
the probability of the input data
[Figure: RBM bipartite graph — visible units v1 … vI with bias, hidden units h1 … hJ with bias (encoder direction)]
20
restricted boltzmann machines (rbms)
Given an observed state, the energy of the joint configuration
of the visible units and hidden units (v, h) is given by:
$$E(v, h) = -\sum_{i=1}^{I} c_i v_i - \sum_{j=1}^{J} b_j h_j - \sum_{j=1}^{J}\sum_{i=1}^{I} W_{ji} v_i h_j \,, \qquad (1)$$
where W is the matrix of weights, and b and c are the biases of
the hidden and visible layers, respectively.
[Figure: RBM bipartite graph — visible units v1 … vI with bias, hidden units h1 … hJ with bias (encoder/decoder directions)]
21
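Equation (1) maps directly onto a few lines of NumPy; the sketch below assumes binary vectors v (length I) and h (length J) and a J × I weight matrix indexed as W[j, i], matching the W_ji indexing above.

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """E(v, h) = - sum_i c_i v_i - sum_j b_j h_j - sum_{j,i} W_ji v_i h_j   (Eq. 1)."""
    return -np.dot(c, v) - np.dot(b, h) - np.dot(h, W @ v)

# toy configuration just to exercise the function
rng = np.random.default_rng(1)
I, J = 6, 4
W = rng.normal(0.0, 0.01, size=(J, I))    # weights W[j, i]
b, c = np.zeros(J), np.zeros(I)           # hidden and visible biases
v = rng.integers(0, 2, I).astype(float)
h = rng.integers(0, 2, J).astype(float)
print(rbm_energy(v, h, W, b, c))
```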
restricted boltzmann machines (rbms)
The Restricted Boltzmann Machine (RBM) assigns a
probability to each configuration (v, h), using:
$$p(v, h) = \frac{e^{-E(v,h)}}{Z}\,, \qquad (2)$$
where Z is a normalization constant called the partition function,
obtained by summing $e^{-E(v,h)}$ over all possible (v, h)
configurations [Bengio, 2009, Hinton, 2010,
Carreira-Perpiñán and Hinton, 2005]:
$$Z = \sum_{v,h} e^{-E(v,h)}\,. \qquad (3)$$
22
restricted boltzmann machines (rbms)
Since there are no connections between any two units within
the same layer, given a particular random input
configuration, v, all the hidden units are independent of each
other and the probability of h given v becomes:
$$p(h \mid v) = \prod_{j} p(h_j = 1 \mid v)\,, \qquad (4)$$
where
$$p(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_{i=1}^{I} v_i W_{ji}\Big)\,. \qquad (5)$$
23
restricted boltzmann machines (rbms)
Similarly given a specific hidden state, h, the probability of v
given h is obtained by (6):
$$p(v \mid h) = \prod_{i} p(v_i = 1 \mid h)\,, \qquad (6)$$
where:
$$p(v_i = 1 \mid h) = \sigma\Big(c_i + \sum_{j=1}^{J} h_j W_{ji}\Big)\,. \qquad (7)$$
24
restricted boltzmann machines (rbms)
Given a random training vector v, the state of a given hidden
unit j is set to 1 with probability:
$$p(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i W_{ij}\Big)$$
Similarly:
$$p(v_i = 1 \mid h) = \sigma\Big(c_i + \sum_j h_j W_{ij}\Big)$$
where σ(x) is the sigmoid squashing function $\frac{1}{1+e^{-x}}$.
25
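These two conditionals are all an implementation needs in order to sample the binary stochastic units; a minimal sketch follows, assuming W is stored as a J × I matrix so that p(h_j = 1 | v) becomes `sigmoid(b + W @ v)` and p(v_i = 1 | h) becomes `sigmoid(c + W.T @ h)`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, W, b, rng):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i W_ji); returns probabilities and a binary sample."""
    p_h = sigmoid(b + W @ v)
    return p_h, (rng.random(p_h.shape) < p_h).astype(float)

def sample_v_given_h(h, W, c, rng):
    """p(v_i = 1 | h) = sigmoid(c_i + sum_j h_j W_ji); returns probabilities and a binary sample."""
    p_v = sigmoid(c + W.T @ h)
    return p_v, (rng.random(p_v.shape) < p_v).astype(float)
```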
restricted boltzmann machines (rbms)
The marginal probability assigned to a visible vector, v, is
given by (8):
$$p(v) = \sum_{h} p(v, h) = \frac{1}{Z}\sum_{h} e^{-E(v,h)}\,. \qquad (8)$$
Hence, given a specific training vector v, its probability can be
raised by adjusting the weights and the biases so as to
lower the energy of that particular vector while raising the
energy of all the others.
26
restricted boltzmann machines (rbms)
To this end, we can perform stochastic gradient ascent on the
log-likelihood of the training data vectors, using (9):
$$\frac{\partial \log p(v)}{\partial \theta} = \underbrace{-\sum_{h} p(h \mid v)\,\frac{\partial E(v,h)}{\partial \theta}}_{\text{positive phase}} + \underbrace{\sum_{v,h} p(v,h)\,\frac{\partial E(v,h)}{\partial \theta}}_{\text{negative phase}} \qquad (9)$$
27
training an rbm
training an rbm
The learning rule for performing stochastic steepest ascent in
the log probability of the training data:
$$\frac{\partial \log p(v)}{\partial W_{ji}} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty \qquad (10)$$
where $\langle\cdot\rangle_0$ denotes expectations under the data distribution
($p_0 = p(h \mid v)$) and $\langle\cdot\rangle_\infty$ denotes expectations under the
model distribution
$p_\infty(v, h) = p(v, h)$ [Roux and Bengio, 2008].
[Figure: RBM bipartite graph — visible units v1 … vI with bias, hidden units h1 … hJ with bias (encoder/decoder directions)]
29
mcmc using alternating gibbs sampling
[Figure: Gibbs chain, step 0 — clamp the data on the visible units, v(0) = x, and sample h(0) from p(h_j = 1 | v) = σ(b_j + Σ_{i=1}^{I} v_i W_{ji}); this pair gives ⟨v_i h_j⟩_0]
30
mcmc using alternating gibbs sampling
[Figure: Gibbs chain — from h(0), reconstruct v(1) using p(v_i = 1 | h) = σ(c_i + Σ_{j=1}^{J} h_j W_{ji})]
31
mcmc using alternating gibbs sampling
[Figure: Gibbs chain — from v(1), sample h(1) using p(h_j = 1 | v) = σ(b_j + Σ_{i=1}^{I} v_i W_{ji})]
32
mcmc using alternating gibbs sampling
[Figure: Gibbs chain — from h(1), reconstruct the next visible state using p(v_i = 1 | h) = σ(c_i + Σ_{j=1}^{J} h_j W_{ji})]
33
mcmc using alternating gibbs sampling
[Figure: Gibbs chain run to equilibrium — v(0) = x → h(0) → v(1) → h(1) → v(2) → h(2) → … → v(∞), h(∞); ⟨v_i h_j⟩_0 is measured at the start of the chain and ⟨v_i h_j⟩_∞ at the end]
34
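Alternating Gibbs sampling is just a loop over the two samplers sketched earlier (`sample_h_given_v`, `sample_v_given_h`); run long enough it would, in principle, yield samples for ⟨v_i h_j⟩_∞, which is exactly the expectation that CD-k (next slides) truncates after k steps.

```python
def alternating_gibbs(v0, W, b, c, rng, steps=1000):
    """Start from a data vector v0 and alternate h ~ p(h|v), v ~ p(v|h)."""
    v = v0
    _, h = sample_h_given_v(v, W, b, rng)       # h(0): all hidden units updated in parallel
    for _ in range(steps):
        _, v = sample_v_given_h(h, W, c, rng)   # reconstruct the visible units
        _, h = sample_h_given_v(v, W, b, rng)   # resample the hidden units
    return v, h
```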
contrastive divergence algorithm
contrastive divergence (cd–k)
∙ Running the Gibbs chain to equilibrium to estimate $\langle\cdot\rangle_\infty$ is
impractical; to solve this problem, Hinton proposed the Contrastive
Divergence algorithm.
∙ CD–k replaces $\langle\cdot\rangle_\infty$ by $\langle\cdot\rangle_k$ for small values of k:
$$\Delta W_{ji} = \eta\big(\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k\big) \qquad (11)$$
36
contrastive divergence (cd–k)
∙ v(0) ← x
∙ Compute the binary (feature) states of the hidden units, h(0), using v(0)
∙ for n ← 1 to k
 ∙ Compute the “reconstruction” states of the visible units, v(n), using h(n−1)
 ∙ Compute the “reconstruction” states of the hidden units, h(n), using v(n)
∙ end for
∙ Update the weights and biases, according to:
$$\Delta W_{ji} = \eta\big(\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k\big) \qquad (12)$$
$$\Delta b_j = \eta\big(\langle h_j \rangle_0 - \langle h_j \rangle_k\big) \qquad (13)$$
$$\Delta c_i = \eta\big(\langle v_i \rangle_0 - \langle v_i \rangle_k\big) \qquad (14)$$
37
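Putting equations (12)–(14) together, below is a hedged NumPy sketch of a single CD-k update for one training vector; the learning rate η, the value of k, and the use of probabilities rather than binary states in the update statistics (a common noise-reduction choice) are assumptions, not prescriptions from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(x, W, b, c, eta=0.1, k=1, rng=None):
    """One CD-k step for a single training vector x. W is J x I; b, c are hidden/visible biases."""
    rng = rng if rng is not None else np.random.default_rng()
    # positive phase: clamp the data on the visible units, v(0) = x
    v0 = x
    ph0 = sigmoid(b + W @ v0)                        # p(h_j = 1 | v(0))
    h = (rng.random(ph0.shape) < ph0).astype(float)  # binary h(0)
    # k steps of alternating Gibbs sampling ("reconstructions")
    for _ in range(k):
        pv = sigmoid(c + W.T @ h)                    # p(v_i = 1 | h)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(b + W @ v)                      # p(h_j = 1 | v(n))
        h = (rng.random(ph.shape) < ph).astype(float)
    # parameter updates, Eqs. (12)-(14): <.>_0 minus <.>_k
    W += eta * (np.outer(ph0, v0) - np.outer(ph, pv))
    b += eta * (ph0 - ph)
    c += eta * (v0 - pv)
    return W, b, c
```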
deep belief networks (dbns)
deep belief networks (dbns)
[Figure: building a DBN by stacking RBMs — first an RBM over x and h1 (p(h1|x), p(x|h1)); then an RBM over h1 and h2 (p(h2|h1), p(h1|h2)); then an RBM over h2 and h3 (p(h3|h2), p(h2|h3))]
39
deep belief networks (dbns)
∙ Start with a training vector
on the visible units
∙ Update all the hidden units
in parallel
∙ Update all the visible
units in parallel to get a
“reconstruction”
∙ Update the hidden units
again
[Figure: DBN layer stack x, h1, h2, h3 with the conditional distributions p(h1|x), p(x|h1), p(h2|h1), p(h1|h2), p(h3|h2), p(h2|h3) (as above)]
40
pre-training and fine tuning
[Figure: DBN training — greedy RBM pre-training: data → RBM with 500 hidden units → RBM with 300 hidden units → RBM with 100 hidden units → RBM with 10 hidden units; the stacked weights form the DBN model, which is then fine-tuned with BP (weights updated on the data until error < 0.001)]
41
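A sketch of the greedy layer-wise pre-training in the figure, assuming an RBM trainer like the `cd_k_update` sketched earlier and the 500–300–100–10 layer sizes shown; each RBM is trained on the hidden activations produced by the previous one, and the stacked weights would then initialize a feed-forward network that BP fine-tunes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_dbn(data, layer_sizes=(500, 300, 100, 10), epochs=10, eta=0.1, seed=0):
    """Greedy layer-wise pre-training: one RBM per layer, each fed the previous layer's activations."""
    rng = np.random.default_rng(seed)
    x = data                                            # N x I matrix of training vectors
    stack = []                                          # (W, b, c) for each RBM layer
    for n_hidden in layer_sizes:
        n_visible = x.shape[1]
        W = rng.normal(0.0, 0.01, size=(n_hidden, n_visible))
        b, c = np.zeros(n_hidden), np.zeros(n_visible)
        for _ in range(epochs):
            for v in x:                                 # plain CD-1, one vector at a time
                W, b, c = cd_k_update(v, W, b, c, eta=eta, k=1, rng=rng)
        stack.append((W, b, c))
        x = sigmoid(x @ W.T + b)                        # activations become the next RBM's "data"
    return stack                                        # weights to be fine-tuned with BP
```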
deep belief networks (dbns)
42
practical considerations
weights initialization
44
deep belief networks (dbns) - adaptive learning rate size
$$\eta_{ji} = \begin{cases} u\,\eta_{ji}^{(\mathrm{old})} & \text{if } \big(\langle v_i h_j\rangle_0 - \langle v_i h_j\rangle_k\big)\big(\langle v_i h_j\rangle_0^{(\mathrm{old})} - \langle v_i h_j\rangle_k^{(\mathrm{old})}\big) > 0 \\ d\,\eta_{ji}^{(\mathrm{old})} & \text{if } \big(\langle v_i h_j\rangle_0 - \langle v_i h_j\rangle_k\big)\big(\langle v_i h_j\rangle_0^{(\mathrm{old})} - \langle v_i h_j\rangle_k^{(\mathrm{old})}\big) < 0 \end{cases}$$
Lopes et al., Towards adaptive learning with improved convergence of DBNs on GPUs, Pattern Recognition, 2014.
45
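The rule above increases a weight's individual step size when the current and previous CD statistics ⟨v_i h_j⟩_0 − ⟨v_i h_j⟩_k agree in sign and decreases it when they disagree; a minimal sketch follows, where the factors u = 1.2 and d = 0.8 are illustrative values rather than the ones used in [Lopes et al., 2014].

```python
import numpy as np

def adapt_step_sizes(eta, cd_stat, cd_stat_old, u=1.2, d=0.8):
    """Per-weight adaptive learning rate.
    cd_stat[j, i] = <v_i h_j>_0 - <v_i h_j>_k for the current update,
    cd_stat_old for the previous one; eta has the same J x I shape."""
    agreement = cd_stat * cd_stat_old
    eta = np.where(agreement > 0, eta * u, eta)   # same sign: grow eta_ji by u
    eta = np.where(agreement < 0, eta * d, eta)   # opposite sign: shrink eta_ji by d
    return eta                                    # left unchanged where the product is zero
```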
adaptive step size
[Figure: three panels (α = 0.1, 0.4, 0.7) plotting reconstruction RMSE against training epoch (0–1000), comparing the adaptive step size with fixed γ = 0.1, 0.4, 0.7. Caption: Average reconstruction error (RMSE).]
46
convergence results (α = 0.1)
[Figure: training images and their reconstructions after 50, 100, 250, 500, 750 and 1000 epochs, comparing the adaptive step size with a fixed (optimized) learning rate η = 0.4]
47
deep models characteristics
deep models characteristics
∙ Biological Plausibility
∙ DBNs are effective in a wide range of ML problems.
∙ Creating a Deep Belief Network (DBN) model is a
time-consuming and computationally expensive task that
involves training several Restricted Boltzmann Machines
(RBMs), which requires considerable effort.
∙ An adaptive step-size procedure for tuning the learning
rate has been incorporated into the learning model, with
excellent results.
∙ Graphics Processing Units (GPUs) can significantly reduce
the convergence time of the data-intensive tasks in DBNs.
49
Bengio, Y. (2009).
Learning deep architectures for AI.
Foundations and Trends in Machine Learning, 2(1):1–127.
Carreira-Perpiñán, M. A. and Hinton, G. E. (2005).
On contrastive divergence learning.
In Proceedings of the 10th International Workshop on
Artificial Intelligence and Statistics (AISTATS 2005), pages
33–40.
Hinton, G. E. (2010).
A practical guide to training restricted Boltzmann
machines.
Technical report, Department of Computer Science,
University of Toronto.
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and
Bengio, Y. (2007).
49
An empirical evaluation of deep architectures on
problems with many factors of variation.
In Proceedings of the 24th international conference on
Machine learning (ICML 2007), pages 473–480. ACM.
Roux, N. L. and Bengio, Y. (2008).
Representational power of restricted Boltzmann
machines and deep belief networks.
Neural Computation, 20(6):1631–1649.
Roux, N. L. and Bengio, Y. (2010).
Deep belief networks are compact universal
approximators.
Neural Computation, 22(8):2192–2207.
50
Questions?
50
deep learning
Algorithms and Applications
Bernardete Ribeiro, bribeiro@dei.uc.pt
June 24, 2015
University of Coimbra, Portugal
INIT/AERFAI Summer School on Machine Learning, Benicassim 22-26 June 2015