[DL Reading Group] Recent Advances in Autoencoder-Based Representation Learning

Recent Advances in Autoencoder-Based Representation Learning
Presenter: Tatsuya Matsushima (@__tmats__), Matsuo Lab

Recent Advances in Autoencoder-Based Representation Learning
• https://arxiv.org/abs/1812.05069 (Submitted on 12 Dec 2018)
• Michael Tschannen, Olivier Bachem, Mario Lucic
• ETH Zurich, Google Brain
• NeurIPS 2018 Workshop (Bayesian Deep Learning)
• http://bayesiandeeplearning.org/
• 19 3 accept
•
•
• ( …)
※

TL; DR
•
•
• meta-prior
• ( )
• Rate-Distortion

• (SRL)
• [DL ]  
https://www.slideshare.net/DeepLearningJP2016/dl-124128933
• SRL VAE VAE

VAE
Variational Autoencoder (VAE) [Kingma+ 2014a]
•
• KL (ELBO)
• ELBO (VAE loss )
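$$\mathcal{L}_{\mathrm{VAE}}(\theta, \phi) = \mathbb{E}_{\hat{p}(x)}\left[\mathbb{E}_{q_\phi(z \mid x)}\left[-\log p_\theta(x \mid z)\right]\right] + \mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)\right]$$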
※ VAE ELBO
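$$\mathbb{E}_{\hat{p}(x)}\left[-\log p_\theta(x)\right] = \mathcal{L}_{\mathrm{VAE}}(\theta, \phi) - \mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)\right]$$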
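Because the KL term above is non-negative, $\mathcal{L}_{\mathrm{VAE}}$ is an upper bound on $\mathbb{E}_{\hat{p}(x)}[-\log p_\theta(x)]$; equivalently, $-\mathcal{L}_{\mathrm{VAE}}$ is the ELBO.
$\hat{p}(x)$: empirical data distribution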

VAE
VAE loss
• 1 reparametrization trick
• 2 closed-form
• ,
closed-form
•
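$$\mathcal{L}_{\mathrm{VAE}}(\theta, \phi) = \mathbb{E}_{\hat{p}(x)}\left[\mathbb{E}_{q_\phi(z \mid x)}\left[-\log p_\theta(x \mid z)\right]\right] + \mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)\right]$$
$$z^{(i)} \sim q_\phi(z \mid x^{(i)})$$
$$q_\phi(z \mid x) = \mathcal{N}\left(\mu_\phi(x), \mathrm{diag}(\sigma_\phi(x))\right), \qquad p(z) = \mathcal{N}(0, I)$$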
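The two ingredients listed on this slide (sampling via the reparametrization trick, and a closed-form KL term for a diagonal Gaussian posterior against a standard normal prior) can be sketched in NumPy as follows. This is a minimal illustration, not the presenter's code. It assumes that `sigma` holds per-dimension standard deviations, so the covariance is `diag(sigma**2)`; the arrays `mu` and `sigma` are hypothetical encoder outputs.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so z is a deterministic
    function of (mu, sigma) and of noise that does not depend on them."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """Closed-form D_KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims."""
    var = sigma ** 2
    return 0.5 * np.sum(mu ** 2 + var - 1.0 - np.log(var), axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[1.0, 0.0], [0.0, 0.0]])     # hypothetical encoder means for two inputs
sigma = np.array([[1.0, 1.0], [1.0, 1.0]])  # hypothetical encoder standard deviations
z = reparameterize(mu, sigma, rng)          # one latent sample per input
kl = kl_diag_gaussian_to_standard_normal(mu, sigma)
```

The second input has a posterior identical to the prior, so its KL term is zero; the first differs only by a unit mean shift in one dimension.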

f-
• f-  
 
 
• KL divergence
• density-ratio trick f-
• GAN
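For a convex function $f$ with $f(1) = 0$ and distributions $p_x$, $p_y$:
$$D_f(p_x \,\|\, p_y) = \int f\left(\frac{p_x(x)}{p_y(x)}\right) p_y(x)\, dx$$
With $f(t) = t \log t$:
$$D_f(p_x \,\|\, p_y) = D_{\mathrm{KL}}(p_x \,\|\, p_y)$$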
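As a small numerical check of the definition above, the following sketch evaluates the f-divergence for discrete distributions (the integral becomes a sum) and compares the choice f(t) = t log t against a directly computed KL divergence. The probability vectors are made-up examples and are assumed to be strictly positive.

```python
import numpy as np

def f_divergence(px, py, f):
    """Discrete f-divergence: sum_x f(px(x) / py(x)) * py(x)."""
    px = np.asarray(px, dtype=float)
    py = np.asarray(py, dtype=float)
    return float(np.sum(py * f(px / py)))

def f_kl(t):
    """Generator f(t) = t log t, which satisfies f(1) = 0."""
    return t * np.log(t)

px = np.array([0.5, 0.3, 0.2])    # hypothetical distribution p_x
py = np.array([0.25, 0.25, 0.5])  # hypothetical distribution p_y
d_f = f_divergence(px, py, f_kl)
d_kl = float(np.sum(px * np.log(px / py)))  # KL computed directly
```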

GAN Density-ratio Trick KL
•
•
• 2
• Discriminator
•  
 
• i.i.d
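$c \in \{0, 1\}$, $p_x(x) = p(x \mid c = 1)$, $p_y(x) = p(x \mid c = 0)$
$S_\eta$: discriminator
$$\frac{p_x(x)}{p_y(x)} = \frac{p(x \mid c = 1)}{p(x \mid c = 0)} = \frac{p(c = 1 \mid x)}{p(c = 0 \mid x)} \approx \frac{S_\eta(x)}{1 - S_\eta(x)}$$
With $N$ samples $x^{(i)}$ from $p_x$:
$$D_{\mathrm{KL}}(p_x \,\|\, p_y) \approx \frac{1}{N} \sum_{i=1}^{N} \log\left(\frac{S_\eta(x^{(i)})}{1 - S_\eta(x^{(i)})}\right)$$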
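The density-ratio trick can be tried on a toy problem. The sketch below is an illustration under stated assumptions: the discriminator is a plain logistic regression fitted by gradient descent (not a neural network), both classes have the same number of samples so the class priors are equal, and the two distributions are hypothetical 1-D Gaussians N(1, 1) and N(0, 1), whose true KL divergence is 0.5.

```python
import numpy as np

def fit_logistic(x, c, lr=0.5, steps=2000):
    """Fit S(x) = sigmoid(w0 * x + w1) to labels c by gradient descent on the log loss."""
    X = np.column_stack([x, np.ones_like(x)])
    w = np.zeros(2)
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (s - c) / len(c)
    return w

def kl_density_ratio(samples_px, samples_py):
    """Estimate D_KL(px || py) as the mean of log(S / (1 - S)) over samples from px."""
    x = np.concatenate([samples_px, samples_py])
    c = np.concatenate([np.ones(len(samples_px)), np.zeros(len(samples_py))])
    w = fit_logistic(x, c)
    # For a sigmoid discriminator, log(S / (1 - S)) is exactly the logit w0 * x + w1.
    logits = w[0] * samples_px + w[1]
    return float(np.mean(logits))

rng = np.random.default_rng(0)
px_samples = rng.normal(1.0, 1.0, 5000)  # c = 1
py_samples = rng.normal(0.0, 1.0, 5000)  # c = 0
kl_est = kl_density_ratio(px_samples, py_samples)
```

The estimate is only as good as the discriminator; here the true log-ratio is linear in x, so a linear classifier is sufficient.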

Maximum Mean Discrepancy (MMD)
MMD
• embedding
• ) MMD
•
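Kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with RKHS $\mathcal{H}$, feature map $\varphi : \mathcal{X} \to \mathcal{H}$, distributions $p_x(x)$ and $p_y(x)$:
$$\mathrm{MMD}(p_x, p_y) = \left\| \mathbb{E}_{x \sim p_x}[\varphi(x)] - \mathbb{E}_{y \sim p_y}[\varphi(y)] \right\|_{\mathcal{H}}^{2}$$
For $\mathcal{X} = \mathcal{H} = \mathbb{R}^d$ and $\varphi(x) = x$:
$$\mathrm{MMD}(p_x, p_y) = \left\| \mu_{p_x} - \mu_{p_y} \right\|_2^2$$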
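A sample-based version of the MMD above can be written with kernel evaluations only. The sketch below uses the biased (V-statistic) estimator, which is an assumption of this example, and checks the special case from the slide: with a linear kernel the estimate equals the squared distance between sample means. The RBF bandwidth and the two sample sets are hypothetical.

```python
import numpy as np

def mmd_biased(x, y, kernel):
    """Biased estimate of MMD: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean())

def linear_kernel(a, b):
    """k(a, b) = <a, b>, i.e. the feature map is the identity."""
    return a @ b.T

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel with an assumed bandwidth."""
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-sq / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 2))
y = rng.normal(0.5, 1.0, size=(500, 2))
mmd_lin = mmd_biased(x, y, linear_kernel)
mean_gap = float(np.sum((x.mean(axis=0) - y.mean(axis=0)) ** 2))
mmd_rbf = mmd_biased(x, y, rbf_kernel)
```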

Meta-Prior
Meta-prior [Bengio+ 2013]
•
•
•
• But
• →meta-prior

Meta-Prior [Bengio+ 2013]
Disentanglement
•
• )
•
•
• ) ( )

Meta-Prior [Bengio+ 2013]
•
•
•
•

Meta-Prior
( )  
• meta-prior
… 
( )

Meta-Prior
• disentangle
•
• )

VAE
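$$\mathcal{L}_{\mathrm{VAE}}(\theta, \phi) + \lambda_1 \mathbb{E}_{\hat{p}(x)}\left[R_1\left(q_\phi(z \mid x)\right)\right] + \lambda_2 R_2\left(q_\phi(z)\right)$$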
Optional

VAE
• aggregate ( )
• divergence
aggregate  
( )  
qϕ(z)

Disentanglement
disentangle
•
• loss
$v$, $w$
$$x \sim p(x \mid v, w)$$
$$p(v \mid x) = \prod_j p(v_j \mid x)$$
qϕ(z|x) v

Disentanglement
Disentangle
•
• disentangle disentangle
• ( disentangle )
• [Locatello+ 2018]
•
• (a) ELBO
• (b) x z
• (c)

(a) ELBO
β-VAE [Higgins+ 2017]
• VAE Loss 
 
 
2
•
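$$\mathcal{L}_{\mathrm{VAE}}(\theta, \phi) = \mathbb{E}_{\hat{p}(x)}\left[\mathbb{E}_{q_\phi(z \mid x)}\left[-\log p_\theta(x \mid z)\right]\right] + \mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)\right]$$
$$\mathcal{L}_{\beta\text{-VAE}}(\theta, \phi) = \mathcal{L}_{\mathrm{VAE}}(\theta, \phi) + \lambda_1 \mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)\right]$$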
: [Higgins+ 2017]
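Written this way, the β-VAE loss adds λ1 extra copies of the KL term to the VAE loss, so the KL term ends up weighted by 1 + λ1. A minimal sketch, assuming the reconstruction term and the KL term have already been averaged over a batch (the function name and the numbers are hypothetical):

```python
def beta_vae_loss(recon_nll, kl, lambda1):
    """L_beta-VAE = L_VAE + lambda1 * KL, with L_VAE = recon_nll + kl.

    Equivalent to weighting the KL term by beta = 1 + lambda1.
    """
    vae_loss = recon_nll + kl
    return vae_loss + lambda1 * kl

plain_vae = beta_vae_loss(recon_nll=10.0, kl=2.0, lambda1=0.0)  # reduces to the VAE loss
heavier_kl = beta_vae_loss(recon_nll=10.0, kl=2.0, lambda1=3.0)  # KL weighted by 4
```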

(b) x z
VAE Loss 
 
2
•  
aggregate ( ) KL [Hoffman+ 2016]
• FactorVAE[Kim+ 2018]
• β-TCVAE[Chen+ 2018] InfoVAE[Zhao+ 2017a] DIP-VAE[Kumar+ 2018]
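$$\mathcal{L}_{\mathrm{VAE}}(\theta, \phi) = \mathbb{E}_{\hat{p}(x)}\left[\mathbb{E}_{q_\phi(z \mid x)}\left[-\log p_\theta(x \mid z)\right]\right] + \mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)\right]$$
$$\mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)\right] = I_{q_\phi}(x; z) + D_{\mathrm{KL}}\left(q_\phi(z) \,\|\, p(z)\right)$$
$I_{q_\phi}(x; z)$: mutual information between $x$ and $z$
$D_{\mathrm{KL}}(q_\phi(z) \,\|\, p(z))$: divergence between $q_\phi(z)$ and $p(z)$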
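The decomposition of the averaged KL term into a mutual-information term and a marginal KL term can be verified numerically on a tiny discrete example. Everything in the sketch below (two data points, two latent states, the specific probabilities) is made up for illustration; the mutual information is computed as the average KL from each q(z|x) to the aggregate posterior q(z).

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence for strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

p_hat = np.array([0.5, 0.5])            # empirical distribution over two data points
q_z_given_x = np.array([[0.9, 0.1],     # q(z | x = x1)
                        [0.2, 0.8]])    # q(z | x = x2)
prior = np.array([0.5, 0.5])            # p(z)

q_z = p_hat @ q_z_given_x               # aggregate posterior q(z) = E_x[q(z | x)]
avg_kl = sum(p_hat[i] * kl(q_z_given_x[i], prior) for i in range(2))
mutual_info = sum(p_hat[i] * kl(q_z_given_x[i], q_z) for i in range(2))
marginal_kl = kl(q_z, prior)
```

Here `avg_kl` matches `mutual_info + marginal_kl` up to floating-point error, which is the identity used to motivate penalizing only one of the two parts.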

(b) x z
Factor VAE [Kim+ 2018]
• βVAE loss  
• total correlation
 
 
• discriminator density ratio trick
• [DL ]Disentangling by Factorising 
https://www.slideshare.net/DeepLearningJP2016/dldisentangling-by-factorising
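$\mathcal{L}_{\beta\text{-VAE}}$, $D_{\mathrm{KL}}(q_\phi(z) \,\|\, p(z))$, $I_{q_\phi}(x; z)$
$$\mathrm{TC}(q_\phi(z)) = D_{\mathrm{KL}}\left(q_\phi(z) \,\Big\|\, \prod_j q_\phi(z_j)\right)$$
$$\mathcal{L}_{\mathrm{FactorVAE}}(\theta, \phi) = \mathcal{L}_{\mathrm{VAE}}(\theta, \phi) + \lambda_2\, \mathrm{TC}(q_\phi(z))$$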
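To get a feel for the total correlation term, it helps to look at a case where it has a closed form. The sketch below is not the FactorVAE estimator (which uses a discriminator and the density-ratio trick); it assumes instead that the aggregate posterior is a zero-mean Gaussian with a known covariance, for which TC equals half the difference between the sum of log marginal variances and the log-determinant of the covariance.

```python
import numpy as np

def gaussian_total_correlation(cov):
    """TC of N(0, cov): KL between the joint and the product of its marginals."""
    cov = np.asarray(cov, dtype=float)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

tc_independent = gaussian_total_correlation(np.diag([1.0, 2.0, 0.5]))  # factorized latents
rho = 0.8  # hypothetical correlation between two latent dimensions
tc_correlated = gaussian_total_correlation([[1.0, rho], [rho, 1.0]])
```

A factorized (diagonal) covariance gives zero total correlation, and any correlation between dimensions makes it positive, which is what the λ2 penalty pushes against.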

(c)
HSIC-VAE [Lopez+ 2018]
• Hilbert-Schmidt independence criterion (HSIC) [Gretton+2005]  
• HSIC ( AppendixA )
•  
•
HFVAE [Esmaeili+ 2018]
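$z_G = \{z_k\}_{k \in G}$
$$\mathcal{L}_{\mathrm{HSIC\text{-}VAE}}(\theta, \phi) = \mathcal{L}_{\mathrm{VAE}}(\theta, \phi) + \lambda_2\, \mathrm{HSIC}\left(q_\phi(z_{G_1}), q_\phi(z_{G_2})\right)$$
$s$
$\mathrm{HSIC}(q_\phi(z), p(s))$
$p(s)$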
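The HSIC penalty can be estimated from paired samples. The sketch below uses the biased empirical estimator from [Gretton+ 2005], tr(KHLH) / (n - 1)^2, with Gaussian RBF kernels; the bandwidth and the toy latent groups `z_g1`, `z_g2_dep`, `z_g2_ind` are assumptions made for this example only.

```python
import numpy as np

def rbf_gram(a, bandwidth=1.0):
    """Gram matrix of a Gaussian RBF kernel for samples a of shape (n, d)."""
    sq = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def hsic_biased(x, y):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2 with centering matrix H."""
    n = len(x)
    K, L = rbf_gram(x), rbf_gram(y)
    H = np.eye(n) - 1.0 / n  # I - (1/n) * ones * ones^T
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2

rng = np.random.default_rng(0)
z_g1 = rng.normal(size=(200, 1))                    # first latent group
z_g2_dep = z_g1 + 0.1 * rng.normal(size=(200, 1))   # strongly dependent on z_g1
z_g2_ind = rng.normal(size=(200, 1))                # independent of z_g1
hsic_dep = hsic_biased(z_g1, z_g2_dep)
hsic_ind = hsic_biased(z_g1, z_g2_ind)
```

The dependent pair yields a clearly larger value than the independent pair, so minimizing this quantity between latent groups encourages their independence.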

PixelGAN-AE [Makhzani+ 2017]
• PixelCNN[van den Oord+ 2016]  
•
• VAE loss KL  
 
 
• KL GAN
VIB[Alemi+ 2016]  
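Information dropout[Achille+ 2018]
$$\mathcal{L}_{\mathrm{PixelGAN\text{-}AE}}(\theta, \phi) = \mathcal{L}_{\mathrm{VAE}}(\theta, \phi) - I_{q_\phi}(x; z)$$
$$\mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)\right] = I_{q_\phi}(x; z) + D_{\mathrm{KL}}\left(q_\phi(z) \,\|\, p(z)\right)$$
$I_{q_\phi}(x; z)$
$D_{\mathrm{KL}}(q_\phi(z) \,\|\, p(z))$
: [Makhzani+ 2017]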

Variational Fair Autoencoder (VFAE) [Louizos+ 2016]
•
• VAE loss MMD
•
• MMD HSIC HSIC-VAE[Lopez+ 2018]
• 2 VFAE[Louizos+ 2016] HSIC-VAE [Lopez+ 2018]  
Fader Network[Lample+ 2017]  
DC-IGN[Kulkarni+ 2015]
q(z|s = k)
s
s
s z
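$\mathcal{L}_{\mathrm{VAE}}$, $q(z \mid s = k')$
$$\mathcal{L}_{\mathrm{VFAE}}(\theta, \phi) = \mathcal{L}_{\mathrm{VAE}} + \lambda_2 \sum_{\ell=2}^{K} \mathrm{MMD}\left(q_\phi(z \mid s = \ell), q_\phi(z \mid s = 1)\right)$$
$$q_\phi(z \mid s = \ell) = \sum_{i : s^{(i)} = \ell} \frac{1}{\left|\{i : s^{(i)} = \ell\}\right|}\, q_\phi\left(z \mid x^{(i)}, s^{(i)}\right)$$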
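The MMD sum in the VFAE loss compares the latent distribution of every group s = ℓ against the group s = 1. The sketch below approximates each group distribution by latent samples rather than by the mixture of encoder distributions in the formula, and uses a biased RBF-kernel MMD estimate; both choices, the bandwidth, and the toy data are assumptions of this example.

```python
import numpy as np

def rbf_mmd(a, b, bandwidth=1.0):
    """Biased MMD estimate between sample sets a and b with a Gaussian RBF kernel."""
    def gram(u, v):
        sq = np.sum((u[:, None, :] - v[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2.0 * bandwidth ** 2))
    return float(gram(a, a).mean() + gram(b, b).mean() - 2.0 * gram(a, b).mean())

def vfae_penalty(z, s):
    """Sum over groups l >= 2 of MMD(z | s = l, z | s = first group)."""
    groups = np.unique(s)
    reference = z[s == groups[0]]
    return sum(rbf_mmd(z[s == g], reference) for g in groups[1:])

rng = np.random.default_rng(0)
s = np.repeat([0, 1, 2], 100)              # hypothetical sensitive attribute, K = 3
z_indep = rng.normal(size=(300, 2))        # latents that ignore s
z_dep = z_indep + s[:, None]               # latents shifted by s
penalty_indep = vfae_penalty(z_indep, s)
penalty_dep = vfae_penalty(z_dep, s)
```

Latents that carry information about s receive a much larger penalty than latents that do not.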

• )
H:
A:
N:
C: Categorical
L: Learned prior

VAE
M2 [Kingma+ 2014b]
•
•
• loss  
• M1 (M1+M2 )
•
• DL Hacks Semi-supervised Learning with Deep Generative Models 
https://www.slideshare.net/YuusukeIwasawa/dl-hacks2015-0421iwasawa
• Semi-Supervised Learning with Deep Generative Models pixyz  
https://qiita.com/kogepan102/items/22b685ce7e9a51fbab98
$$q_\phi(z, y \mid x) = q_\phi(z \mid y, x)\, q_\phi(y \mid x)$$
x z y
x
qϕ(z, y|x)
qϕ(z|y, x) ℒVAEy

VLAE
Variational Lossy Autoencoder (VLAE) [Chen+ 2017]
•  
•  
• )  
 
 
 
 
PixelVAE[Gulrajani+ 2017]  
LadderVAE[Sønderby+ 2016] VLaAE[Zhao+ 2017b]
pθ(x|z) z
z
pθ(x|z) W(j)
$$p_\theta(x \mid z) = \prod_j p_\theta\left(x_j \mid z, x_{W(j)}\right)$$
j

meta-prior
• meta-prior
• ) MNIST  
) (SVAE) [Johnson+ 2016]
p(z)
N:
C: Categorical
M: mixture
G:
L: Learned prior

• Denoising Autoencoder (DAE) [Vincent+ 2008]
• [Yingzhen+ 2018] [Hsieh+2018]
• [Villegas+ 2017] [Denton+ 2017] [Fraccaro+ 2017]

discriminator
•
• Adversarially Learned Inference (ALI) [Dumoulin+ 2017]
• Bidirectional GAN (BiGAN) [Donahue+ 2017]
$q_\phi(z \mid x)$, $p_\theta(x \mid z)$
$p_\theta(x \mid z)\, p(z)$, $q_\phi(z \mid x)\, \hat{p}(x)$
: [Dumoulin+ 2017]
: [Donahue+ 2017]

Rate-Distortion-Usefulness Tradeoff

Rate-Distortion Tradeoff
meta-prior
• ) βVAE [Higgins+ 2017]
FaderNetwork[Lample+ 2017]
”Rate-Distortion Tradeoff”[Alemi+ 2018a]

Rate-Distortion Tradeoff [Alemi+ 2018a]
• Rate Distortion )
• ELBO
• Rate  
•
• [Alemi+ 2018a] Rate  
•
$$H - D \le R$$
: [Alemi+ 2018a]
$$D = H - R$$
$$\min_{\phi, \theta}\; D + |\sigma - R|$$
$\sigma$
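The rate-targeted objective above can be sketched from per-example quantities. This example assumes the usual identification from [Alemi+ 2018a]: the distortion D is the average reconstruction term and the rate R is the average KL term of the VAE loss, so that the VAE loss equals D + R. The per-example numbers and the target rate `sigma` are hypothetical.

```python
import numpy as np

def rate_distortion(recon_nll, kl):
    """Distortion D = mean reconstruction NLL, rate R = mean KL(q(z|x) || p(z))."""
    return float(np.mean(recon_nll)), float(np.mean(kl))

def rate_targeted_objective(distortion, rate, sigma):
    """D + |sigma - R|: minimize distortion while pinning the rate near sigma."""
    return distortion + abs(sigma - rate)

recon_nll = np.array([90.0, 110.0])  # hypothetical per-example -log p(x|z)
kl = np.array([4.0, 6.0])            # hypothetical per-example KL terms
D, R = rate_distortion(recon_nll, kl)
vae_loss = D + R                      # the plain VAE loss
objective = rate_targeted_objective(D, R, sigma=8.0)
```

Unlike the VAE loss, which always rewards a lower rate, this objective penalizes the rate for deviating from σ in either direction.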

Rate
• ( )
• )
•
• )  
z
z

• 3 ”usefulness”
•
•  
R-D usefulness  

Usefulness
•
•
•
• [Alemi+ 2018b]  
….?( )
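$$D_y = -\iiint p(x, y)\, q_\phi(z \mid x) \log p_\theta(y \mid z)\, dx\, dy\, dz = \mathbb{E}_{p(x, y)}\left[\mathbb{E}_{q_\phi(z \mid x)}\left[-\log p_\theta(y \mid z)\right]\right]$$
$y$
$R - D_y$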

• meta-prior  
• ( )
•
• supervision
• Rate-Distortion
• “usefulness”

• Rate-Distortion-Usefulness
• z
ex) GQN
• Meta-Prior
• meta-learning
• [DL ]Meta-Learning Probabilistic Inference for Prediction  
https://www.slideshare.net/DeepLearningJP2016/dlmetalearning-probabilistic-inference-for-prediction-126167192
• usefulness ( )
•
• Pixyz Pixyzoo ( )

Pixyz & Pixyzoo
Pixyz https://github.com/masa-su/pixyz
• (PyTorch )
•  
 
Pixyzoo https://github.com/masa-su/pixyzoo
• Pixyz
• GQN VIB
• [DLHacks]PyTorch, Pixyz Generative Query Network  
https://www.slideshare.net/DeepLearningJP2016/dlhackspytorch-pixyzgenerative-query-network-126329901

References
[Achille+ 2018] A. Achille and S. Soatto, “Information dropout: Learning optimal representations through noisy computation,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. https://ieeexplore.ieee.org/document/8253482
[Alemi+ 2016] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” in International
Conference on Learning Representations, 2016. https://openreview.net/forum?id=HyxQzBceg
[Alemi+ 2018a] A. Alemi, B. Poole, I. Fischer, J. Dillon, R. A. Saurous, and K. Murphy, “Fixing a broken ELBO,” in Proc. of the
International Conference on Machine Learning, 2018, pp. 159–168. http://proceedings.mlr.press/v80/alemi18a.html
[Alemi+ 2018b] A. A. Alemi and I. Fischer, “TherML: Thermodynamics of machine learning,” arXiv:1807.04162, 2018.
https://arxiv.org/abs/1807.04162
[Bengio+ 2013] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
https://ieeexplore.ieee.org/document/6472238
[Chen+ 2017] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel, “Variational
lossy autoencoder,” in International Conference on Learning Representations, 2017.
https://openreview.net/forum?id=BysvGP5ee
[Chen+ 2018] T. Q. Chen, X. Li, R. Grosse, and D. Duvenaud, “Isolating sources of disentanglement in variational
autoencoders,” in Advances in Neural Information Processing Systems, 2018. http://papers.nips.cc/paper/7527-isolating-
sources-of-disentanglement-in-variational-autoencoders

[Denton+ 2017] E. L. Denton and V. Birodkar, “Unsupervised learning of disentangled representations from video,” in Advances
in Neural Information Processing Systems, 2017, pp. 4414–4423. https://papers.nips.cc/paper/7028-unsupervised-learning-of-
disentangled-representations-from-video
[Donahue+ 2017] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” in International Conference on
Learning Representations, 2017. https://openreview.net/forum?id=BJtNZAFgg
[Dumoulin+ 2017] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, “Adversarially
learned inference,” in International Conference on Learning Representations, 2017. https://openreview.net/forum?id=B1ElR4cgg
[Dupont 2018] E. Dupont, “Learning disentangled joint continuous and discrete representations,” in Advances in Neural
Information Processing Systems, 2018. http://papers.nips.cc/paper/7351-learning-disentangled-joint-continuous-and-discrete-
representations
[Esmaeili+ 2018] B. Esmaeili, H. Wu, S. Jain, A. Bozkurt, N. Siddharth, B. Paige, D. H. Brooks, J. Dy, and J.-W. van de Meent, “Structured
disentangled representations,” arXiv:1804.02086, 2018. https://arxiv.org/abs/1804.02086
[Fraccaro+ 2017] M. Fraccaro, S. Kamronn, U. Paquet, and O. Winther, “A disentangled recognition and nonlinear dynamics
model for unsupervised learning,” in Advances in Neural Information Processing Systems, 2017, pp. 3601–3610.
https://papers.nips.cc/paper/6951-a-disentangled-recognition-and-nonlinear-dynamics-model-for-unsupervised-learning
[Gretton+ 2005] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, “Measuring statistical dependence with Hilbert-Schmidt
norms,” in International Conference on Algorithmic Learning Theory. Springer, 2005, pp. 63–77.
https://link.springer.com/chapter/10.1007/11564089_7
[Gulrajani+ 2017] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville, “PixelVAE: A latent
variable model for natural images,” in International Conference on Learning Representations, 2017.
https://openreview.net/forum?id=BJKYvt5lg

[Higgins+ 2017] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-VAE:
Learning basic visual concepts with a constrained variational framework,” in International Conference on Learning
Representations, 2017. https://openreview.net/forum?id=Sy2fzU9gl
[Hoffman+ 2016] M. D. Hoffman and M. J. Johnson, “ELBO surgery: yet another way to carve up the variational evidence lower
bound,” in Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.
http://approximateinference.org/accepted/HoffmanJohnson2016.pdf
[Hsieh+2018] J.-T. Hsieh, B. Liu, D.-A. Huang, L. Fei-Fei, and J. C. Niebles, “Learning to decompose and disentangle
representations for video prediction,” in Advances in Neural Information Processing Systems, 2018.
http://papers.nips.cc/paper/7333-learning-to-decompose-and-disentangle-representations-for-video-prediction
[Johnson+ 2016] M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta, “Composing graphical models with
neural networks for structured representations and fast inference,” in Advances in Neural Information Processing Systems,
2016, pp. 2946–2954. https://papers.nips.cc/paper/6379-composing-graphical-models-with-neural-networks-for-structured-
representations-and-fast-inference
[Kim+ 2018] H. Kim and A. Mnih, “Disentangling by factorising,” in Proc. of the International Conference on Machine Learning,
2018, pp. 2649–2658. http://proceedings.mlr.press/v80/kim18b.html
[Kingma+ 2014a] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in International Conference on Learning
Representations, 2014. https://openreview.net/forum?id=33X9fd2-9FyZd
[Kingma+ 2014b] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi-supervised learning with deep generative
models,” in Advances in Neural Information Processing Systems, 2014, pp. 3581–3589. https://papers.nips.cc/paper/5352-semi-
supervised-learning-with-deep-generative-models

[Kulkarni+ 2015] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, “Deep convolutional inverse graphics network,” in
Advances in Neural Information Processing Systems, 2015, pp. 2539–2547. https://papers.nips.cc/paper/5851-deep-
convolutional-inverse-graphics-network
[Kumar+ 2018] A. Kumar, P. Sattigeri, and A. Balakrishnan, “Variational inference of disentangled latent concepts from
unlabeled observations,” in International Conference on Learning Representations, 2018.
https://openreview.net/forum?id=H1kG7GZAW
[Lample+ 2017] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer et al., “Fader networks: Manipulating images by
sliding attributes,” in Advances in Neural Information Processing Systems, 2017, pp. 5967–5976.
https://papers.nips.cc/paper/7178-fader-networksmanipulating-images-by-sliding-attributes
[Locatello+ 2018] F. Locatello, S. Bauer, M. Lucic, S. Gelly, B. Schölkopf, and O. Bachem, “Challenging common assumptions
in the unsupervised learning of disentangled representations,” arXiv:1811.12359, 2018. https://arxiv.org/abs/1811.12359
[Lopez+ 2018] R. Lopez, J. Regier, M. I. Jordan, and N. Yosef, “Information constraints on auto-encoding variational bayes,” in
Advances in Neural Information Processing Systems, 2018. https://papers.nips.cc/paper/7850-information-constraints-on-auto-
encoding-variational-bayes
[Louizos+ 2016] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel, “The variational fair autoencoder,” in International
Conference on Learning Representations, 2016. https://arxiv.org/abs/1511.00830
[Makhzani+ 2017] A. Makhzani and B. J. Frey, “PixelGAN autoencoders,” in Advances in Neural Information Processing
Systems, 2017, pp. 1975–1985. https://papers.nips.cc/paper/6793-pixelgan-autoencoders
[Sønderby+ 2016] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther, “Ladder variational autoencoders,” in
Advances in Neural Information Processing Systems, 2016, pp. 3738–3746. https://papers.nips.cc/paper/6275-ladder-
variational-autoencoders

[van den Oord+ 2016] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, and A. Graves, “Conditional image
generation with PixelCNN decoders,” in Advances in Neural Information Processing Systems, 2016, pp. 4790–4798.
https://papers.nips.cc/paper/6527-conditional-image-generation-with-pixelcnn-decoders
[van den Oord+ 2017] A. van den Oord, O. Vinyals et al., “Neural discrete representation learning,” in Advances in Neural
Information Processing Systems, 2017, pp. 6306–6315. https://papers.nips.cc/paper/7210-neural-discrete-representation-
learning
[Villegas+ 2017] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, “Decomposing motion and content for natural video
sequence prediction,” in International Conference on Learning Representations, 2017.
https://openreview.net/forum?id=rkEFLFqee
[Vincent+ 2008] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with
denoising autoencoders,” in Proc. of the International Conference on Machine Learning, 2008, pp. 1096–1103.
https://dl.acm.org/citation.cfm?id=1390294
[Yingzhen+ 2018] L. Yingzhen and S. Mandt, “Disentangled sequential autoencoder,” in Proc. of the International Conference
on Machine Learning, 2018, pp. 5656–5665. http://proceedings.mlr.press/v80/yingzhen18a.html
[Zhao+ 2017a] S. Zhao, J. Song, and S. Ermon, “InfoVAE: Information maximizing variational autoencoders,” arXiv:1706.02262,
2017. https://arxiv.org/abs/1706.02262
[Zhao+ 2017b] S. Zhao, J. Song, and S. Ermon, “Learning hierarchical features from deep generative models,” in Proc. of the
International Conference on Machine Learning, 2017, pp. 4091–4099. http://proceedings.mlr.press/v70/zhao17c.html