Generative models: VAE and GANs
Jinhwan Suk
Department of Mathematical Science, KAIST
May 7, 2020
Contents
Introduction to Information Theory
What is a Generative Model?
Example 1 : VAE
Example 2 : GANs
Introduction to Information Theory
Introduction
Information theory is a branch of applied mathematics.
Originally proposed by Claude Shannon in 1948.
A key measure in information theory is entropy.
Basic Intuition
Learning that an unlikely event has occurred is more informative than
learning that a likely event has occurred.
• Message 1: "the sun rose this morning"
• Message 2: "there was a solar eclipse this morning"
Message 2 is much more informative than Message 1.
Introduction to Information Theory
Formalization of Intuitions
Likely events should have low information; events that are guaranteed to happen should have no information content whatsoever.
Less likely events should have higher information content.
Independent events should have additive information: e.g., learning that a tossed coin has come up heads twice should convey twice as much information as learning that it came up heads once.
Properties of the information function I(x) = IX(x)
I(x) is a function of P(x).
I(x) decreases as P(x) increases.
I(x) = 0 if P(x) = 1.
I(X1 = x1, X2 = x2) = I(X1 = x1) + I(X2 = x2) for independent X1 and X2.
Introduction to Information Theory
Formalization of Intuitions
Let X1 and X2 be independent random variables with
P(X1 = x1) = p1 and P(X2 = x2) = p2
Then, we have
I(X1 = x1, X2 = x2) = I(P(x1, x2))
= I(P(X1 = x1)P(X2 = x2))
= I(p1p2)
= I(p1) + I(p2)
So I, viewed as a function of the probability, satisfies I(p1p2) = I(p1) + I(p2); together with monotonicity, this forces I(p) = k log p for some k < 0 (taking k = −1 gives the definition of self-information below).
Introduction to Information Theory
Measure of Information
Definition (Self-Information)
The self-information of an event X = x is
I(x) = − log P(x).
Self-information measures the information (or uncertainty, surprise) of a single event.
Definition (Shannon Entropy)
The Shannon entropy is the expected amount of information in an entire probability distribution, defined by
H(X) = EX∼P[I(X)] = −EX∼P[log P(X)].
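As a quick numerical illustration (a minimal sketch, not part of the original slides; the probabilities below are made up), both quantities can be computed directly for discrete distributions, using the natural logarithm so the units are nats:

import numpy as np

def self_information(p):
    # I(x) = -log P(x), in nats
    return -np.log(p)

def entropy(probs):
    # H(X) = -sum_x P(x) log P(x), in nats
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(probs * np.log(probs)))

# the rarer event carries more information ("solar eclipse" vs "sun rose")
print(self_information(0.999))   # ~0.001 nats
print(self_information(0.001))   # ~6.9 nats

# a fair coin maximizes entropy among two-outcome distributions
print(entropy([0.5, 0.5]))       # ~0.693 nats (= log 2)
print(entropy([0.9, 0.1]))       # ~0.325 nats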
Introduction to Information Theory
Density Estimation
In classification problems, we usually want to describe P(Y|X) for each input X.
Many models cθ therefore aim to estimate this conditional probability distribution by choosing an optimal θ̂ such that
cθ̂(x)[i] ≈ P(Y = yi | X = x),
as in the softmax classifier or logistic regression.
So we can regard the classification problem as a regression problem that minimizes
R(cθ) = EX[L(cθ(X), P(Y|X))]
(L measures the distance between two probability distributions)
Introduction to Information Theory
Two ways of measuring distance between probability distributions
Definition (Total variation)
The total variation distance between two probability measures Pθ and
Pθ∗ is defined by
TV(Pθ, Pθ∗) = max_{A : event} |Pθ(A) − Pθ∗(A)|.
Definition (Kullback-Leibler divergence)
The KL divergence between two probability measures Pθ and Pθ∗ is
defined by
DKL(Pθ || Pθ∗) = EX∼Pθ[log Pθ(X) − log Pθ∗(X)].
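For two small discrete distributions, both distances are easy to compute directly (a sketch; the two distributions below are made up for illustration):

import numpy as np

p = np.array([0.5, 0.3, 0.2])   # P_theta
q = np.array([0.4, 0.4, 0.2])   # P_theta*

# total variation: max over events A of |P(A) - Q(A)|;
# for discrete distributions this equals half the L1 distance
tv = 0.5 * np.sum(np.abs(p - q))

# KL divergence: E_{X~P}[log P(X) - log Q(X)]
kl = np.sum(p * (np.log(p) - np.log(q)))

print(tv)   # 0.1
print(kl)   # ~0.025; note DKL(p||q) != DKL(q||p) in general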
Introduction to Information Theory
Cross-Entropy
We usually use the KL divergence because finding an estimator of θ is much easier with it.
DKL(Pθ || Pθ∗) = EX∼Pθ[log Pθ(X) − log Pθ∗(X)]
= EX∼Pθ[log Pθ(X)] − EX∼Pθ[log Pθ∗(X)]
= constant − EX∼Pθ[log Pθ∗(X)]
Hence, minimizing the KL divergence is equivalent to minimizing −EX∼Pθ[log Pθ∗(X)], which is called the cross-entropy. Estimation with the estimator that minimizes the KL divergence (equivalently, the cross-entropy) is called the maximum likelihood principle.
Introduction to Information Theory
Maximum Likelihood Estimation
Pθ∗ is the distribution of the population, and we want to choose a proper estimator θ̂ by minimizing the distance between Pθ∗ and Pθ̂:
DKL(Pθ∗ || Pθ̂) = const − EX∼Pθ∗[log Pθ̂(X)]
If X1, X2, ..., Xn are random samples, then by the LLN,
EX∼Pθ∗[log Pθ̂(X)] ≈ (1/n) Σ_{i=1}^n log Pθ̂(Xi)
∴ DKL(Pθ∗ || Pθ̂) ≈ const − (1/n) Σ_{i=1}^n log Pθ̂(Xi)
Introduction to Information Theory
Maximum Likelihood Estimation
min_{θ∈Θ} DKL(Pθ∗ || Pθ) ⇐⇒ min_{θ∈Θ} −(1/n) Σ_{i=1}^n log Pθ(Xi)
⇐⇒ max_{θ∈Θ} (1/n) Σ_{i=1}^n log Pθ(Xi)
⇐⇒ max_{θ∈Θ} Σ_{i=1}^n log Pθ(Xi)
⇐⇒ max_{θ∈Θ} Π_{i=1}^n Pθ(Xi)
This is the maximum likelihood principle: the θ that (approximately) minimizes the KL divergence is exactly the θ that maximizes the likelihood of the sample.
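The chain above can be checked numerically. A minimal sketch, assuming a Gaussian model with unknown mean and unit variance and synthetic data whose true mean is 2.0: minimizing the average negative log-likelihood over a grid of candidate θ recovers the familiar estimator, the sample mean.

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=1000)   # X_i ~ N(theta*, 1) with theta* = 2

def avg_neg_log_likelihood(theta, xs):
    # -(1/n) sum_i log P_theta(X_i) for the model P_theta = N(theta, 1)
    return np.mean(0.5 * (xs - theta) ** 2 + 0.5 * np.log(2 * np.pi))

thetas = np.linspace(0.0, 4.0, 401)
nlls = [avg_neg_log_likelihood(t, samples) for t in thetas]
theta_hat = thetas[int(np.argmin(nlls))]

print(theta_hat)        # close to 2.0
print(samples.mean())   # the closed-form MLE for this model: the sample mean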
Introduction to Information Theory
Return to the main goal: find an estimator θ̂ that minimizes
R(cθ) = EX[L(cθ(X), P(Y|X))].
Suppose that X1, X2, ..., Xn are i.i.d. and cross-entropy is used for L. Then
EX[L(cθ(X), P(Y|X))] ≈ (1/n) Σ_{i=1}^n L(cθ(Xi), P(Y|Xi))
= (1/n) Σ_{i=1}^n −E_{Y∼Pemp(Y|Xi)}[log cθ(Xi)[Y]]
= (1/n) Σ_{i=1}^n − log{cθ(Xi)[Yi,true]},
where the last step uses the fact that the empirical conditional Pemp(Y|Xi) puts all of its mass on the observed label Yi,true.
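This is exactly the cross-entropy loss used to train softmax classifiers. A minimal numpy sketch (the predicted class probabilities and labels below are made up):

import numpy as np

def cross_entropy_loss(probs, true_labels):
    # (1/n) sum_i -log c_theta(X_i)[Y_i,true], where probs[i] = c_theta(X_i)
    n = probs.shape[0]
    return float(np.mean(-np.log(probs[np.arange(n), true_labels])))

# predicted class probabilities for 3 samples over 4 classes
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
true_labels = np.array([0, 1, 3])

print(cross_entropy_loss(probs, true_labels))   # -(log 0.7 + log 0.5 + log 0.7)/3 ~ 0.47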
What is a Generative Model?
Generative Model vs Discriminative model
A generative model is a statistical model of the joint distribution on X × Y, P(X, Y).
A discriminative model is a model of the conditional probability of the target given an observation x, P(Y|X = x).
In unsupervised learning, a generative model usually means a statistical model of P(X).
How can we estimate the joint (or conditional) distribution?
What do we obtain while estimating the probability distribution?
What can we do with a generative model?
Example of Discriminative Model
Simple Linear Regression
Assumption: P(y|x) = N(α + βx, σ²), where σ > 0 is known.
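Under this assumption, maximizing the likelihood of P(y|x) over α and β is the same as ordinary least squares. A quick sketch on synthetic data (the true values α = 1, β = 2 and the noise level are made up):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)   # y|x ~ N(alpha + beta*x, sigma^2)

# least squares = maximum likelihood under the Gaussian noise assumption
A = np.column_stack([np.ones_like(x), x])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

print(alpha_hat, beta_hat)   # close to (1.0, 2.0)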
Concept of VAE
Goal: estimate the population distribution using the given observations.
Strong assumption on the existence of latent variables: Z ∼ N(0, I), with
X|Z ∼ N(f(Z; θ), σ² · I) (continuous data), or
X|Z ∼ Bernoulli(f(Z; θ)) (binary data).
Let Pemp be the empirical distribution (assumption: Pemp ≈ Ppop). Then
arg min_θ DKL(Pemp(X) || Pθ(X)) = arg min_θ {const − EX∼Pemp[log Pθ(X)]}
= arg max_θ EX∼Pemp[log Pθ(X)]
= arg max_θ (1/N) Σ_{i=1}^N log Pθ(Xi)
Concept of VAE
Naive approach
Maximize Pθ(Xi) w.r.t. θ for each sample X1, X2, ..., Xn.
=⇒ But Pθ(Xi) is intractable:
Pθ(Xi) = ∫_Z Pθ(Xi, z) dz = ∫_Z Pθ(Xi|z) P(z) dz ≈ (1/n) Σ_{j=1}^n Pθ(Xi|Zj) P(Zj)
If we pick n large, the approximation is quite accurate, but for efficiency we look for some other way that lets us keep n small.
Concept of VAE
ELBO
Use the Monte Carlo method on
Pθ(Xi) = ∫_Z Pθ(Xi, z) dz = ∫_Z Pθ(z|Xi) P(Xi) dz ≈ (1/n) Σ_{j=1}^n Pθ(Zj|Xi) P(Xi)
Pick Zj where Pθ(Zj|Xi) is high ⇒ intractable.
Set Qφ(Z|X) ∼ N(µφ(X), σφ(X)²) to approximate Pθ(Z|X).
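In practice Qφ is an "encoder" network that maps X to the mean and variance of this Gaussian. A minimal PyTorch sketch of that mapping and of sampling Z from it; the layer sizes are made up, and the z = µ + σ·ε sampling step is the standard reparameterization trick, which the slide does not spell out:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # amortized posterior Q_phi(Z|X) = N(mu_phi(X), sigma_phi(X)^2)
    def __init__(self, x_dim=784, z_dim=20, hidden=400):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.log_var = nn.Linear(hidden, z_dim)   # predict log sigma^2 for numerical stability

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

def sample_z(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I), so gradients flow through mu and sigma
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps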
Concept of VAE
DKL(Qφ(Z|X) || Pθ(Z|X)) = EZ∼Q|X[log Qφ(Z|X) − log Pθ(Z|X)]
= EZ∼Q|X[log Qφ(Z|X) − log Pθ(X, Z)] + log Pθ(X)
We want to maximize log Pθ(X) and minimize
DKL(Qφ(Z|X)||Pθ(Z|X)) at once.
Define L(θ, φ, X) = EZ∼Q|X [log Pθ(X, Z) − log Qφ(Z|X)]
log Pθ(X) − DKL(Qφ(Z|X)||Pθ(Z|X)) = L(θ, φ, X)
Concept of VAE
ELBO
L(θ, φ, X) = EZ∼Q|X [log Pθ(X, Z) − log Qφ(Z|X)]
= EZ∼Q|X [log Pθ(X|Z) + log Pθ(Z) − log Qφ(Z|X)]
= EZ∼Q|X [log Pθ(X|Z)] − DKL(Qφ(Z|X)||Pθ(Z))
DKL(Qφ(Z|X) || Pθ(Z)) can be integrated analytically:
DKL(Qφ(Z|X) || Pθ(Z)) = −(1/2)(1 + log σφ(X)² − µφ(X)² − σφ(X)²)
EZ∼Q|X[log Pθ(X|Z)] requires estimation by sampling:
EZ∼Q|X[log Pθ(X|Z)] ≈ (1/n) Σ_{i=1}^n log Pθ(X|zi) = (1/n) Σ_{i=1}^n [ −(X − f(zi; θ))²/(2σ²) − log(√(2πσ²)) ]
Concept of VAE
ELBO
Maximizing L(θ, φ, X) is equal to minimizing (up to additive constants)
(1/n) Σ_{i=1}^n (X − f(zi; θ))²/(2σ²) − (1/2)(1 + log σφ(X)² − µφ(X)² − σφ(X)²)
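Putting the two terms together gives the usual VAE training loss (the negative ELBO). A hedged PyTorch sketch, assuming a Gaussian decoder with fixed σ as on the slide, inputs of shape (batch, features), a single sample from Qφ per data point, and a placeholder decoder network standing in for f(z; θ):

import torch

def vae_loss(x, decoder, mu, log_var, sigma=0.1):
    # negative ELBO: reconstruction term + KL(Q_phi(Z|X) || N(0, I))
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # one sample from Q_phi(Z|X)
    x_hat = decoder(z)                                         # f(z; theta)

    # -log P_theta(X|Z) up to an additive constant, for X|Z ~ N(f(z; theta), sigma^2 I)
    recon = ((x - x_hat) ** 2).sum(dim=1) / (2 * sigma ** 2)

    # D_KL(N(mu, sigma_phi^2) || N(0, I)) = -1/2 * sum(1 + log sigma_phi^2 - mu^2 - sigma_phi^2)
    kl = -0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum(dim=1)

    return (recon + kl).mean()

Minimizing this quantity with respect to θ and φ maximizes the ELBO L(θ, φ, X).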
Concept of VAE
Problem of the above formulation
Since Pθ(Xi) ≈ (1/n) Σ_{j=1}^n P(Xi|zj) P(zj) and we use n = 1,
log Pθ(Xi) ≈ log[P(Xi|z1) P(z1)]
= log P(Xi|z1) + log P(z1)
= log[(1/(√(2π)σ)) exp(−(Xi − f(z1; θ))²/(2σ²))] + log[(1/√(2π)) exp(−z1²/2)]
= −(Xi − f(z1; θ))²/(2σ²) + const.
Therefore, maximizing log Pθ(Xi) reduces to minimizing (Xi − f(z1; θ))²/(2σ²).
Concept of VAE
Problem of the above formulation
To address this problem, we should set σ very small
Concept of GANs
Introduction
Goal: estimate the population distribution using the given observations.
Strong assumption on the existence of latent variables: Z ∼ PZ.
Define G(z; θg), a mapping from the latent space to the data space, and let Pg be the induced distribution, Pg(X = x) = PZ(G(Z) = x).
Define D(x; θd), which represents the probability that x is real.
min_G max_D V(D, G) = Ex∼Pemp[log D(x)] + Ez∼PZ[log(1 − D(G(z)))]
What is the difference between VAE and GANs?
⇒ GANs do not model P(X) explicitly.
⇒ But we can show that the objective has a global optimum at Pg = Pemp.
⇒ So GANs can still be regarded as generative models.
Concept of GANs
Algorithm
V(D, G) = Ex∼Pemp[log D(x)] + Ez∼PZ[log(1 − D(G(z)))]
≈ (1/m) Σ_{i=1}^m log D(xi) + (1/m) Σ_{j=1}^m log(1 − D(G(zj)))
1. Sample a minibatch of m noise samples {z1, ..., zm} and a minibatch of m examples {x1, ..., xm}.
2. Update the discriminator by ascending its stochastic gradient:
∇θd (1/m) Σ_{i=1}^m [log D(xi) + log(1 − D(G(zi)))]
3. Sample a minibatch of m noise samples {z1, ..., zm}.
4. Update the generator by descending its stochastic gradient:
∇θg (1/m) Σ_{i=1}^m log(1 − D(G(zi)))
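A hedged PyTorch sketch of one iteration of this algorithm; the generator G, the discriminator D (assumed to end with a sigmoid so its output is a probability), the two optimizers, and the latent dimension are placeholders supplied by the caller:

import torch

def gan_step(G, D, opt_g, opt_d, real_x, z_dim=100):
    m = real_x.size(0)

    # steps 1-2: update D by ascending (1/m) sum [log D(x_i) + log(1 - D(G(z_i)))];
    # ascending the gradient is implemented as minimizing the negative
    z = torch.randn(m, z_dim)
    d_loss = -(torch.log(D(real_x)) + torch.log(1 - D(G(z).detach()))).mean()
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # steps 3-4: update G by descending (1/m) sum log(1 - D(G(z_i)))
    z = torch.randn(m, z_dim)
    g_loss = torch.log(1 - D(G(z))).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    return d_loss.item(), g_loss.item()

In practice the generator is often trained by ascending log D(G(z)) instead, since log(1 − D(G(z))) saturates early in training; the sketch above follows the slide's version.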
Concept of GANs
Global optimality of Pg = Pemp
Proposition 1
For G fixed, the optimal discriminator D is
D*_G(x) = Pemp(x) / (Pemp(x) + Pg(x))
Proposition 2
The global minimum of the virtual training criterion is achieved if and only
if Pg = Pemp.
Proof of Proposition 1:
Let the generator G be fixed and define Ax = {z ∈ Z : G(z) = x}.
V(G, Dθ) = ∫_{x∈X} log(Dθ(x)) Pemp(x) dx + ∫_{z∈Z} log(1 − Dθ(G(z))) PZ(z) dz
= ∫_{x∈X} log(Dθ(x)) Pemp(x) dx + ∫_{x∈X} ∫_{z∈Ax} log(1 − Dθ(G(z))) PZ(z) dz dx
= ∫_{x∈X} log(Dθ(x)) Pemp(x) dx + ∫_{x∈X} log(1 − Dθ(x)) ∫_{z∈Ax} PZ(z) dz dx
= ∫_{x∈X} log(Dθ(x)) Pemp(x) dx + ∫_{x∈X} log(1 − Dθ(x)) Pg(x) dx
= ∫_{x∈X} [log(Dθ(x)) Pemp(x) + log(1 − Dθ(x)) Pg(x)] dx
Proof of Proposition 1 (continued):
V(G, Dθ) achieves its maximum (for fixed G) when
∂/∂θ [log(Dθ(x)) Pemp(x) + log(1 − Dθ(x)) Pg(x)] = 0 ∀x ∈ X
⇔ (∂Dθ(x)/∂θ) / Dθ(x) · Pemp(x) − (∂Dθ(x)/∂θ) / (1 − Dθ(x)) · Pg(x) = 0 ∀x ∈ X
⇔ Dθ̂(x) = Pemp(x) / (Pemp(x) + Pg(x)) ∀x ∈ X
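The pointwise claim is easy to check numerically: for fixed values a = Pemp(x) and b = Pg(x), the integrand a·log D + b·log(1 − D) is maximized at D = a/(a + b). A small sketch with made-up values of a and b:

import numpy as np

a, b = 0.3, 0.1   # P_emp(x) and P_g(x) at some fixed x (made-up values)
d = np.linspace(1e-4, 1 - 1e-4, 100000)
objective = a * np.log(d) + b * np.log(1 - d)

print(d[np.argmax(objective)])   # ~0.75
print(a / (a + b))               # 0.75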
Proof of Proposition 2:
min_G max_D V(G, D) = min_G V(G, D*_G)
= Ex∼Pemp[log D*_G(x)] + Ez∼PZ[log(1 − D*_G(G(z)))]
= Ex∼Pemp[log D*_G(x)] + Ex∼Pg[log(1 − D*_G(x))]
= Ex∼Pemp[log(Pemp(x) / (Pemp(x) + Pg(x)))] + Ex∼Pg[log(Pg(x) / (Pemp(x) + Pg(x)))]
= Ex∼Pemp[log Pemp(x) − log((Pemp(x) + Pg(x))/2)] − log 2 + Ex∼Pg[log Pg(x) − log((Pemp(x) + Pg(x))/2)] − log 2
= DKL(Pemp || (Pemp + Pg)/2) + DKL(Pg || (Pemp + Pg)/2) − 2 log 2
≥ −2 log 2
The equality holds if and only if Pemp = (Pemp + Pg)/2 and Pg = (Pemp + Pg)/2, i.e. if and only if Pg = Pemp.
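As a sanity check of the final bound, the value of V at the optimal discriminator equals −2 log 2 ≈ −1.386 exactly when Pg = Pemp and is strictly larger otherwise. A small sketch for discrete distributions (the distributions below are made up):

import numpy as np

def v_at_optimal_d(p_emp, p_g):
    m = (p_emp + p_g) / 2
    # E_{x~Pemp}[log(Pemp/(Pemp+Pg))] + E_{x~Pg}[log(Pg/(Pemp+Pg))]
    return float(np.sum(p_emp * np.log(p_emp / (2 * m))) + np.sum(p_g * np.log(p_g / (2 * m))))

p_emp = np.array([0.5, 0.3, 0.2])
print(v_at_optimal_d(p_emp, p_emp))                      # -2 log 2 ~ -1.386
print(v_at_optimal_d(p_emp, np.array([0.2, 0.3, 0.5])))  # larger than -1.386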
Thank you