Generative models: VAE and GANs
Jinhwan Suk
Department of Mathematical Science, KAIST
May 7, 2020
Contents
Introduction to Information Theory
What is a Generative Model?
Example 1 : VAE
Example 2 : GANs
Introduction to Information Theory
Introduction
Information theory is a branch of applied mathematics.
Originally proposed by Claude Shannon in 1948.
A key measure in information theory is entropy.
Basic Intuition
Learning that an unlikely event has occurred is more informative than
learning that a likely event has occurred.
• Message 1: "the sun rose this morning"
• Message 2: "there was a solar eclipse this morning"
Message 2 is much more informative than Message 1.
Introduction to Information Theory
Formalization of Intuitions
Likely events should have low information; events that are guaranteed to happen should have no information content whatsoever.
Less likely events should have higher information content.
Independent events should have additive information: e.g., learning that a tossed coin has come up heads twice should convey twice as much information as learning that it came up heads once.
Properties of the information function I(x) = IX(x)
I(x) is a function of P(x).
I(x) decreases as P(x) increases.
I(x) = 0 if P(x) = 1.
I(X1 = x1, X2 = x2) = I(X1 = x1) + I(X2 = x2) for independent X1 and X2.
Introduction to Information Theory
Formalization of Intuitions
Let X1 and X2 be independent random variables with
P(X1 = x1) = p1 and P(X2 = x2) = p2
Then, we have
I(X1 = x1, X2 = x2) = I(P(x1, x2))
= I(P(X1 = x1)P(X2 = x2))
= I(p1p2)
= I(p1) + I(p2)
So I, viewed as a function of the probability, satisfies I(p1p2) = I(p1) + I(p2); together with monotonicity, this forces I(p) = k log p for some k < 0 (taking k = −1 gives the definition of self-information below).
Introduction to Information Theory
Measure of Information
Definition (Self-Information)
The self-information of an event X = x is
I(x) = − log P(x).
Self-information measures the information (or uncertainty, surprise) of a single event.
Definition (Shannon Entropy)
The Shannon entropy is the expected amount of information in an entire probability distribution, defined by
H(X) = EX∼P[I(X)] = −EX∼P[log P(X)].
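As a quick numerical illustration (a minimal sketch, not part of the original slides; the probabilities below are made up), both quantities can be computed directly for discrete distributions, using the natural logarithm so the units are nats:

import numpy as np

def self_information(p):
    # I(x) = -log P(x), in nats
    return -np.log(p)

def entropy(probs):
    # H(X) = -sum_x P(x) log P(x), in nats
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(probs * np.log(probs)))

# the rarer event carries more information ("solar eclipse" vs "sun rose")
print(self_information(0.999))   # ~0.001 nats
print(self_information(0.001))   # ~6.9 nats

# a fair coin maximizes entropy among two-outcome distributions
print(entropy([0.5, 0.5]))       # ~0.693 nats (= log 2)
print(entropy([0.9, 0.1]))       # ~0.325 nats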
Introduction to Information Theory
Density Estimation
In classification problems, we usually want to describe P(Y|X) for each input X.
Many models cθ therefore aim to estimate this conditional probability distribution by choosing an optimal θ̂ such that
cθ̂(x)[i] ≈ P(Y = yi | X = x),
as in the softmax classifier or logistic regression.
So we can regard the classification problem as a regression problem that minimizes
R(cθ) = EX[L(cθ(X), P(Y|X))]
(L measures the distance between two probability distributions)
Introduction to Information Theory
Two ways of measuring distance between probability distributions
Definition (Total variation)
The total variation distance between two probability measures Pθ and
Pθ∗ is defined by
TV(Pθ, Pθ∗) = max_{A : event} |Pθ(A) − Pθ∗(A)|.
Definition (Kullback-Leibler divergence)
The KL divergence between two probability measures Pθ and Pθ∗ is
defined by
DKL(Pθ || Pθ∗) = EX∼Pθ[log Pθ(X) − log Pθ∗(X)].
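For two small discrete distributions, both distances are easy to compute directly (a sketch; the two distributions below are made up for illustration):

import numpy as np

p = np.array([0.5, 0.3, 0.2])   # P_theta
q = np.array([0.4, 0.4, 0.2])   # P_theta*

# total variation: max over events A of |P(A) - Q(A)|;
# for discrete distributions this equals half the L1 distance
tv = 0.5 * np.sum(np.abs(p - q))

# KL divergence: E_{X~P}[log P(X) - log Q(X)]
kl = np.sum(p * (np.log(p) - np.log(q)))

print(tv)   # 0.1
print(kl)   # ~0.025; note DKL(p||q) != DKL(q||p) in general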
Introduction to Information Theory
Cross-Entropy
We usually use the KL divergence because finding an estimator of θ is much easier with it.
DKL(Pθ || Pθ∗) = EX∼Pθ[log Pθ(X) − log Pθ∗(X)]
= EX∼Pθ[log Pθ(X)] − EX∼Pθ[log Pθ∗(X)]
= constant − EX∼Pθ[log Pθ∗(X)]
Hence, minimizing the KL divergence is equivalent to minimizing −EX∼Pθ[log Pθ∗(X)], which is called the cross-entropy. Estimation with the estimator that minimizes the KL divergence (equivalently, the cross-entropy) is called the maximum likelihood principle.
Introduction to Information Theory
Maximum Likelihood Estimation
Pθ∗ is the distribution of the population, and we want to choose a proper estimator θ̂ by minimizing the distance between Pθ∗ and Pθ̂:
DKL(Pθ∗ || Pθ̂) = const − EX∼Pθ∗[log Pθ̂(X)]
If X1, X2, ..., Xn are random samples, then by the LLN,
EX∼Pθ∗[log Pθ̂(X)] ≈ (1/n) Σ_{i=1}^n log Pθ̂(Xi)
∴ DKL(Pθ∗ || Pθ̂) ≈ const − (1/n) Σ_{i=1}^n log Pθ̂(Xi)
Introduction to Information Theory
Maximum Likelihood Estimation
min_{θ∈Θ} DKL(Pθ∗ || Pθ) ⇐⇒ min_{θ∈Θ} −(1/n) Σ_{i=1}^n log Pθ(Xi)
⇐⇒ max_{θ∈Θ} (1/n) Σ_{i=1}^n log Pθ(Xi)
⇐⇒ max_{θ∈Θ} Σ_{i=1}^n log Pθ(Xi)
⇐⇒ max_{θ∈Θ} Π_{i=1}^n Pθ(Xi)
This is the maximum likelihood principle: the θ that (approximately) minimizes the KL divergence is exactly the θ that maximizes the likelihood of the sample.
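The chain above can be checked numerically. A minimal sketch, assuming a Gaussian model with unknown mean and unit variance and synthetic data whose true mean is 2.0: minimizing the average negative log-likelihood over a grid of candidate θ recovers the familiar estimator, the sample mean.

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=1000)   # X_i ~ N(theta*, 1) with theta* = 2

def avg_neg_log_likelihood(theta, xs):
    # -(1/n) sum_i log P_theta(X_i) for the model P_theta = N(theta, 1)
    return np.mean(0.5 * (xs - theta) ** 2 + 0.5 * np.log(2 * np.pi))

thetas = np.linspace(0.0, 4.0, 401)
nlls = [avg_neg_log_likelihood(t, samples) for t in thetas]
theta_hat = thetas[int(np.argmin(nlls))]

print(theta_hat)        # close to 2.0
print(samples.mean())   # the closed-form MLE for this model: the sample mean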
Introduction to Information Theory
Return to the main goal: find an estimator θ̂ that minimizes
R(cθ) = EX[L(cθ(X), P(Y|X))].
Suppose that X1, X2, ..., Xn are i.i.d. and cross-entropy is used for L. Then
EX[L(cθ(X), P(Y|X))] ≈ (1/n) Σ_{i=1}^n L(cθ(Xi), P(Y|Xi))
= (1/n) Σ_{i=1}^n −E_{Y∼Pemp(Y|Xi)}[log cθ(Xi)[Y]]
= (1/n) Σ_{i=1}^n − log{cθ(Xi)[Yi,true]},
where the last step uses the fact that the empirical conditional Pemp(Y|Xi) puts all of its mass on the observed label Yi,true.
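This is exactly the cross-entropy loss used to train softmax classifiers. A minimal numpy sketch (the predicted class probabilities and labels below are made up):

import numpy as np

def cross_entropy_loss(probs, true_labels):
    # (1/n) sum_i -log c_theta(X_i)[Y_i,true], where probs[i] = c_theta(X_i)
    n = probs.shape[0]
    return float(np.mean(-np.log(probs[np.arange(n), true_labels])))

# predicted class probabilities for 3 samples over 4 classes
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
true_labels = np.array([0, 1, 3])

print(cross_entropy_loss(probs, true_labels))   # -(log 0.7 + log 0.5 + log 0.7)/3 ~ 0.47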
What is a Generative Model?
Generative Model vs Discriminative model
A generative model is a statistical model of the joint distribution on X × Y, P(X, Y).
A discriminative model is a model of the conditional probability of the target given an observation x, P(Y|X = x).
In unsupervised learning, a generative model usually means a statistical model of P(X).
How can we estimate the joint (or conditional) distribution?
What do we obtain while estimating the probability distribution?
What can we do with a generative model?
Example of Discriminative Model
Simple Linear Regression
Assumption: P(y|x) = N(α + βx, σ²), where σ > 0 is known.
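Under this assumption, maximizing the likelihood of P(y|x) over α and β is the same as ordinary least squares. A quick sketch on synthetic data (the true values α = 1, β = 2 and the noise level are made up):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)   # y|x ~ N(alpha + beta*x, sigma^2)

# least squares = maximum likelihood under the Gaussian noise assumption
A = np.column_stack([np.ones_like(x), x])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

print(alpha_hat, beta_hat)   # close to (1.0, 2.0)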
Concept of VAE
Goal: estimate the population distribution using the given observations.
Strong assumption on the existence of latent variables: Z ∼ N(0, I), with
X|Z ∼ N(f(Z; θ), σ² · I) (continuous data), or
X|Z ∼ Bernoulli(f(Z; θ)) (binary data).
Let Pemp be the empirical distribution (assumption: Pemp ≈ Ppop). Then
arg min_θ DKL(Pemp(X) || Pθ(X)) = arg min_θ {const − EX∼Pemp[log Pθ(X)]}
= arg max_θ EX∼Pemp[log Pθ(X)]
= arg max_θ (1/N) Σ_{i=1}^N log Pθ(Xi)
Concept of VAE
Naive approach
Maximize Pθ(Xi) w.r.t. θ for each sample X1, X2, ..., Xn.
=⇒ But Pθ(Xi) is intractable:
Pθ(Xi) = ∫_Z Pθ(Xi, z) dz = ∫_Z Pθ(Xi|z) P(z) dz ≈ (1/n) Σ_{j=1}^n Pθ(Xi|Zj) P(Zj)
If we pick n large, the approximation is quite accurate, but for efficiency we look for some other way that lets us keep n small.
Concept of VAE
ELBO
Use the Monte Carlo method on
Pθ(Xi) = ∫_Z Pθ(Xi, z) dz = ∫_Z Pθ(z|Xi) P(Xi) dz ≈ (1/n) Σ_{j=1}^n Pθ(Zj|Xi) P(Xi)
Pick Zj where Pθ(Zj|Xi) is high ⇒ intractable.
Set Qφ(Z|X) ∼ N(µφ(X), σφ(X)²) to approximate Pθ(Z|X).
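In practice Qφ is an "encoder" network that maps X to the mean and variance of this Gaussian. A minimal PyTorch sketch of that mapping and of sampling Z from it; the layer sizes are made up, and the z = µ + σ·ε sampling step is the standard reparameterization trick, which the slide does not spell out:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # amortized posterior Q_phi(Z|X) = N(mu_phi(X), sigma_phi(X)^2)
    def __init__(self, x_dim=784, z_dim=20, hidden=400):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.log_var = nn.Linear(hidden, z_dim)   # predict log sigma^2 for numerical stability

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

def sample_z(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I), so gradients flow through mu and sigma
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps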
Concept of VAE
DKL(Qφ(Z|X) || Pθ(Z|X)) = EZ∼Q|X[log Qφ(Z|X) − log Pθ(Z|X)]
= EZ∼Q|X[log Qφ(Z|X) − log Pθ(X, Z)] + log Pθ(X)
We want to maximize log Pθ(X) and minimize
DKL(Qφ(Z|X)||Pθ(Z|X)) at once.
Define L(θ, φ, X) = EZ∼Q|X [log Pθ(X, Z) − log Qφ(Z|X)]
log Pθ(X) − DKL(Qφ(Z|X)||Pθ(Z|X)) = L(θ, φ, X)
Concept of VAE
ELBO
L(θ, φ, X) = EZ∼Q|X [log Pθ(X, Z) − log Qφ(Z|X)]
= EZ∼Q|X [log Pθ(X|Z) + log Pθ(Z) − log Qφ(Z|X)]
= EZ∼Q|X [log Pθ(X|Z)] − DKL(Qφ(Z|X)||Pθ(Z))
DKL(Qφ(Z|X) || Pθ(Z)) can be integrated analytically:
DKL(Qφ(Z|X) || Pθ(Z)) = −(1/2)(1 + log σφ(X)² − µφ(X)² − σφ(X)²)
EZ∼Q|X[log Pθ(X|Z)] requires estimation by sampling:
EZ∼Q|X[log Pθ(X|Z)] ≈ (1/n) Σ_{i=1}^n log Pθ(X|zi) = (1/n) Σ_{i=1}^n [ −(X − f(zi; θ))²/(2σ²) − log(√(2πσ²)) ]
Concept of VAE
ELBO
Maximizing L(θ, φ, X) is equal to minimizing (up to additive constants)
(1/n) Σ_{i=1}^n (X − f(zi; θ))²/(2σ²) − (1/2)(1 + log σφ(X)² − µφ(X)² − σφ(X)²)
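Putting the two terms together gives the usual VAE training loss (the negative ELBO). A hedged PyTorch sketch, assuming a Gaussian decoder with fixed σ as on the slide, inputs of shape (batch, features), a single sample from Qφ per data point, and a placeholder decoder network standing in for f(z; θ):

import torch

def vae_loss(x, decoder, mu, log_var, sigma=0.1):
    # negative ELBO: reconstruction term + KL(Q_phi(Z|X) || N(0, I))
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # one sample from Q_phi(Z|X)
    x_hat = decoder(z)                                         # f(z; theta)

    # -log P_theta(X|Z) up to an additive constant, for X|Z ~ N(f(z; theta), sigma^2 I)
    recon = ((x - x_hat) ** 2).sum(dim=1) / (2 * sigma ** 2)

    # D_KL(N(mu, sigma_phi^2) || N(0, I)) = -1/2 * sum(1 + log sigma_phi^2 - mu^2 - sigma_phi^2)
    kl = -0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum(dim=1)

    return (recon + kl).mean()

Minimizing this quantity with respect to θ and φ maximizes the ELBO L(θ, φ, X).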
Concept of VAE
Problem of the above formulation
Since Pθ(Xi) ≈ (1/n) Σ_{j=1}^n P(Xi|zj) P(zj) and we use n = 1,
log Pθ(Xi) ≈ log[P(Xi|z1) P(z1)]
= log P(Xi|z1) + log P(z1)
= log[(1/(√(2π)σ)) exp(−(Xi − f(z1; θ))²/(2σ²))] + log[(1/√(2π)) exp(−z1²/2)]
= −(Xi − f(z1; θ))²/(2σ²) + const.
Therefore, maximizing log Pθ(Xi) reduces to minimizing (Xi − f(z1; θ))²/(2σ²).
Concept of VAE
Problem of the above formulation
To address this problem, we should set σ very small
Concept of GANs
Introduction
Goal: estimate the population distribution using the given observations.
Strong assumption on the existence of latent variables: Z ∼ PZ.
Define G(z; θg), a mapping from the latent space to the data space, and let Pg be the induced distribution, Pg(X = x) = PZ(G(Z) = x).
Define D(x; θd), which represents the probability that x is real.
min_G max_D V(D, G) = Ex∼Pemp[log D(x)] + Ez∼PZ[log(1 − D(G(z)))]
What is the difference between VAE and GANs?
⇒ GANs do not model P(X) explicitly.
⇒ But we can show that the objective has a global optimum at Pg = Pemp.
⇒ So GANs can still be regarded as generative models.
Concept of GANs
Algorithm
V(D, G) = Ex∼Pemp[log D(x)] + Ez∼PZ[log(1 − D(G(z)))]
≈ (1/m) Σ_{i=1}^m log D(xi) + (1/m) Σ_{j=1}^m log(1 − D(G(zj)))
1. Sample a minibatch of m noise samples {z1, ..., zm} and a minibatch of m examples {x1, ..., xm}.
2. Update the discriminator by ascending its stochastic gradient:
∇θd (1/m) Σ_{i=1}^m [log D(xi) + log(1 − D(G(zi)))]
3. Sample a minibatch of m noise samples {z1, ..., zm}.
4. Update the generator by descending its stochastic gradient:
∇θg (1/m) Σ_{i=1}^m log(1 − D(G(zi)))
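A hedged PyTorch sketch of one iteration of this algorithm; the generator G, the discriminator D (assumed to end with a sigmoid so its output is a probability), the two optimizers, and the latent dimension are placeholders supplied by the caller:

import torch

def gan_step(G, D, opt_g, opt_d, real_x, z_dim=100):
    m = real_x.size(0)

    # steps 1-2: update D by ascending (1/m) sum [log D(x_i) + log(1 - D(G(z_i)))];
    # ascending the gradient is implemented as minimizing the negative
    z = torch.randn(m, z_dim)
    d_loss = -(torch.log(D(real_x)) + torch.log(1 - D(G(z).detach()))).mean()
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # steps 3-4: update G by descending (1/m) sum log(1 - D(G(z_i)))
    z = torch.randn(m, z_dim)
    g_loss = torch.log(1 - D(G(z))).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    return d_loss.item(), g_loss.item()

In practice the generator is often trained by ascending log D(G(z)) instead, since log(1 − D(G(z))) saturates early in training; the sketch above follows the slide's version.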
Concept of GANs
Global optimality of Pg = Pemp
Proposition 1
For G fixed, the optimal discriminator D is
D*_G(x) = Pemp(x) / (Pemp(x) + Pg(x))
Proposition 2
The global minimum of the virtual training criterion is achieved if and only
if Pg = Pemp.
Proof of Proposition 1:
Let the generator G be fixed and define Ax = {z ∈ Z : G(z) = x}.
V(G, Dθ) = ∫_{x∈X} log(Dθ(x)) Pemp(x) dx + ∫_{z∈Z} log(1 − Dθ(G(z))) PZ(z) dz
= ∫_{x∈X} log(Dθ(x)) Pemp(x) dx + ∫_{x∈X} ∫_{z∈Ax} log(1 − Dθ(G(z))) PZ(z) dz dx
= ∫_{x∈X} log(Dθ(x)) Pemp(x) dx + ∫_{x∈X} log(1 − Dθ(x)) ∫_{z∈Ax} PZ(z) dz dx
= ∫_{x∈X} log(Dθ(x)) Pemp(x) dx + ∫_{x∈X} log(1 − Dθ(x)) Pg(x) dx
= ∫_{x∈X} [log(Dθ(x)) Pemp(x) + log(1 − Dθ(x)) Pg(x)] dx
Proof of Proposition 1 (continued):
V(G, Dθ) achieves its maximum (for fixed G) when
∂/∂θ [log(Dθ(x)) Pemp(x) + log(1 − Dθ(x)) Pg(x)] = 0 ∀x ∈ X
⇔ (∂Dθ(x)/∂θ) / Dθ(x) · Pemp(x) − (∂Dθ(x)/∂θ) / (1 − Dθ(x)) · Pg(x) = 0 ∀x ∈ X
⇔ Dθ̂(x) = Pemp(x) / (Pemp(x) + Pg(x)) ∀x ∈ X
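The pointwise claim is easy to check numerically: for fixed values a = Pemp(x) and b = Pg(x), the integrand a·log D + b·log(1 − D) is maximized at D = a/(a + b). A small sketch with made-up values of a and b:

import numpy as np

a, b = 0.3, 0.1   # P_emp(x) and P_g(x) at some fixed x (made-up values)
d = np.linspace(1e-4, 1 - 1e-4, 100000)
objective = a * np.log(d) + b * np.log(1 - d)

print(d[np.argmax(objective)])   # ~0.75
print(a / (a + b))               # 0.75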
Proof of Proposition 2:
min_G max_D V(G, D) = min_G V(G, D*_G)
= Ex∼Pemp[log D*_G(x)] + Ez∼PZ[log(1 − D*_G(G(z)))]
= Ex∼Pemp[log D*_G(x)] + Ex∼Pg[log(1 − D*_G(x))]
= Ex∼Pemp[log(Pemp(x) / (Pemp(x) + Pg(x)))] + Ex∼Pg[log(Pg(x) / (Pemp(x) + Pg(x)))]
= Ex∼Pemp[log Pemp(x) − log((Pemp(x) + Pg(x))/2)] − log 2 + Ex∼Pg[log Pg(x) − log((Pemp(x) + Pg(x))/2)] − log 2
= DKL(Pemp || (Pemp + Pg)/2) + DKL(Pg || (Pemp + Pg)/2) − 2 log 2
≥ −2 log 2
The equality holds if and only if Pemp = (Pemp + Pg)/2 and Pg = (Pemp + Pg)/2, i.e. if and only if Pg = Pemp.
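As a sanity check of the final bound, the value of V at the optimal discriminator equals −2 log 2 ≈ −1.386 exactly when Pg = Pemp and is strictly larger otherwise. A small sketch for discrete distributions (the distributions below are made up):

import numpy as np

def v_at_optimal_d(p_emp, p_g):
    m = (p_emp + p_g) / 2
    # E_{x~Pemp}[log(Pemp/(Pemp+Pg))] + E_{x~Pg}[log(Pg/(Pemp+Pg))]
    return float(np.sum(p_emp * np.log(p_emp / (2 * m))) + np.sum(p_g * np.log(p_g / (2 * m))))

p_emp = np.array([0.5, 0.3, 0.2])
print(v_at_optimal_d(p_emp, p_emp))                      # -2 log 2 ~ -1.386
print(v_at_optimal_d(p_emp, np.array([0.2, 0.3, 0.5])))  # larger than -1.386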
Thank you