SCPARK
From Diffusion Models to DALL·E 2
STE @


- GSEP: Music Source Separation


- GTS: Music & Lyrics Synchronization


- ? : Sound Generative Models
박수철 @


- Text-To-Speech


- Voice Cloning


- Voice Conversion
박수철 @


- Diffusion/Score-based models


- Speech recognition and speech synthesis


- All about Tacotron


- Deep generative models
GSEP, GTS demo
JTBC election-night broadcast
From Diffusion Models to DALL·E 2
- Diffusion Model
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
- DDPM
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
- CLIP
Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.
- GLIDE
Nichol, Alex, et al. "Glide: Towards photorealistic image generation and editing with text-guided diffusion models." arXiv preprint arXiv:2112.10741 (2021).
- DALL·E 2
Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with clip latents." arXiv preprint arXiv:2204.06125 (2022).
- Guided Diffusion Sampling
Dhariwal, Prafulla, and Alexander Nichol. "Diffusion models beat gans on image synthesis." Advances in Neural Information Processing Systems 34 (2021): 8780-8794.
- Classifier-free diffusion guidance
Ho, Jonathan, and Tim Salimans. "Classifier-free diffusion guidance." NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications. 2021.
Generative Model
Generative Model
- A generative model learns the probability distribution of a dataset and samples from it.

Auto-Regressive Model
p_\theta(x) = \prod_{i=1}^{n^2} p_\theta\left(x_i \mid x_1, \ldots, x_{i-1}\right)

Van den Oord, Aaron, et al. "Conditional image generation with PixelCNN decoders." Advances in Neural Information Processing Systems 29 (2016).

Variational Auto-Encoder
p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz

https://en.wikipedia.org/wiki/Variational_autoencoder
Generative Model
- A generative model learns the probability distribution of a dataset and samples from it.

Flow-based Model
p_\theta(x) = p_\theta(z)\,\left|\det\!\left(\frac{dz}{dx}\right)\right|

Lil'Log, Flow-based Deep Generative Models: https://lilianweng.github.io/posts/2018-10-13-flow-models/

Generative Adversarial Networks
\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

Goodfellow, Ian, et al. "Generative adversarial nets." Advances in Neural Information Processing Systems 27 (2014).
Diffusion Model


DDPM
Diffusion Model
- Proposed by Sohl-Dickstein, Jascha, et al. in the paper "Deep unsupervised learning using nonequilibrium thermodynamics"
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
Diffusion: https://en.wikipedia.org/wiki/Diffusion (Flipped)
Diffusion Model
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
Forward Process
q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}, x_0) := \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)

Posterior
q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)
where \tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t and \tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t

Backward Process (Neural Networks)
p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)

Loss Function
D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t)\right)
Diffusion Model - Forward Process
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
Forward Process
x_{t-1} → x_t

Distribution of x_t at an arbitrary timestep t in closed form:
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right), where \alpha_t := 1-\beta_t and \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s

See Lil'Log for the derivation: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}, x_0) := \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)
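Because q(x_t | x_0) is available in closed form, a noisy training example at any timestep can be drawn in a single step rather than by iterating the chain. Below is a minimal PyTorch sketch of this; the linear beta schedule values and the function name are illustrative assumptions, not taken from any specific implementation.

```python
import torch

# Linear beta schedule (illustrative values); alpha_t := 1 - beta_t, alpha_bar_t := prod_{s<=t} alpha_s
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, noise=None):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I) in one shot."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over non-batch dims
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# Usage: jump straight to t = 500 for a batch of stand-in images
x0 = torch.randn(8, 3, 32, 32)
t = torch.full((8,), 500, dtype=torch.long)
xt = q_sample(x0, t)
```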
Diffusion Model - Forward Process
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
MNIST single data, β = 0.2, T = 10
Swiss roll dataset, β = 0.05, T = 10
Diffusion Model - Posterior
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
Posterior
q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)
where \tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t and \tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t

By Bayes' rule:
q(x_{t-1} \mid x_t, x_0) = \frac{q(x_{t-1} \mid x_0)\, q(x_t \mid x_{t-1}, x_0)}{q(x_t \mid x_0)}

Forward Process
q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}, x_0) := \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)
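The step between Bayes' rule and the quoted mean and variance can be filled in with a short completion-of-the-square argument, sketched here for the scalar case (the multivariate case is elementwise, since every covariance is a multiple of I): both numerator factors are Gaussian in x_{t-1}, so the posterior is Gaussian, and matching the quadratic and linear terms recovers exactly the μ̃_t and β̃_t above.

```latex
\begin{aligned}
q(x_{t-1}\mid x_t, x_0)
 &\propto q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1}\mid x_0) \\
 &\propto \exp\!\left(-\tfrac{1}{2}\left[
      \tfrac{\left(x_t-\sqrt{\alpha_t}\, x_{t-1}\right)^2}{\beta_t}
    + \tfrac{\left(x_{t-1}-\sqrt{\bar\alpha_{t-1}}\, x_0\right)^2}{1-\bar\alpha_{t-1}}
    \right]\right).
\end{aligned}
% Completing the square in x_{t-1}:
\frac{1}{\tilde\beta_t} = \frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar\alpha_{t-1}}
 = \frac{1-\bar\alpha_t}{(1-\bar\alpha_{t-1})\,\beta_t}
 \;\Longrightarrow\;
 \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t,
\qquad
\tilde\mu_t = \tilde\beta_t\!\left(\frac{\sqrt{\alpha_t}}{\beta_t}\, x_t
   + \frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}}\, x_0\right)
 = \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\, x_t
   + \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\, x_0 .
```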
Diffusion Model - Backward Process
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
Forward Process
q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}, x_0) := \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)

Posterior
q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)
where \tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t and \tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t

Backward Process (Neural Networks)
p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
A U-net takes x_t and the timestep t as input and outputs \mu_\theta(x_t, t) and \Sigma_\theta(x_t, t).

Loss Function
D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t)\right)
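To make the backward process concrete, here is a minimal sketch of one ancestral sampling step. It assumes a hypothetical network `model(x_t, t)` that returns the predicted mean and a diagonal log-variance; it is not tied to any particular repository.

```python
import torch

@torch.no_grad()
def p_sample(model, xt, t):
    """One backward step: draw x_{t-1} ~ N(mu_theta(x_t, t), Sigma_theta(x_t, t))."""
    mean, log_var = model(xt, t)      # assumed: network returns predicted mean and diagonal log-variance
    if int(t[0]) == 0:                # no noise is added when producing x_0
        return mean
    noise = torch.randn_like(xt)
    return mean + (0.5 * log_var).exp() * noise

# Usage sketch: start from pure noise x_T and walk the chain down to x_0
# x = torch.randn(8, 3, 32, 32)
# for step in reversed(range(T)):
#     t = torch.full((8,), step, dtype=torch.long)
#     x = p_sample(model, x, t)
```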
Diffusion Model - Loss Function
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
Forward Process, Posterior, and Backward Process as defined above.

Loss Function
D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t)\right)
Diffusion Model - Output Samples
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
DDPM (Denoising Diffusion Probabilistic Models)
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
- Proposed by Jonathan Ho et al. in the paper "Denoising diffusion probabilistic models"

Distribution of x_t at an arbitrary timestep t in closed form:
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right), where \alpha_t := 1-\beta_t and \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s

x_t is a linear combination of x_0 and ϵ, where ϵ ∼ \mathcal{N}(0, I):
x_t(x_0, \epsilon) = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon

Loss Function
L_{\text{simple}}(\theta) := \mathbb{E}_{t, x_0, \epsilon}\!\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t\right) \right\|^2\right]

Posterior q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)
Predict ϵ (or x_0) at each step: generate ϵ, sample x_t, predict ϵ.

https://github.com/rosinality/denoising-diffusion-pytorch/blob/master/diffusion.py
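The "generate ϵ, sample x_t, predict ϵ" recipe above is essentially the whole training loop. Below is a minimal sketch of one L_simple step in PyTorch, in the spirit of the linked rosinality implementation but not copied from it; the `eps_model` U-net and the `alphas_bar` schedule are assumed to be defined as earlier.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, alphas_bar):
    """L_simple: MSE between the true noise and the noise predicted from x_t."""
    b = x0.shape[0]
    t = torch.randint(0, alphas_bar.shape[0], (b,), device=x0.device)   # random timesteps
    eps = torch.randn_like(x0)                                          # generate epsilon
    a_bar = alphas_bar[t].view(b, *([1] * (x0.dim() - 1)))
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps                 # sample x_t in closed form
    return F.mse_loss(eps_model(xt, t), eps)                            # predict epsilon, compare

# Usage sketch:
# loss = ddpm_loss(eps_model, batch, alphas_bar)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```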
DDPM - Output Samples
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
DDPM - Output Samples
https://github.com/yang-song/score_sde_pytorch/
Song, Yang, et al. "Score-based generative modeling through stochastic differential equations." arXiv preprint arXiv:2011.13456 (2020).
Diffusion vs. ...
"We emphasize that our objective Eq. (6) requires no adversarial training, no surrogate losses, and
nosampling from the score network during training (e.g., unlike contrastive divergence). Also, it does not
require sθ(x, σ) to have special architectures in order to be tractable."


Song, Yang, and Stefano Ermon. "Generative modeling by estimating gradients of the data distribution." Advances in Neural Information Processing Systems 32
(2019).
"We present a novel way to define probabilistic models that allows:


1. extreme flexibility in model structure,
2. exact sampling,
3. easy multiplication with other distributions, e.g. in order to compute a posterior, and
4. the model log likelihood, and the probability of individual states, to be cheaply evaluated."
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
Diffusion vs. ...
Comparison of model families in terms of tractability and flexibility:
- Tractability: Auto-Regressive: Good; VAE: Good; Flow: Good; GAN: Not Good (likelihood can't be evaluated); Diffusion: Good
- Flexibility: Auto-Regressive: Not Good (causal structure, fixed distribution); VAE: Not Good (dimension reduction, fixed distribution); Flow: Not Good (invertible structure, fixed distribution); GAN: Good; Diffusion: Good
GLIDE


CLIP


DALL·E 2
GLIDE - Output Samples
Nichol, Alex, et al. "Glide: Towards photorealistic image generation and editing with text-guided diffusion models." arXiv preprint arXiv:2112.10741 (2021).
- Proposed by OpenAI in the paper Glide: Towards photorealistic image generation and editing with text-guided diffusion models
GLIDE - Overall Architecture
Nichol, Alex, et al. "Glide: Towards photorealistic image generation and editing with text-guided diffusion models." arXiv preprint arXiv:2112.10741 (2021).
Diagram: the text prompt ("A hedgehog using a calculator") is passed through a Text Encoder to produce an encoding sequence of shape (Batch, Channel, Length). A U-net built from ResBlock and Attention layers (Down layers, a Mid block, and Up layers) takes x_t as input and predicts ϵ; the text encoding conditions each U-net layer via AdaIN or addition and via attention. Sampling runs through the chain x_T → ⋯ → x_{t+1} → x_t → x_{t-1} → ⋯ → x_0.
GLIDE source: https://github.com/openai/glide-text2im
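As a rough illustration of the "AdaIN or Add" plus "Attention" conditioning shown in the diagram, here is a hypothetical U-net block in PyTorch: a pooled text embedding produces a scale/shift for the feature map, and an attention layer attends over the full text encoding sequence. All module names and dimensions are assumptions for illustration; this is not the GLIDE code.

```python
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    """Hypothetical U-net block: a pooled text embedding modulates the feature map
    (the 'AdaIN or Add' path), and an attention layer attends over the full text
    encoding sequence (the 'Attention' path). Not the GLIDE implementation."""
    def __init__(self, channels, text_dim):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_scale_shift = nn.Linear(text_dim, 2 * channels)
        self.attn = nn.MultiheadAttention(channels, num_heads=4,
                                          kdim=text_dim, vdim=text_dim, batch_first=True)

    def forward(self, h, text_pooled, text_seq):
        h = self.conv(self.norm(h))
        scale, shift = self.to_scale_shift(text_pooled).chunk(2, dim=-1)
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]   # scale/shift from text
        b, c, height, width = h.shape
        q = h.flatten(2).transpose(1, 2)                 # image positions as queries
        out, _ = self.attn(q, text_seq, text_seq)        # attend over the text encoding sequence
        return h + out.transpose(1, 2).reshape(b, c, height, width)

# Usage sketch:
# block = ConditionedBlock(64, 512)
# h = block(features, pooled_text_emb, text_encoding_seq)
```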
CLIP (Contrastive Language-Image Pre-training)
- Trains a text encoder and an image encoder via self-supervised learning on (image, text) pair data
Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.
CLIP (Contrastive Language-Image Pre-training)
https://openai.com/blog/clip/
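A minimal sketch of the symmetric contrastive objective CLIP is trained with: embed a batch of images and texts, normalize, compute all pairwise similarities, and treat matching pairs as the correct class in both directions. The encoder calls and the fixed temperature are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (N, N) pairwise similarities
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i = F.cross_entropy(logits, targets)                # image -> matching text
    loss_t = F.cross_entropy(logits.t(), targets)            # text -> matching image
    return (loss_i + loss_t) / 2

# Usage sketch:
# img = image_encoder(images)   # (N, D)
# txt = text_encoder(tokens)    # (N, D)
# loss = clip_loss(img, txt)
```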
DALL·E 2 - Output Samples
- Proposed by OpenAI in the paper Hierarchical text-conditional image generation with clip latents
Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with clip latents." arXiv preprint arXiv:2204.06125 (2022).
DALL·E 2 - Overall Architecture
Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with clip latents." arXiv preprint arXiv:2204.06125 (2022).
DALL·E 2 - Prior
- Generates a CLIP image embedding from the GLIDE text encoding and the CLIP text embedding
Diagram: the prior's diffusion backbone is a Transformer decoder. Its input sequence consists of the text encoding sequence (Batch, Channel, Length), the CLIP text embedding, an embedding of the diffusion timestep, and the noised CLIP image embedding, followed by a final embedding whose output is used to predict the unnoised CLIP image embedding.
Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with clip latents." arXiv preprint arXiv:2204.06125 (2022).
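A rough sketch of how the prior's input sequence could be assembled, following the ordering in the diagram. Module choices (a plain TransformerEncoder standing in for the causally masked Transformer decoder) and dimensions are illustrative assumptions, not the DALL·E 2 implementation.

```python
import torch
import torch.nn as nn

class DiffusionPrior(nn.Module):
    """Sketch of the prior's input sequence: text encoding sequence, CLIP text
    embedding, diffusion timestep, noised CLIP image embedding, and a learned final
    token whose output predicts the unnoised CLIP image embedding."""
    def __init__(self, dim, depth=6, num_timesteps=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)   # causal mask omitted for brevity
        self.time_emb = nn.Embedding(num_timesteps, dim)
        self.final_token = nn.Parameter(torch.randn(dim))

    def forward(self, text_seq, clip_text_emb, t, noised_image_emb):
        b = text_seq.shape[0]
        tokens = torch.cat([
            text_seq,                           # (B, L, dim) text encoding sequence
            clip_text_emb[:, None],             # (B, 1, dim) CLIP text embedding
            self.time_emb(t)[:, None],          # (B, 1, dim) diffusion timestep embedding
            noised_image_emb[:, None],          # (B, 1, dim) noised CLIP image embedding
            self.final_token.expand(b, 1, -1),  # (B, 1, dim) final embedding
        ], dim=1)
        out = self.backbone(tokens)
        return out[:, -1]                       # predicted (unnoised) CLIP image embedding
```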
Guided Diffusion Sampling
Classifier-free Diffusion Guidance


CLIP Guidance
Guided Diffusion Sampling
- Proposed in the paper Diffusion Models Beat GANs on Image Synthesis
- In addition to the diffusion model, a separate image classifier is trained; during sampling, gradients from the classifier are used to guide the samples
Dhariwal, Prafulla, and Alexander Nichol. "Diffusion models beat gans on image synthesis." Advances in Neural Information Processing Systems 34 (2021): 8780-8794.

\hat{\mu}_\theta(x_t \mid y) = \mu_\theta(x_t \mid y) + s \cdot \Sigma_\theta(x_t \mid y)\, \nabla_{x_t} \log p_\phi(y \mid x_t)
(modified mean = original mean + guidance scale · original covariance · gradient from the classifier)

Posterior q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)

Nichol, Alex, et al. "Glide: Towards photorealistic image generation and editing with text-guided diffusion models." arXiv preprint arXiv:2112.10741 (2021). (equation as quoted in Nichol, Alex, et al.)
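A minimal sketch of how the classifier gradient could be folded into a sampling step, assuming a hypothetical noisy-image classifier `classifier(x_t, t)` that returns logits; `mu` and `sigma` stand for the mean and (diagonal) covariance predicted by the diffusion model for the current step.

```python
import torch

def classifier_guided_mean(mu, sigma, xt, t, y, classifier, s=1.0):
    """Classifier guidance: mu_hat = mu + s * Sigma * grad_{x_t} log p_phi(y | x_t)."""
    xt = xt.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(xt, t), dim=-1)     # assumed noisy-image classifier
    selected = log_probs[torch.arange(xt.shape[0]), y].sum()     # log p_phi(y | x_t) per sample
    grad = torch.autograd.grad(selected, xt)[0]
    return mu + s * sigma * grad

# Usage sketch (inside the sampling loop, before drawing x_{t-1}):
# mu_hat = classifier_guided_mean(mu, sigma, xt, t, labels, classifier, s=2.0)
# x_prev = mu_hat + sigma.sqrt() * torch.randn_like(xt)
```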
Guided Diffusion Sampling
Dhariwal, Prafulla, and Alexander Nichol. "Diffusion models beat gans on image synthesis." Advances in Neural Information Processing Systems 34 (2021): 8780-8794.
Classifier-free guidance
- Proposed by Ho, Jonathan, and Tim Salimans in the paper Classifier-free diffusion guidance
- Enables guided sampling with the diffusion model alone, without training an additional classifier
Ho, Jonathan, and Tim Salimans. "Classifier-free diffusion guidance." NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications. 2021.
Nichol, Alex, et al. "Glide: Towards photorealistic image generation and editing with text-guided diffusion models." arXiv preprint arXiv:2112.10741 (2021). (equation as quoted in Nichol, Alex, et al.)

\hat{\epsilon}_\theta(x_t \mid y) = \epsilon_\theta(x_t \mid y) + s \cdot \left(\epsilon_\theta(x_t \mid y) - \epsilon_\theta(x_t \mid \emptyset)\right)
(modified score = conditional predicted score + guidance scale · (conditional predicted score − unconditional predicted score))

\epsilon_\theta(x_t \mid y) - \epsilon_\theta(x_t \mid \emptyset) \approx -\sigma_t\, \nabla_{x_t} \log p^{i}(y \mid x_t)

The diffusion backbone (U-net) takes x_t, t, and either y or ∅ as input and outputs \epsilon_\theta(x_t \mid y) or \epsilon_\theta(x_t \mid \emptyset).
Classifier-free guidance
Ho, Jonathan, and Tim Salimans. "Classifier-free diffusion guidance." NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications. 2021.
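A minimal sketch of the classifier-free combination at sampling time, assuming an `eps_model(x_t, t, cond)` that was trained with the condition randomly replaced by a null token, so the same network can also predict the unconditional score; the two forward passes are often batched together in practice.

```python
import torch

@torch.no_grad()
def cfg_eps(eps_model, xt, t, y, null_cond, s=3.0):
    """Classifier-free guidance:
    eps_hat = eps(x_t | y) + s * (eps(x_t | y) - eps(x_t | null))."""
    eps_cond = eps_model(xt, t, y)            # conditional predicted score
    eps_uncond = eps_model(xt, t, null_cond)  # unconditional predicted score
    return eps_cond + s * (eps_cond - eps_uncond)
```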
CLIP Guidance
- Similar to classifier guidance, but a CLIP model is used to guide the sampling steps
Nichol, Alex, et al. "Glide: Towards photorealistic image generation and editing with text-guided diffusion models." arXiv preprint arXiv:2112.10741 (2021).

\hat{\mu}_\theta(x_t \mid c) = \mu_\theta(x_t \mid c) + s \cdot \Sigma_\theta(x_t \mid c)\, \nabla_{x_t}\!\left(f(x_t) \cdot g(c)\right)
where f(x_t) is the CLIP image encoding and g(c) is the CLIP text encoding.
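A minimal sketch of the CLIP-guided mean shift, analogous to the classifier-guided version above; `clip_image_encoder` stands in for a CLIP image encoder (GLIDE trains one on noised images), and `text_emb` is the precomputed g(c).

```python
import torch
import torch.nn.functional as F

def clip_guided_mean(mu, sigma, xt, text_emb, clip_image_encoder, s=1.0):
    """CLIP guidance: mu_hat = mu + s * Sigma * grad_{x_t} ( f(x_t) . g(c) )."""
    xt = xt.detach().requires_grad_(True)
    img_emb = F.normalize(clip_image_encoder(xt), dim=-1)        # f(x_t)
    sim = (img_emb * F.normalize(text_emb, dim=-1)).sum()        # f(x_t) . g(c), summed over the batch
    grad = torch.autograd.grad(sim, xt)[0]
    return mu + s * sigma * grad
```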
Classifier-free Guidance vs. CLIP Guidance in GLIDE
Nichol, Alex, et al. "Glide: Towards photorealistic image generation and editing with text-guided diffusion models." arXiv preprint arXiv:2112.10741 (2021).
Thank you :)
