SCPARK
From Diffusion Models to DALL·E 2
STE @


- GSEP: Music Source Separation


- GTS: Music & Lyrics Synchronization


- ? : Sound Generative Models
박수철 @


- Text-To-Speech


- Voice Cloning


- Voice Conversion
박수철 @


- Diffusion/Score-based models


- Speech recognition and speech synthesis


- All about Tacotron


- Deep generative models
GSEP, GTS demo
JTBC election-night broadcast
From Diffusion Models to DALL·E 2
- Diffusion Model
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
- DDPM
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
- CLIP
Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.
- GLIDE
Nichol, Alex, et al. "Glide: Towards photorealistic image generation and editing with text-guided diffusion models." arXiv preprint arXiv:2112.10741 (2021).
- DALL·E 2
Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with clip latents." arXiv preprint arXiv:2204.06125 (2022).
- Guided Diffusion Sampling
Dhariwal, Prafulla, and Alexander Nichol. "Diffusion models beat gans on image synthesis." Advances in Neural Information Processing Systems 34 (2021): 8780-8794.
- Classifier-free diffusion guidance
Ho, Jonathan, and Tim Salimans. "Classifier-free diffusion guidance." NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications. 2021.
Generative Model
Generative Model
- A generative model learns the probability distribution of a dataset and samples from it.

Auto-Regressive Model
p_\theta(x) = \prod_{i=1}^{n^2} p_\theta\left(x_i \mid x_1, \ldots, x_{i-1}\right)

Van den Oord, Aaron, et al. "Conditional image generation with PixelCNN decoders." Advances in Neural Information Processing Systems 29 (2016).

Variational Auto-Encoder
p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz

https://en.wikipedia.org/wiki/Variational_autoencoder
Generative Model
- A generative model learns the probability distribution of a dataset and samples from it.

Flow-based Model
p_\theta(x) = p_\theta(z)\,\left|\det\!\left(\frac{dz}{dx}\right)\right|

Lil'Log, Flow-based Deep Generative Models: https://lilianweng.github.io/posts/2018-10-13-flow-models/

Generative Adversarial Networks
\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

Goodfellow, Ian, et al. "Generative adversarial nets." Advances in Neural Information Processing Systems 27 (2014).
Diffusion Model


DDPM
Diffusion Model
- Proposed by Sohl-Dickstein, Jascha, et al. in the paper "Deep unsupervised learning using nonequilibrium thermodynamics"
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
Diffusion: https://en.wikipedia.org/wiki/Diffusion (Flipped)
Diffusion Model
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
Forward Process
q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}, x_0) := \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)

Posterior
q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)
where \tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t and \tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t

Backward Process (Neural Networks)
p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)

Loss Function
D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t)\right)
Diffusion Model - Forward Process
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
Forward Process
x_{t-1} → x_t

Distribution of x_t at an arbitrary timestep t in closed form:
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right), where \alpha_t := 1-\beta_t and \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s

See Lil'Log for the derivation: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}, x_0) := \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)
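Because q(x_t | x_0) is available in closed form, a noisy training example at any timestep can be drawn in a single step rather than by iterating the chain. Below is a minimal PyTorch sketch of this; the linear beta schedule values and the function name are illustrative assumptions, not taken from any specific implementation.

```python
import torch

# Linear beta schedule (illustrative values); alpha_t := 1 - beta_t, alpha_bar_t := prod_{s<=t} alpha_s
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, noise=None):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I) in one shot."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over non-batch dims
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# Usage: jump straight to t = 500 for a batch of stand-in images
x0 = torch.randn(8, 3, 32, 32)
t = torch.full((8,), 500, dtype=torch.long)
xt = q_sample(x0, t)
```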
Diffusion Model - Forward Process
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
MNIST single data, β = 0.2, T = 10
Swiss roll dataset, β = 0.05, T = 10
Diffusion Model - Posterior
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
Posterior
q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)
where \tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t and \tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t

By Bayes' rule:
q(x_{t-1} \mid x_t, x_0) = \frac{q(x_{t-1} \mid x_0)\, q(x_t \mid x_{t-1}, x_0)}{q(x_t \mid x_0)}

Forward Process
q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}, x_0) := \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)
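The step between Bayes' rule and the quoted mean and variance can be filled in with a short completion-of-the-square argument, sketched here for the scalar case (the multivariate case is elementwise, since every covariance is a multiple of I): both numerator factors are Gaussian in x_{t-1}, so the posterior is Gaussian, and matching the quadratic and linear terms recovers exactly the μ̃_t and β̃_t above.

```latex
\begin{aligned}
q(x_{t-1}\mid x_t, x_0)
 &\propto q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1}\mid x_0) \\
 &\propto \exp\!\left(-\tfrac{1}{2}\left[
      \tfrac{\left(x_t-\sqrt{\alpha_t}\, x_{t-1}\right)^2}{\beta_t}
    + \tfrac{\left(x_{t-1}-\sqrt{\bar\alpha_{t-1}}\, x_0\right)^2}{1-\bar\alpha_{t-1}}
    \right]\right).
\end{aligned}
% Completing the square in x_{t-1}:
\frac{1}{\tilde\beta_t} = \frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar\alpha_{t-1}}
 = \frac{1-\bar\alpha_t}{(1-\bar\alpha_{t-1})\,\beta_t}
 \;\Longrightarrow\;
 \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t,
\qquad
\tilde\mu_t = \tilde\beta_t\!\left(\frac{\sqrt{\alpha_t}}{\beta_t}\, x_t
   + \frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}}\, x_0\right)
 = \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\, x_t
   + \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\, x_0 .
```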
Diffusion Model - Backward Process
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
Forward Process
q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}, x_0) := \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)

Posterior
q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)
where \tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t and \tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t

Backward Process (Neural Networks)
p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
A U-net takes x_t and the timestep t as input and outputs \mu_\theta(x_t, t) and \Sigma_\theta(x_t, t).

Loss Function
D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t)\right)
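To make the backward process concrete, here is a minimal sketch of one ancestral sampling step. It assumes a hypothetical network `model(x_t, t)` that returns the predicted mean and a diagonal log-variance; it is not tied to any particular repository.

```python
import torch

@torch.no_grad()
def p_sample(model, xt, t):
    """One backward step: draw x_{t-1} ~ N(mu_theta(x_t, t), Sigma_theta(x_t, t))."""
    mean, log_var = model(xt, t)      # assumed: network returns predicted mean and diagonal log-variance
    if int(t[0]) == 0:                # no noise is added when producing x_0
        return mean
    noise = torch.randn_like(xt)
    return mean + (0.5 * log_var).exp() * noise

# Usage sketch: start from pure noise x_T and walk the chain down to x_0
# x = torch.randn(8, 3, 32, 32)
# for step in reversed(range(T)):
#     t = torch.full((8,), step, dtype=torch.long)
#     x = p_sample(model, x, t)
```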
Diffusion Model - Loss Function
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
Forward Process, Posterior, and Backward Process as defined above.

Loss Function
D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t)\right)
Diffusion Model - Output Samples
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
DDPM (Denoising Diffusion Probabilistic Models)
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
- Proposed by Jonathan Ho et al. in the paper "Denoising diffusion probabilistic models"

Distribution of x_t at an arbitrary timestep t in closed form:
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right), where \alpha_t := 1-\beta_t and \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s

x_t is a linear combination of x_0 and ϵ, where ϵ ∼ \mathcal{N}(0, I):
x_t(x_0, \epsilon) = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon

Loss Function
L_{\text{simple}}(\theta) := \mathbb{E}_{t, x_0, \epsilon}\!\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t\right) \right\|^2\right]

Posterior q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)
Predict ϵ (or x_0) at each step: generate ϵ, sample x_t, predict ϵ.

https://github.com/rosinality/denoising-diffusion-pytorch/blob/master/diffusion.py
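The "generate ϵ, sample x_t, predict ϵ" recipe above is essentially the whole training loop. Below is a minimal sketch of one L_simple step in PyTorch, in the spirit of the linked rosinality implementation but not copied from it; the `eps_model` U-net and the `alphas_bar` schedule are assumed to be defined as earlier.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, alphas_bar):
    """L_simple: MSE between the true noise and the noise predicted from x_t."""
    b = x0.shape[0]
    t = torch.randint(0, alphas_bar.shape[0], (b,), device=x0.device)   # random timesteps
    eps = torch.randn_like(x0)                                          # generate epsilon
    a_bar = alphas_bar[t].view(b, *([1] * (x0.dim() - 1)))
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps                 # sample x_t in closed form
    return F.mse_loss(eps_model(xt, t), eps)                            # predict epsilon, compare

# Usage sketch:
# loss = ddpm_loss(eps_model, batch, alphas_bar)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```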
DDPM - Output Samples
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
DDPM - Output Samples
https://github.com/yang-song/score_sde_pytorch/
Song, Yang, et al. "Score-based generative modeling through stochastic differential equations." arXiv preprint arXiv:2011.13456 (2020).
Diffusion vs. ...
"We emphasize that our objective Eq. (6) requires no adversarial training, no surrogate losses, and
nosampling from the score network during training (e.g., unlike contrastive divergence). Also, it does not
require sθ(x, σ) to have special architectures in order to be tractable."


Song, Yang, and Stefano Ermon. "Generative modeling by estimating gradients of the data distribution." Advances in Neural Information Processing Systems 32
(2019).
"We present a novel way to define probabilistic models that allows:


1. extreme flexibility in model structure,
2. exact sampling,
3. easy multiplication with other distributions, e.g. in order to compute a posterior, and
4. the model log likelihood, and the probability of individual states, to be cheaply evaluated."
Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
Diffusion vs. ...
Comparison of model families in terms of tractability and flexibility:
- Tractability: Auto-Regressive: Good; VAE: Good; Flow: Good; GAN: Not Good (likelihood can't be evaluated); Diffusion: Good
- Flexibility: Auto-Regressive: Not Good (causal structure, fixed distribution); VAE: Not Good (dimension reduction, fixed distribution); Flow: Not Good (invertible structure, fixed distribution); GAN: Good; Diffusion: Good
GLIDE


CLIP


DALL·E 2
GLIDE - Output Samples
Nichol, Alex, et al. "Glide: Towards photorealistic image generation and editing with text-guided diffusion models." arXiv preprint arXiv:2112.10741 (2021).
- Proposed by OpenAI in the paper Glide: Towards photorealistic image generation and editing with text-guided diffusion models
GLIDE - Overall Architecture
Nichol, Alex, et al. "Glide: Towards photorealistic image generation and editing with text-guided diffusion models." arXiv preprint arXiv:2112.10741 (2021).
Diagram: the text prompt ("A hedgehog using a calculator") is passed through a Text Encoder to produce an encoding sequence of shape (Batch, Channel, Length). A U-net built from ResBlock and Attention layers (Down layers, a Mid block, and Up layers) takes x_t as input and predicts ϵ; the text encoding conditions each U-net layer via AdaIN or addition and via attention. Sampling runs through the chain x_T → ⋯ → x_{t+1} → x_t → x_{t-1} → ⋯ → x_0.
GLIDE source: https://github.com/openai/glide-text2im
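As a rough illustration of the "AdaIN or Add" plus "Attention" conditioning shown in the diagram, here is a hypothetical U-net block in PyTorch: a pooled text embedding produces a scale/shift for the feature map, and an attention layer attends over the full text encoding sequence. All module names and dimensions are assumptions for illustration; this is not the GLIDE code.

```python
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    """Hypothetical U-net block: a pooled text embedding modulates the feature map
    (the 'AdaIN or Add' path), and an attention layer attends over the full text
    encoding sequence (the 'Attention' path). Not the GLIDE implementation."""
    def __init__(self, channels, text_dim):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_scale_shift = nn.Linear(text_dim, 2 * channels)
        self.attn = nn.MultiheadAttention(channels, num_heads=4,
                                          kdim=text_dim, vdim=text_dim, batch_first=True)

    def forward(self, h, text_pooled, text_seq):
        h = self.conv(self.norm(h))
        scale, shift = self.to_scale_shift(text_pooled).chunk(2, dim=-1)
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]   # scale/shift from text
        b, c, height, width = h.shape
        q = h.flatten(2).transpose(1, 2)                 # image positions as queries
        out, _ = self.attn(q, text_seq, text_seq)        # attend over the text encoding sequence
        return h + out.transpose(1, 2).reshape(b, c, height, width)

# Usage sketch:
# block = ConditionedBlock(64, 512)
# h = block(features, pooled_text_emb, text_encoding_seq)
```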
CLIP (Contrastive Language-Image Pre-training)
- Trains a text encoder and an image encoder via self-supervised learning on (image, text) pair data
Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.
CLIP (Contrastive Language-Image Pre-training)
https://openai.com/blog/clip/
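A minimal sketch of the symmetric contrastive objective CLIP is trained with: embed a batch of images and texts, normalize, compute all pairwise similarities, and treat matching pairs as the correct class in both directions. The encoder calls and the fixed temperature are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (N, N) pairwise similarities
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i = F.cross_entropy(logits, targets)                # image -> matching text
    loss_t = F.cross_entropy(logits.t(), targets)            # text -> matching image
    return (loss_i + loss_t) / 2

# Usage sketch:
# img = image_encoder(images)   # (N, D)
# txt = text_encoder(tokens)    # (N, D)
# loss = clip_loss(img, txt)
```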
DALL·E 2 - Output Samples
- Proposed by OpenAI in the paper Hierarchical text-conditional image generation with clip latents
Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with clip latents." arXiv preprint arXiv:2204.06125 (2022).
DALL·E 2 - Overall Architecture
Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with clip latents." arXiv preprint arXiv:2204.06125 (2022).
DALL·E 2 - Prior
- Generates a CLIP image embedding from the GLIDE text encoding and the CLIP text embedding
Diagram: the prior's diffusion backbone is a Transformer decoder. Its input sequence consists of the text encoding sequence (Batch, Channel, Length), the CLIP text embedding, an embedding of the diffusion timestep, and the noised CLIP image embedding, followed by a final embedding whose output is used to predict the unnoised CLIP image embedding.
Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with clip latents." arXiv preprint arXiv:2204.06125 (2022).
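A rough sketch of how the prior's input sequence could be assembled, following the ordering in the diagram. Module choices (a plain TransformerEncoder standing in for the causally masked Transformer decoder) and dimensions are illustrative assumptions, not the DALL·E 2 implementation.

```python
import torch
import torch.nn as nn

class DiffusionPrior(nn.Module):
    """Sketch of the prior's input sequence: text encoding sequence, CLIP text
    embedding, diffusion timestep, noised CLIP image embedding, and a learned final
    token whose output predicts the unnoised CLIP image embedding."""
    def __init__(self, dim, depth=6, num_timesteps=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)   # causal mask omitted for brevity
        self.time_emb = nn.Embedding(num_timesteps, dim)
        self.final_token = nn.Parameter(torch.randn(dim))

    def forward(self, text_seq, clip_text_emb, t, noised_image_emb):
        b = text_seq.shape[0]
        tokens = torch.cat([
            text_seq,                           # (B, L, dim) text encoding sequence
            clip_text_emb[:, None],             # (B, 1, dim) CLIP text embedding
            self.time_emb(t)[:, None],          # (B, 1, dim) diffusion timestep embedding
            noised_image_emb[:, None],          # (B, 1, dim) noised CLIP image embedding
            self.final_token.expand(b, 1, -1),  # (B, 1, dim) final embedding
        ], dim=1)
        out = self.backbone(tokens)
        return out[:, -1]                       # predicted (unnoised) CLIP image embedding
```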
Guided Diffusion Sampling
Classifier-free Diffusion Guidance


CLIP Guidance
Guided Diffusion Sampling
- Proposed in the paper Diffusion Models Beat GANs on Image Synthesis
- In addition to the diffusion model, a separate image classifier is trained; during sampling, gradients from the classifier are used to guide the samples
Dhariwal, Prafulla, and Alexander Nichol. "Diffusion models beat gans on image synthesis." Advances in Neural Information Processing Systems 34 (2021): 8780-8794.

\hat{\mu}_\theta(x_t \mid y) = \mu_\theta(x_t \mid y) + s \cdot \Sigma_\theta(x_t \mid y)\, \nabla_{x_t} \log p_\phi(y \mid x_t)
(modified mean = original mean + guidance scale · original covariance · gradient from the classifier)

Posterior q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)

Nichol, Alex, et al. "Glide: Towards photorealistic image generation and editing with text-guided diffusion models." arXiv preprint arXiv:2112.10741 (2021). (equation as quoted in Nichol, Alex, et al.)
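A minimal sketch of how the classifier gradient could be folded into a sampling step, assuming a hypothetical noisy-image classifier `classifier(x_t, t)` that returns logits; `mu` and `sigma` stand for the mean and (diagonal) covariance predicted by the diffusion model for the current step.

```python
import torch

def classifier_guided_mean(mu, sigma, xt, t, y, classifier, s=1.0):
    """Classifier guidance: mu_hat = mu + s * Sigma * grad_{x_t} log p_phi(y | x_t)."""
    xt = xt.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(xt, t), dim=-1)     # assumed noisy-image classifier
    selected = log_probs[torch.arange(xt.shape[0]), y].sum()     # log p_phi(y | x_t) per sample
    grad = torch.autograd.grad(selected, xt)[0]
    return mu + s * sigma * grad

# Usage sketch (inside the sampling loop, before drawing x_{t-1}):
# mu_hat = classifier_guided_mean(mu, sigma, xt, t, labels, classifier, s=2.0)
# x_prev = mu_hat + sigma.sqrt() * torch.randn_like(xt)
```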
Guided Diffusion Sampling
Dhariwal, Prafulla, and Alexander Nichol. "Diffusion models beat gans on image synthesis." Advances in Neural Information Processing Systems 34 (2021): 8780-8794.
Classifier-free guidance
- Proposed by Ho, Jonathan, and Tim Salimans in the paper Classifier-free diffusion guidance
- Enables guided sampling with the diffusion model alone, without training an additional classifier
Ho, Jonathan, and Tim Salimans. "Classifier-free diffusion guidance." NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications. 2021.
Nichol, Alex, et al. "Glide: Towards photorealistic image generation and editing with text-guided diffusion models." arXiv preprint arXiv:2112.10741 (2021). (equation as quoted in Nichol, Alex, et al.)

\hat{\epsilon}_\theta(x_t \mid y) = \epsilon_\theta(x_t \mid y) + s \cdot \left(\epsilon_\theta(x_t \mid y) - \epsilon_\theta(x_t \mid \emptyset)\right)
(modified score = conditional predicted score + guidance scale · (conditional predicted score − unconditional predicted score))

\epsilon_\theta(x_t \mid y) - \epsilon_\theta(x_t \mid \emptyset) \approx -\sigma_t\, \nabla_{x_t} \log p^{i}(y \mid x_t)

The diffusion backbone (U-net) takes x_t, t, and either y or ∅ as input and outputs \epsilon_\theta(x_t \mid y) or \epsilon_\theta(x_t \mid \emptyset).
Classifier-free guidance
Ho, Jonathan, and Tim Salimans. "Classifier-free diffusion guidance." NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications. 2021.
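A minimal sketch of the classifier-free combination at sampling time, assuming an `eps_model(x_t, t, cond)` that was trained with the condition randomly replaced by a null token, so the same network can also predict the unconditional score; the two forward passes are often batched together in practice.

```python
import torch

@torch.no_grad()
def cfg_eps(eps_model, xt, t, y, null_cond, s=3.0):
    """Classifier-free guidance:
    eps_hat = eps(x_t | y) + s * (eps(x_t | y) - eps(x_t | null))."""
    eps_cond = eps_model(xt, t, y)            # conditional predicted score
    eps_uncond = eps_model(xt, t, null_cond)  # unconditional predicted score
    return eps_cond + s * (eps_cond - eps_uncond)
```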
CLIP Guidance
- Similar to classifier guidance, but a CLIP model is used to guide the sampling steps
Nichol, Alex, et al. "Glide: Towards photorealistic image generation and editing with text-guided diffusion models." arXiv preprint arXiv:2112.10741 (2021).

\hat{\mu}_\theta(x_t \mid c) = \mu_\theta(x_t \mid c) + s \cdot \Sigma_\theta(x_t \mid c)\, \nabla_{x_t}\!\left(f(x_t) \cdot g(c)\right)
where f(x_t) is the CLIP image encoding and g(c) is the CLIP text encoding.
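A minimal sketch of the CLIP-guided mean shift, analogous to the classifier-guided version above; `clip_image_encoder` stands in for a CLIP image encoder (GLIDE trains one on noised images), and `text_emb` is the precomputed g(c).

```python
import torch
import torch.nn.functional as F

def clip_guided_mean(mu, sigma, xt, text_emb, clip_image_encoder, s=1.0):
    """CLIP guidance: mu_hat = mu + s * Sigma * grad_{x_t} ( f(x_t) . g(c) )."""
    xt = xt.detach().requires_grad_(True)
    img_emb = F.normalize(clip_image_encoder(xt), dim=-1)        # f(x_t)
    sim = (img_emb * F.normalize(text_emb, dim=-1)).sum()        # f(x_t) . g(c), summed over the batch
    grad = torch.autograd.grad(sim, xt)[0]
    return mu + s * sigma * grad
```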
Classifier-free Guidance vs. CLIP Guidance in GLIDE
Nichol, Alex, et al. "Glide: Towards photorealistic image generation and editing with text-guided diffusion models." arXiv preprint arXiv:2112.10741 (2021).
Thank you :)
