Generative adversarial network and its applications to speech signal and natural language processing (INTERSPEECH 2019 tutorial)
The document discusses Generative Adversarial Networks (GANs), outlining their basic principles, theoretical foundations, and diverse applications in speech and natural language processing. It covers different types of GANs, including traditional and conditional models, and explores their utility in generating images, video, and audio-to-image translations. Additionally, it references significant research contributions and advancements in the field since the inception of GANs.
Introduction to Generative Adversarial Networks (GANs), their diversity and historical context, with reference to over 500 types in the GAN ecosystem.
Foundation of GAN, including generators and discriminators, types of GANs, and an illustration on image generation.
Stepwise explanation of GAN training processes with illustrations, showcasing generator and discriminator updates during iterations.
Introduction of advanced GANs like StyleGAN, highlighting their evolution and improvements in image generation techniques.
Understanding the internal mechanisms of GANs, including loss functions, evaluation metrics, conditional GANs, and practical applications.
Applications of Conditional GANs in image-to-image translation, showcasing improvements and metrics comparisons for multiple architectures.
Exploring unsupervised conditional GANs, domain adaptation methods, and various approaches like CycleGAN for feature transformation.
Applications of GANs in speech processing, including speech enhancement, recognition tasks, and examples of techniques like SEGAN. Techniques in voice conversion through GANs, highlighting the use of GANs for improving speech intelligibility in various scenarios.
Discussion on GANs in NLP tasks, focusing on sequence generation, text style transfer, and summarization, with references to empirical results.
Empirical performance insights of GANs in NLP, outlining summary generation, translation, and evaluation methodologies in unsupervised settings.
Outline
Part I: Basic Idea of Generative Adversarial Network (GAN)
Part II: A little bit of theory
Part III: Applications to Speech Processing
Part IV: Applications to Natural Language Processing
Take a break
3.
All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo
(not updated since 2018.09)
More than 500 species
in the zoo
4.
All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo
GAN
ACGAN
BGAN
DCGAN
EBGAN
fGAN
GoGAN
CGAN
……
Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, “Variational Approaches for Auto-Encoding
Generative Adversarial Networks”, arXiv, 2017
5.
How many papers have "adversarial" in their titles?
[Bar chart: counts per year at ICASSP and INTERSPEECH, 2012-2019; essentially zero before 2015, rising to dozens of papers by 2018-2019.]
It is a wise choice to attend this tutorial.
Three Categories of GAN
1. Generation: the generator maps a random vector (e.g., [−0.3, 0.1, …, 0.9]) to an image.
2. Conditional Generation: the generator maps text (e.g., "girl with red hair" or "blue eyes, red hair, short hair") to an image; it is trained with paired data.
3. Unsupervised Conditional Generation: transform between domain x and domain y (e.g., photo → Vincent van Gogh's style) using unpaired data.
Basic Idea of GAN
Generator: a neural network (NN), or a function. It takes a vector as input and outputs a high-dimensional vector, e.g., an image.
[Diagram: different input vectors such as [0.1, −3, …, 2.4, 0.9] produce different generated faces.]
Each dimension of the input vector represents some characteristics, e.g., longer hair, blue hair, open mouth.
Powered by: http://mattya.github.io/chainer-DCGAN/
10.
Basic Idea of GAN
Discriminator: also a neural network (NN), or a function. It takes an image as input and outputs a scalar.
A larger value means real, a smaller value means fake (e.g., 1.0 for realistic images, 0.1 for poor ones).
11.
Algorithm
• Initialize generator G and discriminator D
• In each training iteration:
Step 1: Fix generator G, and update discriminator D.
Sample real objects from the database (labeled 1) and generate objects from randomly sampled vectors (labeled 0).
The discriminator learns to assign high scores to real objects and low scores to generated objects.
12.
Algorithm
Step 2: Fix discriminator D, and update generator G.
The generator and discriminator together form one large network; the generator is updated by gradient ascent on the (fixed) discriminator's output score.
The generator learns to "fool" the discriminator.
13.
Algorithm (full loop)
• Initialize generator G and discriminator D
• In each training iteration:
Learning D: sample some real objects from the database (labeled 1), generate some fake objects with G (labeled 0), then update D while keeping G fixed.
Learning G: feed random vectors through G and D, then update G to raise D's score while keeping D fixed.
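To make the loop concrete, here is a minimal sketch of the two-step procedure above, assuming PyTorch; the MLP sizes, optimizer settings, and 784-dimensional flattened images are illustrative assumptions, not part of the tutorial.

```python
import torch
import torch.nn as nn

# Illustrative generator and discriminator (sizes are assumptions).
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                       # real: (batch, 784) images from the database
    batch = real.size(0)
    z = torch.randn(batch, 100)             # randomly sampled vectors

    # Step 1: fix G, update D -> high scores (1) for real, low scores (0) for generated
    fake = G(z).detach()                    # detach so this step does not update G
    loss_D = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Step 2: fix D, update G -> G learns to "fool" D (raise D's score on its samples)
    loss_G = bce(D(G(z)), torch.ones(batch, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```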
[David Bau, et al., ICLR 2019]
Does the generator have the concept of objects?
Some neurons correspond to specific objects, for example, tree.
27.
Remove the neurons for tree
[David Bau, et al., ICLR 2019]
Activate the neurons for tree
28.
Three Categories of GAN (recap)
1. Generation
2. Conditional Generation (paired data)
3. Unsupervised Conditional Generation (unpaired data)
29.
Text-to-Image
• Traditional supervised approach: an NN maps text c (e.g., "a dog is running", "a bird is flying", "train") to an image, trained so its output is as close as possible to the target image.
• Because each text is paired with many different images, the network averages over them and produces a blurry image!
30.
Conditional GAN
G: x = G(c, z), where c is the condition (e.g., c: "train") and z is sampled from a normal distribution.
D (original design): takes only the image x and outputs a scalar, x is a real image or not (real images → 1, generated images → 0).
Problem: the generator will learn to generate realistic images but completely ignore the input conditions.
[Scott Reed, et al., ICML, 2016]
31.
Conditional GAN
D (better design): takes both the condition c and the image x, and outputs a scalar that evaluates whether x is realistic or not AND whether c and x are matched or not.
True text-image pairs, e.g., ("train", real train image) → 1; mismatched pairs, e.g., ("cat", train image), or generated images → 0.
[Scott Reed, et al., ICML, 2016]
32.
Conditional GAN - Discriminator
Design used by almost every paper: condition c and object x are fed into one network that outputs a single score.
Alternative design: one network scores whether x is realistic, a second network scores whether c and x are matched, and the two scores are combined.
[Takeru Miyato, et al., ICLR, 2018] [Han Zhang, et al., arXiv, 2017] [Augustus Odena, et al., ICML, 2017]
33.
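The two discriminator designs on this slide can be sketched as follows (a hedged illustration assuming PyTorch; `c_dim`, `x_dim`, and the hidden sizes are made-up parameters):

```python
import torch
import torch.nn as nn

class ConcatDiscriminator(nn.Module):
    """Design used by almost every paper: score (c, x) jointly with one network."""
    def __init__(self, c_dim, x_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(c_dim + x_dim, 256), nn.ReLU(), nn.Linear(256, 1))
    def forward(self, c, x):
        return self.net(torch.cat([c, x], dim=-1))        # one scalar: realistic AND matched

class DecomposedDiscriminator(nn.Module):
    """Alternative design: separate 'x is realistic' and 'c and x are matched' scores."""
    def __init__(self, c_dim, x_dim):
        super().__init__()
        self.feat = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.real_head = nn.Linear(256, 1)                 # x is realistic or not
        self.match_head = nn.Linear(256 + c_dim, 1)        # c and x are matched or not
    def forward(self, c, x):
        h = self.feat(x)
        return self.real_head(h) + self.match_head(torch.cat([h, c], dim=-1))
```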
Conditional GAN
Paired data: collecting anime faces and the descriptions of their characteristics (e.g., "blue eyes, red hair, short hair").
Generated examples conditioned on "red hair, green eyes" and "blue hair, red eyes".
The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie, Tsung-Han Wu.
34.
Conditional GAN - Image-to-image
x = G(c, z), where the condition c is itself an image.
Image translation, or pix2pix [Phillip Isola, et al., CVPR, 2017]
35.
Conditional GAN - Image-to-image
• Traditional supervised approach: an NN is trained so that its output image is as close as possible to the target, e.g., with an L1 loss.
• Testing: the output is blurry.
[Phillip Isola, et al., CVPR, 2017]
36.
Conditional GAN - Image-to-image
Adding a discriminator: G generates the image from the input c and noise z, and D outputs a scalar score. Testing results compare input, L1, GAN, and GAN + L1; combining GAN with L1 gives sharp images that stay close to the target.
[Phillip Isola, et al., CVPR, 2017]
Conditional GAN - Sound-to-image
• Audio-to-image: https://wjohn1483.github.io/audio_to_scene/index.html
The images are generated by Chia-Hung Wan and Shun-Po Chuang.
(The generated scene changes as the input audio becomes louder.)
Conditional GAN - Image-to-label
F1 scores on MS-COCO and NUS-WIDE:

               MS-COCO   NUS-WIDE
VGG-16           56.0      33.9
  + GAN          60.4      41.2
Inception        62.4      53.5
  + GAN          63.8      55.8
Resnet-101       62.8      53.1
  + GAN          64.0      55.4
Resnet-152       63.3      52.1
  + GAN          63.9      54.1
Att-RNN          62.1      54.7
RLSD             62.0      46.9

The classifiers can have different architectures; they are trained as conditional GANs.
[Tsai, et al., ICASSP 2019]
41.
Conditional GAN outperforms other models designed for multi-label classification.
42.
Conditional GAN - Video Generation
The generator produces the next frame of a video; the discriminator checks whether the last frame is real or generated, and the generator is trained until the discriminator thinks the generated frame is real.
[Michael Mathieu, et al., arXiv, 2015]
More about video generation: https://arxiv.org/abs/1905.08233 [Egor Zakharov, et al., arXiv, 2019]
45.
Domain Adversarial Training
• Training and testing data are in different domains (take digit classification as an example).
• A generator (feature extractor) maps both training and testing data to features that follow the same distribution.
46.
Domain Adversarial Training
The feature extractor (generator) maps images to features; the discriminator (domain classifier) tries to tell which domain each feature comes from (blue points vs. red points).
Degenerate solution: if the feature extractor always outputs zero vectors, the domain classifier fails, but the features are useless.
47.
Domain Adversarial Training
The feature extractor (generator) feeds its features to two heads: a label predictor ("which digit?") and a discriminator / domain classifier ("which domain?").
The feature extractor must not only cheat the domain classifier, but also satisfy the label predictor at the same time; a sketch with a gradient reversal layer follows below.
Successfully applied to image classification [Ganin et al., ICML, 2015] [Ajakan et al., JMLR, 2016]. More speech-related applications in Part III.
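A minimal sketch of this idea with a gradient reversal layer (GRL), assuming PyTorch; the layer sizes, 784-dimensional input, and 10 digit classes are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

feature_extractor = nn.Sequential(nn.Linear(784, 128), nn.ReLU())   # the "generator"
label_predictor   = nn.Linear(128, 10)    # which digit?
domain_classifier = nn.Linear(128, 2)     # which domain?

def losses(x, y_label, y_domain, lam=1.0):
    f = feature_extractor(x)
    loss_label  = F.cross_entropy(label_predictor(f), y_label)
    # The GRL makes the feature extractor maximize the domain loss (cheat the domain
    # classifier) while the domain classifier itself still minimizes it.
    loss_domain = F.cross_entropy(domain_classifier(GradReverse.apply(f, lam)), y_domain)
    return loss_label + loss_domain        # one backward pass updates all three modules
```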
48.
Three Categories of GAN (recap)
1. Generation
2. Conditional Generation (paired data)
3. Unsupervised Conditional Generation (unpaired data)
49.
Unsupervised Conditional Generation
G transforms an object in domain X into an object in domain Y without paired data (e.g., domain X: photos; domain Y: Vincent van Gogh's paintings; the two collections are not paired).
Image style transfer is used as the example here; more applications in Parts III and IV.
50.
Unsupervised Conditional Generation
• Approach 1: Cycle-GAN and its variants — directly learn a generator G_X→Y from domain X to domain Y.
• Approach 2: Shared latent space — an encoder of domain X (EN_X) maps to a latent representation (e.g., face attributes), and a decoder of domain Y (DE_Y) maps it to domain Y.
51.
Cycle GAN
G_X→Y transforms an image from domain X; D_Y sees real domain-Y images and G's outputs and produces a scalar: the input image belongs to domain Y or not. Training against D_Y makes G's outputs become similar to domain Y.
52.
Cycle GAN
Problem: to fool D_Y, G_X→Y only has to produce something that looks like domain Y; it can ignore its input entirely. Not what we want!
53.
Cycle GAN
Fix 1: the issue can be avoided by network design; a simpler generator makes the input and output more closely related. [Tomer Galanti, et al., ICLR, 2018]
54.
Cycle GAN
Fix 2: feed the input and the output of G_X→Y into a pre-trained encoder network and force the two embeddings to be as close as possible.
Baseline of DTN [Yaniv Taigman, et al., ICLR, 2017]
55.
Cycle GAN
Fix 3: cycle consistency. A second generator G_Y→X maps the output back to domain X, and the reconstruction must be as close as possible to the original input. If G_X→Y ignored its input, there would be a lack of information for reconstruction, so the output is forced to stay related to the input.
[Jun-Yan Zhu, et al., ICCV, 2017]
56.
Cycle GAN (both directions)
X → Y → X: G_X→Y then G_Y→X, with the reconstruction as close as possible to the original x; D_Y outputs a scalar: belongs to domain Y or not.
Y → X → Y: G_Y→X then G_X→Y, with the reconstruction as close as possible to the original y; D_X outputs a scalar: belongs to domain X or not.
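A hedged sketch of the objective on this slide, assuming PyTorch and generic image-to-image networks G_XY, G_YX, D_X, D_Y (the least-squares adversarial term and the weight lam=10 are illustrative choices, not the tutorial's):

```python
import torch
import torch.nn.functional as F

def cycle_gan_generator_loss(x, y, G_XY, G_YX, D_X, D_Y, lam=10.0):
    fake_y, fake_x = G_XY(x), G_YX(y)
    # Adversarial terms: make D_Y / D_X believe the translated images belong to Y / X.
    loss_gan = F.mse_loss(D_Y(fake_y), torch.ones_like(D_Y(fake_y))) + \
               F.mse_loss(D_X(fake_x), torch.ones_like(D_X(fake_x)))
    # Cycle consistency: X -> Y -> X and Y -> X -> Y must reconstruct the inputs
    # ("as close as possible"), which prevents the generators from ignoring their inputs.
    loss_cyc = F.l1_loss(G_YX(fake_y), x) + F.l1_loss(G_XY(fake_x), y)
    return loss_gan + lam * loss_cyc
```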
57.
The same idea appears as Cycle GAN [Jun-Yan Zhu, et al., ICCV, 2017], Dual GAN [Zili Yi, et al., ICCV, 2017], and DiscoGAN [Taeksoo Kim, et al., ICML, 2017].
For multiple domains, consider StarGAN [Yunjey Choi, et al., arXiv, 2017].
58.
Issue of Cycle Consistency
• CycleGAN: a Master of Steganography [Casey Chu, et al., NIPS workshop, 2017]
The generator can hide the information needed for reconstruction inside its output, so the cycle can be consistent even when the translation is not faithful to the input.
59.
Unsupervised Conditional Generation (recap)
• Approach 1: Cycle-GAN and its variants
• Approach 2: Shared latent space (encoder of domain X → latent face attributes → decoder of domain Y)
60.
Shared latent space
EN_X and EN_Y encode images from domain X and domain Y into a common latent space (e.g., face attributes); DE_X and DE_Y decode a latent code back into each domain.
Target: remove the domain-X information and add domain-Y information, i.e., encode with EN_X and decode with DE_Y.
61.
Shared latent space - Training
Train two auto-encoders, EN_X with DE_X on domain X and EN_Y with DE_Y on domain Y, each minimizing reconstruction error.
62.
Shared latent space - Training
In addition to minimizing the reconstruction errors, discriminators D_X and D_Y of the two domains make the decoded images look real.
Problem: because we train the two auto-encoders separately, images with the same attribute may not project to the same position in the latent space.
63.
Shared latent space - Training
Add a domain discriminator on the latent code that guesses whether the code comes from EN_X or EN_Y; EN_X and EN_Y learn to fool it. The domain discriminator thus forces the outputs of EN_X and EN_Y to have the same distribution.
[Guillaume Lample, et al., NIPS, 2017]
64.
Shared latent space - Training
Cycle consistency: encode with EN_X, decode with DE_Y, encode the result with EN_Y, decode with DE_X, and minimize the reconstruction error against the original input. Used in ComboGAN [Asha Anoosheh, et al., arXiv, 2017].
65.
Shared latent space - Training
Semantic consistency: the input and its translation should map to the same point in the latent space (the consistency is enforced on latent codes rather than on reconstructed images). Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017].
66.
Shared latent space
Another way to couple the two auto-encoders is sharing (part of) the parameters of the encoders EN_X, EN_Y and the decoders DE_X, DE_Y.
Coupled GAN [Ming-Yu Liu, et al., NIPS, 2016], UNIT [Ming-Yu Liu, et al., NIPS, 2017]
67.
Shared latent space
In the extreme case, use one encoder to extract domain-independent information and one decoder that takes an extra indicator (x or y) to control which domain to generate. Widely used in voice conversion (Part III).
Three Categories of GAN (recap)
1. Typical GAN
2. Conditional GAN (paired data)
3. Unsupervised Conditional GAN (unpaired data)
70.
Reference
• Generation
• Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial
Nets, NIPS, 2014
• Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, Progressive Growing
of GANs for Improved Quality, Stability, and Variation, ICLR, 2018
• Andrew Brock, Jeff Donahue, Karen Simonyan, Large Scale GAN Training for
High Fidelity Natural Image Synthesis, arXiv, 2018
• David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B.
Tenenbaum, William T. Freeman, Antonio Torralba, GAN Dissection:
Visualizing and Understanding Generative Adversarial Networks, ICLR 2019
71.
Reference
• Conditional Generation
• Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt
Schiele, Honglak Lee, Generative Adversarial Text to Image Synthesis, ICML,
2016
• Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, Image-to-Image
Translation with Conditional Adversarial Networks, CVPR, 2017
• Michael Mathieu, Camille Couprie, Yann LeCun, Deep multi-scale video
prediction beyond mean square error, arXiv, 2015
• Mehdi Mirza, Simon Osindero, Conditional Generative Adversarial Nets,
arXiv, 2014
• Takeru Miyato, Masanori Koyama, cGANs with Projection Discriminator,
ICLR, 2018
• Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei
Huang, Dimitris Metaxas, StackGAN++: Realistic Image Synthesis with
Stacked Generative Adversarial Networks, arXiv, 2017
• Augustus Odena, Christopher Olah, Jonathon Shlens, Conditional Image
Synthesis With Auxiliary Classifier GANs, ICML, 2017
72.
Reference
• Conditional Generation
• Yaroslav Ganin, Victor Lempitsky, Unsupervised Domain Adaptation by
Backpropagation, ICML, 2015
• Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario
Marchand, Domain-Adversarial Training of Neural Networks, JMLR, 2016
• Che-Ping Tsai, Hung-Yi Lee, Adversarial Learning of Label Dependency: A
Novel Framework for Multi-class Classification, submitted to ICASSP 2019
• Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, Victor Lempitsky, Few-
Shot Adversarial Learning of Realistic Neural Talking Head Models, arXiv
2019
• Chia-Hung Wan, Shun-Po Chuang, Hung-Yi Lee, "Towards Audio to Scene
Image Synthesis using Generative Adversarial Network", ICASSP, 2019
73.
Reference
• Unsupervised Conditional Generation
• Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Unpaired Image-to-
Image Translation using Cycle-Consistent Adversarial Networks, ICCV, 2017
• Zili Yi, Hao Zhang, Ping Tan, Minglun Gong, DualGAN: Unsupervised Dual
Learning for Image-to-Image Translation, ICCV, 2017
• Tomer Galanti, Lior Wolf, Sagie Benaim, The Role of Minimal Complexity
Functions in Unsupervised Learning of Semantic Mappings, ICLR, 2018
• Yaniv Taigman, Adam Polyak, Lior Wolf, Unsupervised Cross-Domain Image
Generation, ICLR, 2017
• Asha Anoosheh, Eirikur Agustsson, Radu Timofte, Luc Van Gool, ComboGAN:
Unrestrained Scalability for Image Domain Translation, arXiv, 2017
• Amélie Royer, Konstantinos Bousmalis, Stephan Gouws, Fred Bertsch, Inbar
Mosseri, Forrester Cole, Kevin Murphy, XGAN: Unsupervised Image-to-
Image Translation for Many-to-Many Mappings, arXiv, 2017
74.
Reference
• Unsupervised Conditional Generation
• Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic
Denoyer, Marc'Aurelio Ranzato, Fader Networks: Manipulating Images by
Sliding Attributes, NIPS, 2017
• Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, Jiwon Kim,
Learning to Discover Cross-Domain Relations with Generative Adversarial
Networks, ICML, 2017
• Ming-Yu Liu, Oncel Tuzel, “Coupled Generative Adversarial Networks”, NIPS,
2016
• Ming-Yu Liu, Thomas Breuel, Jan Kautz, Unsupervised Image-to-Image
Translation Networks, NIPS, 2017
• Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim,
Jaegul Choo, StarGAN: Unified Generative Adversarial Networks for Multi-
Domain Image-to-Image Translation, arXiv, 2017
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
77.
Generator
• A generator G is a network. The network defines a probability distribution $P_G$: sample $z$ from a normal distribution and output $x = G(z)$, where $x$ is an image (a high-dimensional vector).
• We want $P_G(x)$ and $P_{data}(x)$ to be as close as possible:
$G^* = \arg\min_G \mathrm{Div}(P_G, P_{data})$,
where Div is the divergence between the distributions $P_G$ and $P_{data}$. How to compute the divergence?
78.
Discriminator
$G^* = \arg\min_G \mathrm{Div}(P_G, P_{data})$
Although we do not know the distributions $P_G$ and $P_{data}$, we can sample from them: sampling from $P_G$ means feeding vectors sampled from the normal distribution through G; sampling from $P_{data}$ means drawing examples from the database.
79.
Discriminator
$G^* = \arg\min_G \mathrm{Div}(P_G, P_{data})$
Train a discriminator D to separate data sampled from $P_{data}$ from data sampled from $P_G$.
Example objective function for D (G is fixed):
$V(G,D) = E_{x \sim P_{data}}[\log D(x)] + E_{x \sim P_G}[\log(1 - D(x))]$
Training: $D^* = \arg\max_D V(D,G)$
Using this example objective function is exactly the same as training a binary classifier. The maximum objective value is related to the JS divergence. [Goodfellow, et al., NIPS, 2014]
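The slide only states the relation; for reference, the standard result from the original GAN paper (not spelled out here) is:

```latex
% With the optimal discriminator D^*(x) = P_data(x) / (P_data(x) + P_G(x)):
\max_D V(G, D) \;=\; -2\log 2 \;+\; 2\,\mathrm{JSD}\!\left(P_{data}\,\|\,P_G\right)
```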
80.
Discriminator
Intuition: when $P_G$ and $P_{data}$ have a small divergence, samples from the two are hard to discriminate, so $\max_D V(D,G)$ is small; when the divergence is large, they are easy to discriminate and $\max_D V(D,G)$ is large.
81.
Replacing the divergence with the discriminator's maximum objective value gives
$G^* = \arg\min_G \max_D V(G,D)$,
where $\max_D V(G,D)$ is related to the JS divergence. This is exactly the algorithm from Part I:
• Initialize generator and discriminator
• In each training iteration:
Step 1: Fix generator G, and update discriminator D ($D^* = \arg\max_D V(D,G)$)
Step 2: Fix discriminator D, and update generator G
[Goodfellow, et al., NIPS, 2014]
82.
Can we use other divergences? Yes — f-GAN lets you use the divergence you like ☺ [Sebastian Nowozin, et al., NIPS, 2016]
83.
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
84.
GAN is difficult to train ……
• There is a saying ……
(I found this joke from 陳柏文’s facebook.)
85.
Too many tips……
• I did a little survey among 12 students …..
Q: What is the most helpful tip for
training GAN?
WGAN (33.3%)
Spectral Norm (16.7%)
86.
JS divergence is not suitable
• In most cases, $P_G$ and $P_{data}$ do not overlap.
• Reason 1 — the nature of the data: both $P_{data}$ and $P_G$ are low-dimensional manifolds in a high-dimensional space, so their overlap can be ignored.
• Reason 2 — sampling: even if $P_{data}$ and $P_G$ do overlap, with too few samples the two sets of points can still be separated.
87.
What is the problem of JS divergence?
JS divergence is $\log 2$ whenever two distributions do not overlap:
$JS(P_{G_0}, P_{data}) = \log 2$, $JS(P_{G_1}, P_{data}) = \log 2$, ……, and only when the distributions finally overlap does it drop, e.g., $JS(P_{G_{100}}, P_{data}) = 0$.
So $P_{G_0}$ and $P_{G_1}$ are "equally bad" under JS even though $P_{G_1}$ is closer to $P_{data}$: the same max objective value (same divergence) is obtained, and the generator gets no signal to move closer.
Intuition: if two distributions do not overlap, a binary classifier achieves 100% accuracy.
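A tiny numeric check of this claim (a sketch, not from the slides): for two disjoint discrete distributions, the JS divergence equals log 2 no matter how far apart their supports are.

```python
import numpy as np

def kl(a, b):
    mask = a > 0                      # wherever a > 0, the mixture b is also > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([1.0, 0.0, 0.0, 0.0])    # P_G: all mass on the left
q = np.array([0.0, 0.0, 0.0, 1.0])    # P_data: all mass on the right
print(js(p, q), np.log(2))            # both ~0.6931, regardless of the gap
```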
88.
Wasserstein distance
• Consider one distribution P as a pile of earth and another distribution Q as the target.
• The Wasserstein distance is the average distance the earth mover has to move the earth; for two point masses separated by distance d, $W(P,Q) = d$.
89.
Wasserstein distance
There are many possible "moving plans", some giving a smaller average distance and some a larger one; the Wasserstein distance is defined using the moving plan with the smallest average distance.
Source of image: https://vincentherrmann.github.io/blog/wasserstein/
WGAN
$\max_{D \in \text{1-Lipschitz}} \; E_{x \sim P_{data}}[D(x)] - E_{x \sim P_G}[D(x)]$
evaluates the Wasserstein distance between $P_{data}$ and $P_G$. [Martin Arjovsky, et al., arXiv, 2017]
D has to be smooth enough (1-Lipschitz). Without the constraint, when the two distributions do not overlap, D(x) is pushed toward $+\infty$ on real samples and $-\infty$ on generated ones, and the training of D will not converge; keeping D smooth prevents this. How to fulfill the constraint?
92.
$\max_{D \in \text{1-Lipschitz}} \; E_{x \sim P_{data}}[D(x)] - E_{x \sim P_G}[D(x)]$
• Original WGAN → Weight Clipping [Martin Arjovsky, et al., arXiv, 2017]: force the parameters w to stay between c and -c; after each parameter update, if w > c, set w = c; if w < -c, set w = -c.
• Improved WGAN → Gradient Penalty [Ishaan Gulrajani, NIPS, 2017]: keep the gradient norm of D close to 1 around real and generated samples (see also [Kodali, et al., arXiv, 2017] [Wei, et al., ICLR, 2018]); a sketch follows below.
• Spectral Normalization → keep the gradient norm smaller than 1 everywhere [Miyato, et al., ICLR, 2018]
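A hedged sketch of the gradient-penalty idea referenced above, assuming PyTorch and flattened (batch, features) samples; the penalty weight lam=10 is the commonly used value, not something stated on the slide:

```python
import torch

def gradient_penalty(D, real, fake, lam=10.0):
    # Interpolate between real and generated samples ...
    eps = torch.rand(real.size(0), 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_out = D(x_hat)
    # ... and push the gradient norm of D at those points toward 1.
    grads = torch.autograd.grad(d_out.sum(), x_hat, create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Critic loss (WGAN-GP):  -(E_real[D(x)] - E_fake[D(x)]) + gradient_penalty(D, real, fake)
```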
93.
More Tips
• Improved techniques for training GANs [Tim Salimans, et al., NIPS, 2016]
• Tips in DCGAN [Alec Radford, et al., ICLR 2016]: guidelines for network architecture design for image generation
• Tips from Soumith: https://github.com/soumith/ganhacks
• Tips from BigGAN [Andrew Brock, et al., arXiv, 2018]
94.
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
95.
Inception Score [Tim Salimans, et al., NIPS, 2016]
Feed each generated image x into an off-the-shelf image classifier (e.g., an Inception net, VGG, etc.) to get the class distribution P(y|x).
• A concentrated P(y|x) for a single image means higher visual quality.
• A uniform marginal $P(y) = \frac{1}{N}\sum_n P(y_n|x_n)$ over many generated images means higher variety.
96.
Inception Score
$\text{Inception Score} = \sum_x \sum_y P(y|x)\log P(y|x) - \sum_y P(y)\log P(y)$
The first term is the negative entropy of P(y|x) (summed over generated images x); the second is the entropy of $P(y) = \frac{1}{N}\sum_n P(y_n|x_n)$. Sharper per-image distributions and a more uniform marginal both raise the score.
[Tim Salimans, et al., NIPS, 2016]
97.
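A small sketch of the computation (assuming `probs` is an (N, num_classes) NumPy array holding P(y|x_n) for N generated images from an off-the-shelf classifier); the score is usually reported as the exponent of the average KL divergence:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    p_y = probs.mean(axis=0)                     # P(y) = (1/N) * sum_n P(y|x_n)
    # KL( P(y|x_n) || P(y) ) per image: sharp P(y|x) and uniform P(y) -> large KL
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))              # higher = sharper and more diverse
```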
Fréchet Inception Distance (FID)
Blue points: latent representations (Inception-net features) of the generated images; red points: latent representations of the real images. Fit a Gaussian to each set of features; FID is the Fréchet distance between the two Gaussians (lower is better).
[Martin Heusel, et al., NIPS, 2017]
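The slide does not show the formula; for reference, with means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$ of the real and generated feature sets, the Fréchet distance between the two Gaussians is:

```latex
\mathrm{FID} \;=\; \lVert \mu_r - \mu_g \rVert_2^2
              \;+\; \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)
```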
98.
To learn more about evaluation …
Pros and cons of GAN evaluation measures
https://arxiv.org/abs/1802.03446
[Ali Borji, 2019]
99.
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
Neural Network as Actor
• Input of the neural network: the observation of the machine, represented as a vector or a matrix (e.g., pixels).
• Output of the neural network: each action corresponds to a neuron in the output layer and receives a score (e.g., left 0.7, right 0.2, fire 0.1). Take the action based on the probability.
Reinforcement Learning vs. GAN
The actor sees state $s_1$, takes action $a_1$, the environment returns state $s_2$ and reward $r_1$, the actor takes $a_2$, and so on; the total reward of a trajectory is $R(\tau) = \sum_{t=1}^{T} r_t$.
Correspondence: actor → generator (updated); reward function → discriminator (fixed). The environment and reward function are a "black box", so you cannot use backpropagation through them.
104.
Inverse Reinforcement Learning
The reward function is not available (in many cases, it is difficult to define a reward function), but we have demonstrations of the expert: $\hat\tau_1, \hat\tau_2, \cdots, \hat\tau_N$, where each $\hat\tau$ is a trajectory of the expert.
Examples: self-driving — record human drivers; robot — move the robot's arm by hand to demonstrate the task.
Framework of IRL
The expert $\hat\pi$ provides trajectories $\hat\tau_1, \hat\tau_2, \cdots, \hat\tau_N$; the actor $\pi$ produces trajectories $\tau_1, \tau_2, \cdots, \tau_N$.
Obtain a reward function R under which the expert is always the best: $\sum_{n=1}^{N} R(\hat\tau_n) > \sum_{n=1}^{N} R(\tau_n)$.
Then find an actor based on reward function R (by reinforcement learning), and iterate.
Correspondence: reward function → discriminator; actor → generator.
107.
GAN vs. IRL
GAN: D gives a high score to real examples and a low score to generated ones; find a G whose output obtains a large score from D.
IRL: the reward function gives a larger reward to the expert's trajectories $\hat\tau_n$ and a lower reward to the actor's trajectories $\tau$; find an actor that obtains a large reward.
108.
Outline of Part II
Basic Theory of GAN
Helpful Tips
How to evaluate GAN
Relation to Reinforcement Learning
109.
Reference
• Sebastian Nowozin, Botond Cseke, Ryota Tomioka, “f-GAN: Training Generative
Neural Samplers using Variational Divergence Minimization”, NIPS, 2016
• Martin Arjovsky, Soumith Chintala, Léon Bottou, Wasserstein GAN, arXiv, 2017
• Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron
Courville, Improved Training of Wasserstein GANs, NIPS, 2017
• Junbo Zhao, Michael Mathieu, Yann LeCun, Energy-based Generative Adversarial
Network, arXiv, 2016
• Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, Olivier Bousquet, “Are
GANs Created Equal? A Large-Scale Study”, arXiv, 2017
• Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi
Chen Improved Techniques for Training GANs, NIPS, 2016
• Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp
Hochreiter, GANs Trained by a Two Time-Scale Update Rule Converge to a Local
Nash Equilibrium, NIPS, 2017
110.
Generative Adversarial Network
and its Applications to Signal Processing
and Natural Language Processing
Part III: Speech Signal
Processing
Tsao, Yu Ph.D., Academia Sinica
yu.tsao@citi.sinica.edu.tw
111.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task)
[Diagram: an encoder E computes an embedding z = g(x) of the input, and a classifier G = h(·) produces the output label y. Acoustic mismatch: the model is trained on clean data x but tested on noisy data, accented speech, or channel-distorted speech.]
114.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
115.
Speech Enhancement
• Neural network models for spectral mapping: a model G maps noisy speech features to an enhanced output under some objective function.
➢ Model structures of G: DNN [Wang et al., NIPS 2012; Xu et al., SPL 2014], DDAE [Lu et al., Interspeech 2013], RNN (LSTM) [Chen et al., Interspeech 2015; Weninger et al., LVA/ICA 2015], CNN [Fu et al., Interspeech 2016].
➢ Typical objective functions: mean square error (MSE) [Xu et al., TASLP 2015], L1 [Pascual et al., Interspeech 2017], likelihood [Chai et al., MLSP 2017], STOI [Fu et al., TASLP 2018].
➢ GAN is used as a new objective function to estimate the parameters in G.
Speech Enhancement (SEGAN)
• Experimental results (Table 1: objective evaluation results; Table 2: subjective evaluation results; Fig. 1: preference test results).
SEGAN yields better speech enhancement results than Noisy and Wiener.
118.
Speech Enhancement
• Pix2Pix [Michelsanti et al., Interspeech 2017]: G maps the noisy spectrogram to an enhanced output; D takes (noisy, output) or (noisy, clean) pairs and outputs a scalar (fake/real).
119.
Speech Enhancement (Pix2Pix)
• Spectrogram analysis (Fig. 2: spectrogram comparison of Pix2Pix with baseline methods — Noisy, Clean, NG-DNN, STAT-MMSE, NG-Pix2Pix).
Pix2Pix outperforms STAT-MMSE and is competitive with DNN SE.
120.
Speech Enhancement (Pix2Pix)
• Objective evaluation and speaker verification test (Table 3: objective evaluation results; Table 4: speaker verification results).
1. From the PESQ and STOI evaluations, Pix2Pix outperforms Noisy and MMSE and is competitive with DNN SE.
2. From the speaker verification results, Pix2Pix outperforms the baseline models when clean training data is used.
121.
Speech Enhancement
• Frequency-domain SEGAN (FSEGAN) [Donahue et al., ICASSP 2018]: the same conditional setup as Pix2Pix — G enhances the noisy spectrogram, and D scores (noisy, output) vs. (noisy, clean) pairs (fake/real).
122.
Speech Enhancement (FSEGAN)
• Spectrogram analysis (Fig. 3: spectrogram comparison of FSEGAN with the L1-trained method).
FSEGAN reduces both additive noise and reverberant smearing.
123.
Speech Enhancement (FSEGAN)
• ASR results (Table 5: WER (%) of SEGAN and FSEGAN; Table 6: WER (%) of FSEGAN with retraining).
1. From Table 5: (1) FSEGAN improves recognition results for ASR-Clean; (2) FSEGAN outperforms SEGAN as a front-end.
2. From Table 6: (1) hybrid retraining with FSEGAN outperforms the baseline; (2) FSEGAN retraining slightly underperforms L1-based retraining.
124.
Speech Enhancement
• Speech enhancement through a mask function: G takes the noisy spectrogram and outputs a mask; the enhanced spectrogram is the point-wise multiplication of the mask and the noisy input (see the sketch below).
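A minimal sketch of mask-based enhancement (an assumption-laden illustration: librosa for the STFT, a generator `G` that maps a magnitude spectrogram to a same-shaped mask in [0, 1], and reuse of the noisy phase):

```python
import numpy as np
import librosa

def enhance(noisy_wav, G, n_fft=512, hop=256):
    spec = librosa.stft(noisy_wav, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(spec), np.angle(spec)
    mask = G(mag)                                      # predicted mask, same shape as mag
    enhanced_mag = mask * mag                          # point-wise multiplication
    enhanced_spec = enhanced_mag * np.exp(1j * phase)  # keep the noisy phase
    return librosa.istft(enhanced_spec, hop_length=hop)
```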
125.
Speech Enhancement
• GAN for spectral magnitude mask estimation (MMS-GAN) [Ashutosh Pandey and Deliang Wang, ICASSP 2018]: G estimates the mask from the noisy input; D takes (noisy, output mask) or (noisy, reference mask) pairs and outputs a scalar (fake/real).
We do not know exactly what function D learns here; our ICML 2019 paper sheds some light on a potential future direction.
126.
Speech Enhancement (AFT)
• Cycle-GAN-based acoustic feature transformation (AFT) [Mimura et al., ASRU 2017]: G_S→T and G_T→S map between clean (S) and noisy (T) acoustic features with cycle consistency in both directions (noisy → enhanced → noisy as close as possible, and clean → synthesized noisy → clean as close as possible); D_T and D_S output scalars indicating whether an input belongs to domain T (or S) or not.
$V_{Full} = V_{GAN}(G_{S \to T}, D_T) + V_{GAN}(G_{T \to S}, D_S) + \lambda V_{Cyc}(G_{S \to T}, G_{T \to S})$
127.
Speech Enhancement (AFT)
• ASR results on noise robustness and style adaptation (Table 7: noise-robust ASR; Table 8: speaking-style adaptation). S: clean; T: noisy. JNAS: read speech; CSJ-SPS: spontaneous (relaxed); CSJ-APS: spontaneous (formal).
1. G_T→S can transform acoustic features and effectively improve ASR results for both noisy and accented speech.
2. G_S→T can be used for model adaptation and effectively improve ASR results for noisy speech.
128.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
129.
Postfilter
• Postfilter for synthesized or transformed speech: G (the postfilter) maps synthesized spectral texture (from a speech synthesizer, voice conversion, or speech enhancement system) toward natural spectral texture.
➢ Conventional postfilter approaches for G estimation include global variance (GV) [Toda et al., IEICE 2007], variance scaling (VS) [Sil'en et al., Interspeech 2012], modulation spectrum (MS) [Takamichi et al., ICASSP 2014], and DNN with MSE criterion [Chen et al., Interspeech 2014; Chen et al., TASLP 2015].
➢ GAN is used as a new objective function to estimate the parameters in G.
130.
Postfilter
• GAN postfilter [Kaneko et al., ICASSP 2017]
➢ The traditional MMSE criterion results in statistical averaging.
➢ GAN is used as a new objective function to estimate the parameters in G.
➢ The proposed work intends to further improve the naturalness of synthesized speech or of parameters from a synthesizer.
G maps synthesized Mel-cepstral coefficients to generated ones; D judges whether Mel-cepstral coefficients are natural or generated.
131.
Postfilter (GAN-based Postfilter)
• Spectrogram analysis (Fig. 4: spectrograms of (a) NAT (natural); (b) SYN (synthesized); (c) VS (variance scaling); (d) MS (modulation spectrum); (e) MSE; (f) GAN postfilters).
The GAN postfilter reconstructs spectral texture similar to the natural one.
132.
Postfilter (GAN-based Postfilter)
• Objective evaluations (Fig. 5: Mel-cepstral trajectories, where GANv denotes GAN applied only to voiced parts; Fig. 6: average difference in modulation spectrum per Mel-cepstral coefficient).
The GAN postfilter reconstructs spectral texture similar to the natural one.
133.
Postfilter (GAN-based Postfilter)
• Subjective evaluations (Table 9: preference score (%); bold font indicates numbers over 30%).
1. The GAN postfilter significantly improves the synthesized speech.
2. The GAN postfilter is particularly effective in voiced segments.
3. GANv outperforms GAN and is comparable to NAT.
Speech Synthesis (ASV)
• Objective and subjective evaluations (Fig. 7: averaged GVs of MCCs; Fig. 8: scores of speech quality).
1. The proposed algorithm generates MCCs similar to the natural ones.
2. The proposed algorithm outperforms conventional MGE training.
Speech Synthesis (SS-GAN)
• Subjective evaluations (Fig. 9: scores of speech quality (sp); Fig. 10: scores of speech quality (sp and F0)).
The proposed algorithm works for both spectral parameters and F0.
138.
Voice Conversion
• Convert (transform) speech from a source speaker to a target speaker: G maps source-speaker speech to an output that should match the target speaker, under some objective function.
➢ Conventional VC approaches include Gaussian mixture models (GMM) [Toda et al., TASLP 2007], non-negative matrix factorization (NMF) [Wu et al., TASLP 2014; Fu et al., TBME 2017], locally linear embedding (LLE) [Wu et al., Interspeech 2016], variational autoencoders (VAE) [Hsu et al., APSIPA 2016], restricted Boltzmann machines (RBM) [Chen et al., TASLP 2014], feed-forward NNs [Desai et al., TASLP 2010], and recurrent NNs (RNN) [Nakashika et al., Interspeech 2014].
139.
Voice Conversion
• VAW-GAN [Hsu et al., Interspeech 2017]
➢ Conventional MMSE approaches often encounter the "over-smoothing" issue.
➢ GAN is used as a new objective function to estimate G.
➢ The goal is to increase the naturalness, clarity, and similarity of the converted speech.
G converts source-speaker speech toward the target speaker; D judges real vs. fake target-speaker speech. The objective combines VAE and GAN terms:
$V(G,D) = V_{GAN}(G,D) + \lambda\, V_{VAE}(x|y)$
140.
Voice Conversion (VAW-GAN)
• Objective and subjective evaluations (Fig. 11: spectral envelopes; Fig. 12: MOS on naturalness).
VAW-GAN outperforms VAE in both objective and subjective evaluations, generating more structured speech.
141.
Voice Conversion
• CycleGAN-VC [Kaneko et al., Eusipco 2018]: GAN is used as a new objective function to estimate G, with cycle consistency between the source (S) and target (T) speakers. G_S→T and G_T→S convert in both directions, with the reconstructions (source → synthesized target → source, and target → synthesized source → target) as close as possible to the originals; D_T and D_S output scalars indicating whether an input belongs to domain T (or S) or not.
$V_{Full} = V_{GAN}(G_{S \to T}, D_T) + V_{GAN}(G_{T \to S}, D_S) + \lambda V_{Cyc}(G_{S \to T}, G_{T \to S})$
142.
Voice Conversion (CycleGAN-VC)
• Subjective evaluations (Fig. 13: MOS for naturalness; Fig. 14: similarity to the source and target speakers; S: source, T: target, P: proposed, B: baseline).
1. The proposed method uses non-parallel data.
2. For naturalness, the proposed method outperforms the baseline.
3. For similarity, the proposed method is comparable to the baseline.
143.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
144.
Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task)
[Diagram as before: encoder E produces an embedding z = g(x); classifier G = h(·) outputs the label y; acoustic mismatch between clean training data and noisy, accented, or channel-distorted test data.]
Speech Recognition (AMT)
• ASR results in known (k) and unknown (unk) noisy conditions (Table 10: WER of DNNs with single-task learning (ST) and adversarial multi-task learning (AMT)).
The AMT-DNN outperforms the ST-DNN, yielding lower WERs.
147.
Speech Recognition
• Domain adversarial training for accented ASR (DAT) [Sun et al., ICASSP 2018]
The encoder E embeds the input acoustic feature x; classifier G predicts the senone y (output 1) and discriminator D predicts the domain z (output 2), connected to E through a gradient reversal layer (GRL).
Objective functions:
$V_y = -\sum_i \log P(y_i | x_i; \theta_E, \theta_G)$ (classification loss), $V_z = -\sum_i \log P(z_i | x_i; \theta_E, \theta_D)$ (domain loss)
Model updates:
$\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$ (max classification accuracy)
$\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$ (max domain accuracy)
$\theta_E \leftarrow \theta_E - \epsilon \left(\frac{\partial V_y}{\partial \theta_E} + \alpha \frac{\partial V_z}{\partial \theta_E}\right)$, where the domain-loss gradient reaches $\theta_E$ through the GRL (sign reversed), so E maximizes classification accuracy and minimizes domain accuracy.
148.
Speech Recognition (DAT)
• ASR results on accented speech (Table 11: WER of the baseline and adapted models; STD: standard speech).
1. With labeled transcriptions, ASR performance notably improves.
2. DAT is effective in learning features invariant to domain differences, with and without labeled transcriptions.
149.
Speech Recognition
• Unsupervised Adaptation with Domain Separation Networks (DSN) [Meng et al., ASRU 2017]
[Diagram: a shared encoder E embeds clean and noisy data as z = g(x); a senone classifier G = h(·) predicts output 1 (senone y) and a domain classifier D predicts output 2 (domain d); private component extractors (PEs for the clean domain, PEt for the noisy domain) capture domain-specific information, and reconstructors R rebuild the inputs.]
150.
Speech Recognition (DSN)
• Results on ASR in noise (CHiME-3) (Table 12: WER (%) of robust ASR on the CHiME-3 task).
1. DSN outperforms GRL consistently over different noise types.
2. The results confirm the additional gains provided by the private component extractors.
151.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
152.
Speaker Recognition
• Domain adversarial neural network (DANN) [Wang et al., ICASSP 2018]
[Diagram: enrollment and test i-vectors are pre-processed and mapped by the DANN before scoring. Inside the DANN, the encoder E embeds the acoustic feature x; classifier G predicts the speaker ID y (output 1) and discriminator D predicts the domain z (output 2) through a GRL, with losses $V_y$ and $V_z$.]
153.
Speaker Recognition (DANN)
• Recognition results under domain-mismatched conditions (Table 13: performance of DAT and state-of-the-art methods).
The DAT approach outperforms the other methods, achieving the lowest EER and DCF scores.
154.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
155.
Emotion Recognition
• Adversarial AE for emotion recognition (AAE-ER) [Sahu et al., Interspeech 2017]
An autoencoder (encoder g(·), decoder h(·)) is combined with a GAN: a discriminator D matches the distribution of the code vectors z = g(x) to a prior distribution q, and a generator path produces synthetic feature vectors.
AE-with-GAN objective: $H(h(z), x) + \lambda\, V_{GAN}(q, g(x))$, where the first term is the reconstruction loss and the second matches the code distribution to q.
156.
Emotion Recognition (AAE-ER)
• Recognition results under domain-mismatched conditions (Table 14: classification results of different systems; Table 15: classification results on real and synthesized features, compared with the original training data).
1. The AAE alone could not yield performance improvements.
2. Using synthetic data from the AAE can yield a higher UAR.
157.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
158.
Lip-reading
• Domain adversarial training for lip-reading (DAT-LR) [Wand et al., Interspeech 2017]
The encoder E embeds the input x; classifier G predicts the words y (output 1, roughly 80% word accuracy) and discriminator D predicts the speaker (output 2) through a GRL.
The objective functions and updates are the same as in DAT for ASR:
$V_y = -\sum_i \log P(y_i|x_i;\theta_E,\theta_G)$, $V_z = -\sum_i \log P(z_i|x_i;\theta_E,\theta_D)$
$\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$, $\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$, $\theta_E \leftarrow \theta_E - \epsilon\left(\frac{\partial V_y}{\partial \theta_E} + \alpha \frac{\partial V_z}{\partial \theta_E}\right)$ through the GRL (max classification accuracy, min speaker/domain accuracy).
159.
Lip-reading (DAT-LR)
• Recognition results under speaker-mismatched conditions (Table 16: performance of DAT and the baseline).
The DAT approach notably enhances the recognition accuracy in the different conditions.
160.
Outline of Part III
Speech Signal Generation
• Speech enhancement
• Postfilter, speech synthesis, voice conversion
Speech Signal Recognition
• Speech recognition
• Speaker recognition
• Speech emotion recognition
• Lip reading
Conclusion
Our Recent Works
Speech, Speaker, Emotion Recognition and Lip-reading (Classification Task)
[Diagram as before: encoder E produces the embedding z = g(x); classifier G = h(·) outputs the label y; acoustic mismatch between clean training data and noisy, accented, or channel-distorted test data.]
163.
References
Speech enhancement (conventional methods)
• Y.-X. Wang and D.-L. Wang, Cocktail party processing via structured prediction, NIPS 2012.
• Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, An experimental study on speech enhancement based on deep neural
networks, IEEE SPL, 2014.
• Y. Xu, J. Du, L.-R. Dai, and Chin-Hui Lee, A regression approach to speech enhancement based on deep neural
networks, IEEE/ACM TASLP, 2015.
• X. Lu, Y. Tsao, S. Matsuda, H. Chiroi, Speech enhancement based on deep denoising autoencoder, Interspeech
2012.
• Z. Chen, S. Watanabe, H. Erdogan, J. R. Hershey, Integration of speech enhancement and recognition using long-
short term memory recurrent neural network, Interspeech 2015.
• F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. L. Roux, J. R. Hershey, and B. Schuller, Speech enhancement
with LSTM recurrent neural networks and Its application to noise-robust ASR, LVA/ICA, 2015.
• S.-W. Fu, Y. Tsao, and X.-G. Lu, SNR-aware convolutional neural network modeling for speech enhancement,
Interspeech, 2016.
• S.-W. Fu, Y. Tsao, X.-G. Lu, and Hisashi Kawai, End-to-end waveform utterance enhancement for direct evaluation
metrics optimization by fully convolutional neural networks, IEEE/ACM TASLP, 2018.
Speech enhancement (GAN-based methods)
• P. Santiago, B. Antonio, and S. Joan, SEGAN: Speech enhancement generative adversarial network, Interspeech,
2017.
• D. Michelsanti, and Z.-H. Tan, Conditional generative adversarial networks for speech enhancement and noise-
robust speaker verification, Interspeech, 2017.
• C. Donahue, B. Li, and P. Rohit, Exploring speech enhancement with generative adversarial networks for robust
speech recognition, ICASSP, 2018.
• T. Higuchi Takuya, K. Kinoshita, D. Marc, and T. Nakatani. Adversarial training for data-driven speech
enhancement without parallel Corpus, ASRU, 2017.
• S. Pascual, M. Park, J. Serrà, A. Bonafonte, K.-H. Ahn, Language and noise transfer in speech enhancement
generative adversarial network, ICASSP 2018.
164.
References
Speech enhancement (GAN-based methods)
• A. Pandey and D. Wang, On adversarial training and loss functions for speech enhancement, ICASSP 2018.
• M. H. Soni, Neil Shah, and H. A. Patil, Time-frequency masking-based speech enhancement using generative
adversarial network, ICASSP 2018.
• Z. Meng, J.-Y. Li, Y.-G. Gong, B.-H. Juang, Adversarial feature-mapping for speech enhancement, Interspeech, 2018.
• L.-W. Chen, M.Yu, Y.-M. Qian, D. Su, D. Yu, Permutation invariant training of generative adversarial network for
monaural speech separation, Interspeech 2018.
• D. Baby and S. Verhulst, Sergan: Speech enhancement using relativistic generative adversarial networks with
gradient penalty, ICASSP 2019.
165.
Postfilter (conventional methods)
• T. Toda, and K. Tokuda, A speech parameter generation algorithm considering global variance for HMM-based
speech synthesis, IEICE Trans. Inf. Syst., 2007.
• H. Sil’en, E. Helander, J. Nurminen, and M. Gabbouj, Ways to implement global variance in statistical speech
synthesis, Interspeech, 2012.
• S. Takamichi, T. Toda, N. Graham, S. Sakriani, and S. Nakamura, A postfilter to modify the modulation spectrum
in HMM-based speech synthesis, ICASSP, 2014.
• L.-H. Chen, T. Raitio, C. V. Botinhao, J. Yamagishi, and Z.-H. Ling, DNN-based stochastic postfilter for HMM-
based speech synthesis, Interspeech, 2014.
• L.-H. Chen, T. Raitio, C. V. Botinhao, Z.-H. Ling, and J. Yamagishi, A deep generative architecture for postfiltering
in statistical parametric speech synthesis, IEEE/ACM TASLP, 2015.
Postfilter (GAN-based methods)
• K. Takuhiro, K. Hirokazu, H. Nobukatsu, Y. Ijima, K. Hiramatsu, and K. Kashino, Generative adversarial network-
based postfilter for statistical parametric speech synthesis, ICASSP, 2017.
• K. Takuhiro, T. Shinji, K. Hirokazu, and J. Yamagishi, Generative adversarial network-based postfilter for STFT
spectrograms, Interspeech, 2017.
• Y. Saito, S. Takamichi, and H. Saruwatari, Training algorithm to deceive anti-spoofing verification for DNN-based
speech synthesis, ICASSP, 2017.
• Y. Saito, S. Takamichi, H. Saruwatari, Statistical parametric speech synthesis incorporating generative
adversarial networks, IEEE/ACM TASLP, 2018.
• B. Bollepalli, L. Juvela, and A. Paavo, Generative adversarial network-based glottal waveform model for
statistical parametric speech synthesis, Interspeech, 2017.
• S. Yang, L. Xie, X. Chen, X.-Y. Lou, X. Zhu, D.-Y. Huang, and H.-Z. Li, Statistical parametric speech synthesis using
generative adversarial networks under a multi-task learning framework, ASRU, 2017.
References
166.
VC (conventional methods)
• T. Toda, A. W. Black, and K. Tokuda, Voice conversion based on maximum likelihood estimation of spectral
parameter trajectory, IEEE/ACM TASLP, 2007.
• L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, Voice conversion using deep neural networks with layer-wise
generative training, IEEE/ACM TASLP, 2014.
• S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, Spectral mapping using artificial neural networks for
voice conversion, IEEE/ACM TASLP, 2010.
• T. Nakashika, T. Takiguchi, Y. Ariki, High-order sequence modeling using speaker-dependent recurrent temporal
restricted boltzmann machines for voice conversion, Interspeech, 2014.
• K. Takuhiro, K. Hirokazu, H. Kaoru, and K. Kunio, Sequence-to-sequence voice conversion with similarity metric
learned using generative adversarial networks, Interspeech, 2017.
• Z.-Z. Wu, T. Virtanen, E.-S. Chng, and H.-Z. Li, Exemplar-based sparse representation with residual compensation
for voice conversion, IEEE/ACM TASLP, 2014.
• S.-W. Fu, P.-C. Li, Y.-H. Lai, C.-C. Yang, L.-C. Hsieh, and Y. Tsao, Joint dictionary learning-based non-negative matrix
factorization for voice conversion to improve speech intelligibility after oral surgery, IEEE TBME, 2017.
• Y.-C. Wu, H.-T. Hwang, C.-C. Hsu, Y. Tsao, and H.-M. Wang, Locally linear embedding for exemplar-based spectral
conversion, Interspeech, 2016.
• C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, Y., and H.-M. Wang, Voice conversion from non-parallel corpora using
variational auto-encoder. APSIPA 2016.
VC (GAN-based methods)
• C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang Voice conversion from unaligned corpora using
variational autoencoding wasserstein generative adversarial networks, Interspeech 2017.
• K. Takuhiro, K. Hirokazu, H. Kaoru, and K. Kunio, Sequence-to-sequence voice conversion with similarity metric
learned using generative adversarial networks, Interspeech, 2017.
References
167.
VC (GAN-based methods)
• K. Takuhiro, and K. Hirokazu, Parallel-data-free voice conversion using cycle-consistent adversarial networks,
arXiv, 2017.
• N. Shah, N. J. Shah, and H. A. Patil, Effectiveness of generative adversarial network for non-audible murmur-to-
whisper speech conversion, Interspeech, 2018.
• J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee, Multi-target voice conversion without parallel data by adversarially
learning disentangled audio representations, Interspeech, 2018.
• G. Degottex, and M. Gales, A spectrally weighted mixture of least square error and wasserstein discriminator
loss for generative SPSS, SLT, 2018.
• B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, Adaptive wavenet vocoder for residual compensation in
GAN-based voice conversion, SLT, 2018.
• C.-C. Yeh, P.-C. Hsu, J.-C. Chou, H.-Y. Lee, and L.-S. Lee, Rhythm-flexible voice conversion without parallel data
using cycle-GAN over phoneme posteriorgram sequences, SLT, 2018.
• H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, STARGAN-VC: Non-parallel many-to-many voice conversion with
star generative adversarial networks, SLT, 2018.
• K. Tanaka, T. Kaneko, N. Hojo, and H. Kameoka, Synthetic-to-natural speech waveform conversion using cycle-
consistent adversarial networks, SLT, 2018.
• O. Ocal, O. H. Elibol, G. Keskin, C. Stephenson, A. Thomas, and K. Ramchandran, Adversarially trained
autoencoders for parallel-data-free voice conversion, ICASSP, 2019.
• F. Fang, X. Wang, J. Yamagishi, and I. Echizen, Audiovisual speaker conversion: Jointly and simultaneously
transforming facial expression and acoustic characteristics, ICASSP, 2019.
• S. Seshadri, L. Juvela, J. Yamagishi, Okko Räsänen, and P. Alku, Cycle-consistent adversarial networks for non-
parallel vocal effort based speaking style conversion, ICASSP, 2019.
• T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, CYCLEGAN-VC2: Improved cyclegan-based non-parallel voice
conversion, ICASSP, 2019.
• L. Juvela, B. Bollepalli, J. Yamagishi, and P. Alku, Waveform generation for text-to-speech synthesis using pitch-
synchronous multi-scale generative adversarial networks, ICASSP, 2019.
References
168.
Speaker recognition
• Q. Wang, W. Rao, S.-I. Sun, L. Xie, E.-S. Chng, and H.-Z. Li, Unsupervised domain adaptation via domain
adversarial training for speaker recognition, ICASSP, 2018.
• H. Yu, Z.-H. Tan, Z.-Y. Ma, and J. Guo, Adversarial network bottleneck features for noise robust speaker
verification, arXiv, 2017.
• G. Bhattacharya, J. Alam, & P. Kenny, Adapting end-to-end neural speaker verification to new languages and
recording conditions with adversarial training, ICASSP, 2019.
• Z. Peng, S. Feng, & T. Lee, Adversarial multi-task deep features and unsupervised back-end adaptation for
language recognition, ICASSP, 2019.
• Z. Meng, Y. Zhao, J. Li, & Y. Gong, Adversarial speaker verification, ICASSP, 2019.
• X. Fang, L. Zou, J. Li, L. Sun, & Z.-H. Ling, Channel adversarial training for cross-channel text-independent
speaker recognition, ICASSP, 2019.
• W. Xia, J. Huang, & J. H. Hansen, Cross-lingual text-independent speaker verification using unsupervised
adversarial discriminative domain adaptation, ICASSP, 2019.
• P. S. Nidadavolu, J. Villalba, & N. Dehak, Cycle-GANs for domain adaptation of acoustic features for speaker
recognition, ICASSP, 2019.
• G. Bhattacharya, J. Monteiro, J. Alam, & P. Kenny, Generative adversarial speaker embedding networks for
domain robust end-to-end speaker verification, ICASSP, 2019.
• J. Rohdin, T. Stafylakis, A. Silnova, H. Zeinali, L. Burget, & O. Plchot, Speaker verification using end-to-end
adversarial language adaptation, ICASSP, 2019.
• Zhou, J., Jiang, T., Li, L., Hong, Q., Wang, Z., & Xia, B., Training multi-task adversarial network for extracting
noise-robust speaker embedding, ICASSP, 2019.
• J. Zhang, N. Inoue, & K. Shinoda, I-vector transformation using conditional generative adversarial networks for
short utterance speaker verification, arXiv, 2018.
• W. Ding, & L. He, Mtgan: Speaker verification through multitasking triplet generative adversarial networks, arXiv,
2018.
• X. Miao, I. McLoughlin, S. Yao, & Y. Yan, Improved conditional generative adversarial net classification for
spoken language recognition, SLT, 2018.
References
169.
Automatic Speech Recognition
• Yusuke Shinohara, Adversarial multi-task learning of deep neural networks for robust speech recognition,
Interspeech, 2016.
• D. Serdyuk, K. Audhkhasi, P. Brakel, B. Ramabhadran, S. Thomas, and Y. Bengio, Invariant Representations for
Noisy Speech Recognition, arXiv, 2016.
• Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara, Cross-domain speech recognition using nonparallel
corpora with cycle-consistent adversarial networks, ASRU, 2017.
• A. Sriram, H.-W Jun, Y. Gaur, and S. Satheesh, Robust speech recognition using generative adversarial networks,
arXiv, 2017.
• Z. Meng, Z. Chen, V. Mazalov, J. Li, J., and Y. Gong, Unsupervised adaptation with domain separation networks
for robust speech recognition, ASRU, 2017.
• Z. Meng, J. Li, Z. Chen, Y. Zhao, V. Mazalov, Y. Gong, and B.-H. Juang, Speaker-invariant training via adversarial
learning, ICASSP, 2018.
• Z. Meng, J. Li, Y. Gong, and B.-H. Juang, Adversarial teacher-student learning for unsupervised domain
adaptation, ICASSP, 2018.
• Y. Zhang, P. Zhang, and Y. Yan, Improving language modeling with an adversarial critic for automatic speech
recognition, Interspeech, 2018.
• S. Sun, C. Yeh, M. Ostendorf, M. Hwang, and L. Xie, Training augmentation with adversarial examples for robust
speech recognition, Interspeech, 2018.
• Z. Meng, J. Li, Y. Gong, and B.-H. Juang, Adversarial feature-mapping for speech enhancement, Interspeech
2018.
• K. Wang, J. Zhang, S. Sun, Y. Wang, F. Xiang, and L. Xie, Investigating generative adversarial networks based
speech dereverberation for robust speech recognition, Interspeech 2018.
• Z. Meng, J. Li, Y. Gong, B.-H. Juang, Cycle-consistent speech enhancement, Interspeech 2018.
• J. Drexler and J. Glass, Combining end-to-end and adversarial training for low-resource speech recognition, SLT,
2018.
• A. H. Liu, H. Lee and L. Lee, Adversarial training of end-to-end speech recognition using a criticizing language
model, ICASSP, 2019.
References
170.
Automatic Speech Recognition
• J. Yi, J. Tao and Y. Bai, Language-invariant bottleneck features from adversarial end-to-end acoustic models for
low resource speech recognition, ICASSP, 2019.
• D. Haws and X. Cui, Cyclegan bandwidth extension acoustic modeling for automatic speech recognition, ICASSP,
2019.
• Z. Meng, J. Li, J. and Y. Gong, Attentive adversarial learning for domain-Invariant training, ICASSP, 2019.
• Z. Meng, Y. Zhao, J. Li, and Y. Gong, Adversarial speaker verification, ICASSP, 2019.
• Z. Meng, Y. Zhao, J. Li, and Y. Gong., Adversarial speaker adaptation, ICASSP, 2019.
Emotion recognition
• J. Chang, and S. Scherer, Learning representations of emotional speech with deep convolutional generative
adversarial networks, ICASSP, 2017.
• S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, and C. Espy-Wilson, Adversarial auto-encoders for speech
based emotion recognition. Interspeech, 2017.
• S. Sahu, R. Gupta, and C. E.-Wilson, On enhancing speech emotion recognition using generative adversarial
networks, Interspeech 2018.
• C.-M. Chang, and C.-C. Lee, Adversarially-enriched acoustic code vector learned from out-of-context affective
corpus for robust emotion recognition, ICASSP 2019.
• J. Liang, S. Chen, J. Zhao, Q. Jin, H. Liu, and L. Lu, Cross-culture multimodal emotion recognition with adversarial
learning, ICASSP 2019.
Lipreading
• M. Wand, and J. Schmidhuber, Improving speaker-independent lipreading with domain-adversarial training,
arXiv, 2017.
References
Speech Enhancement
• Noise Adaptive Speech Enhancement (NA-SE) [Liao et al., Interspeech 2019] [Wed-P-6-E]
[Diagram: training noise types N4, N5, N7, N9, N10, N12; the testing noise type N11 is unseen. An encoder E and generator G are trained to minimize the reconstruction error $V_y$: $\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$, $\theta_E \leftarrow \theta_E - \epsilon \frac{\partial V_y}{\partial \theta_E}$.]
173.
Speech Enhancement (NA-SE)
• Domain adversarial training for NA-SE: in addition to E and G (minimizing the reconstruction error $V_y$), a discriminator D predicts the noise type (domain) with loss $V_z$, connected to E through a GRL.
Updates: $\theta_G \leftarrow \theta_G - \epsilon \frac{\partial V_y}{\partial \theta_G}$ (min reconstruction error), $\theta_D \leftarrow \theta_D - \epsilon \frac{\partial V_z}{\partial \theta_D}$ (max domain accuracy), $\theta_E \leftarrow \theta_E - \epsilon\left(\frac{\partial V_y}{\partial \theta_E} + \alpha \frac{\partial V_z}{\partial \theta_E}\right)$ through the GRL (min reconstruction error and min domain accuracy).
174.
Speech Enhancement (NA-SE)
• Objective evaluations (Fig. 15: PESQ at different SNR levels).
The DAT-based unsupervised adaptation can notably overcome the mismatch between training and testing noise types.
175.
Speech Enhancement
• GAN for spectral magnitude mask estimation (MMS-GAN) [Pandey et al., ICASSP 2018] (recap): G estimates the mask from the noisy input; D scores (noisy, output mask) vs. (noisy, reference mask) pairs.
176.
Speech Enhancement
• MetricGAN for Speech Enhancement [Fu et al., ICML 2019]: G takes the noisy spectrogram and outputs a mask; point-wise multiplication gives the enhanced spectrogram. D takes the enhanced (or clean) spectrogram together with the clean spectrogram and predicts a metric score in [0, 1] (e.g., 0.4 for the enhanced output, 1.0 for clean). A sketch follows below.
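A hedged sketch of the MetricGAN idea (not the authors' code): D is trained to regress a black-box metric score (e.g., normalized PESQ in [0, 1]) of the enhanced spectrogram against the clean reference, and G is trained so that D's predicted score of its output approaches 1. `pesq_score`, the two-input D, and the optimizers are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def metricgan_step(noisy, clean, G, D, pesq_score, opt_G, opt_D):
    enhanced = G(noisy) * noisy                             # mask times noisy spectrogram

    # Train D to approximate the true (non-differentiable) metric:
    # (enhanced, clean) -> its measured score, (clean, clean) -> the best score 1.0
    target = torch.tensor([pesq_score(enhanced.detach(), clean)])
    loss_D = F.mse_loss(D(enhanced.detach(), clean), target) + \
             F.mse_loss(D(clean, clean), torch.ones(1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Train G so that D's (differentiable) predicted score of its output approaches 1.0
    loss_G = F.mse_loss(D(G(noisy) * noisy, clean), torch.ones(1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```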
Voice Conversion
• Multi-target VC [Chou et al., Interspeech 2018]
➢ Stage 1: an encoder Enc extracts content enc(x), a classifier C is used adversarially so that enc(x) carries little speaker information, and a decoder Dec reconstructs speech dec(enc(x), y) conditioned on a chosen speaker y (or y').
➢ Stage 2: a generator G refines the converted speech y'', and a discriminator with an auxiliary classifier (D+C) judges real/fake (F/R) and speaker identity (ID) against real data.
179.
Voice Conversion (Multi-target VC)
• Subjective evaluations (Fig. 16: preference test results).
1. The proposed method uses non-parallel data.
2. The multi-target VC approach outperforms the one-stage-only variant.
3. The multi-target VC approach is comparable to Cycle-GAN-VC in terms of naturalness and similarity.
180.
Voice Conversion
• Controller-generator-discriminator VC on Impaired Speech [Chen et al., Interspeech 2019] [Mon-P-2-A]
Previous applications: hearing aids; murmur-to-normal speech; bone-conduction microphone to air-conduction microphone.
Proposed: improving the speech intelligibility of surgical patients. Target: oral cancer (a top-five cancer for males in Taiwan). (Before/after audio examples.)
Voice Conversion (CGD VC)
• Spectrogram analysis (Fig. 17: spectrogram comparison of CGD with CycleGAN).
183.
Voice Conversion (CGD VC)
• Subjective evaluations (Fig. 18: MOS for content similarity, speaker similarity, and articulation).
The proposed method outperforms conditional GAN and CycleGAN in terms of content similarity, speaker similarity, and articulation.
184.
Pathological Voice Detection
• Detection of Pathological Voice Using Cepstrum Vectors: A Deep Learning Approach [Fang et al., Journal of Voice 2018]
Table 17: Detection performance based on voice.
            GMM     SVM     DNN
MEEI        98.28   98.26   99.14
FEMH (M)    90.24   93.04   94.26
FEMH (F)    90.20   87.40   90.52
185.
Pathological Voice Detection
• Robustness Against Channel Effects [Hsu et al., NeurIPS Workshop 2018]
[Same DAT architecture: encoder E, classifier G with loss $V_y$ for the label y, and domain discriminator D with loss $V_z$.]
Table 18: Detection results of supervised and unsupervised DAT under channel mismatches.
          DNN (S)   DNN (T)   DNN (FT)   Unsup. DAT   Sup. DAT
PR-AUC    0.8848    0.8509    0.9021     0.9455       0.9522
The unsupervised DAT notably increased robustness against channel effects and produced results comparable to supervised DAT.
186.
• C.-F. Liao, Y. Tsao, H.-Y. Lee and H.-M. Wang, Noise adaptive speech enhancement using domain adversarial
training, Interspeech 2019.
• J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee, Multi-target voice conversion without parallel data by
adversarially learning disentangled audio representations, Interspeech 2018.
• L.-W. Chen, H.-Y. Lee, and Y. Tsao, Generative adversarial networks for unpaired voice transformation on
impaired speech, Interspeech 2019.
• S.-W. Fu, C.-F. Liao, Y. Tsao, S.-D. Lin, MetricGAN: Generative adversarial networks based black-box metric
scores optimization for speech enhancement, ICML, 2019.
• C.-T. Wang, F.-C. Lin, J.-Y. Chen, M.-J. Hsiao, S.-H. Fang, Y.-H. Lai, Y. Tsao, Detection of pathological voice using
cepstrum vectors: a deep learning approach, Journal of Voice, 2018.
• S.-Y. Tsui, Y. Tsao, C.-W. Lin, S.-H. Fang, and C.-T. Wang, Demographic and symptomatic features of voice
disorders and their potential application in classification using machine learning algorithms, Folia Phoniatrica et
Logopaedica, 2018.
• S.-H. Fang, C.-T. Wang, J.-Y. Chen, Y. Tsao and F.-C. Lin, Combining acoustic signals and medical records to
improve pathological voice classification, APSIPA, 2019.
• Y.-T. Hsu, Z. Zhu, C.-T. Wang, S.-H. Fang, F. Rudzicz, and Y. Tsao, Robustness against the channel effect in
pathological voice detection, NeurIPS 2018 Machine Learning for Health (ML4H) Workshop, 2018.
References
187.
Thank You Very Much
Yu Tsao, Ph.D., Academia Sinica
yu.tsao@citi.sinica.edu.tw
Generative Adversarial Network
and its Applications to Signal Processing
and Natural Language Processing
Part III: Speech Signal Processing
NLP tasks usually involve Sequence Generation
How to use GAN to improve sequence generation?
190.
Outline of Part IV
Sequence Generation by GAN
Unsupervised Conditional Sequence Generation
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
• Unsupervised Speech Recognition
191.
Why do we need GAN?
• Chat-bot as example
Encoder-Decoder (Seq2seq)
Input sentence c → output sentence x
Training data:
A: How are you ?
B: I'm good.
…………
Given the input "How are you?", the model may output "Not bad" or "I'm John."
The training criterion (maximize likelihood) and human judgment can disagree about which response is better.
192.
Reinforcement Learning
Human
Chatbot (En-De): input sentence c → response sentence x
[Li, et al., EMNLP, 2016]
A human evaluates the response and gives a reward R(c, x); the chatbot learns to maximize the expected reward, e.g. by policy gradient.
Example: for the input "How are you?", the human assigns rewards (e.g. +1 / −1) to responses such as "Not bad" and "I'm John".
Conditional GAN
Discriminator
Chatbot (En-De): input sentence c → response sentence x, e.g. "I am busy."
The discriminator gives the reward R(c, x): replace human evaluation with machine evaluation. [Li, et al., EMNLP, 2017]
However, there is an issue when you train your generator.
(Figure: the chatbot decodes the response token by token from <BOS>, sampling a discrete token, e.g. A or B, at each step; the discriminator then assigns the complete response a scalar score used to update the parameters.)
Can we use gradient ascent? NO! The discrete sampling step makes part of the computation non-differentiable.
198.
Three Categories of Solutions
Gumbel-softmax
• [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al. ICLR, 2019]
Continuous Input for Discriminator
• [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen
Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML,
2017]
Reinforcement Learning
• [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv,
2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William
Fedus, et al., ICLR, 2018]
201.
(Figure: the same token-by-token decoder starting from <BOS>, but instead of sampling a discrete token at each step, the full output distribution is fed to the discriminator.)
Use the distribution as the input of the discriminator to avoid the sampling process; the discriminator still outputs a scalar, and we can do backpropagation now to update the parameters.
202.
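The two related ideas (feeding a continuous distribution to the discriminator, or using the Gumbel-softmax relaxation) can be sketched in a few lines of PyTorch. The toy shapes, the linear "soft embedding", and the toy discriminator below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 1000, 10, 8
logits = torch.randn(batch, seq_len, vocab_size, requires_grad=True)  # decoder outputs

# Option A: feed the softmax distribution directly (continuous input).
soft_words = F.softmax(logits, dim=-1)

# Option B: straight-through Gumbel-softmax -- near-one-hot samples in the
# forward pass, but gradients flow through the underlying soft distribution.
st_words = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)

soft_embed = torch.nn.Linear(vocab_size, 256, bias=False)    # "soft" word embedding
discriminator = torch.nn.Sequential(                         # toy sentence discriminator
    torch.nn.Flatten(), torch.nn.Linear(seq_len * 256, 1))

score = discriminator(soft_embed(st_words))   # one scalar per sentence
score.mean().backward()                       # gradients reach the generator logits
print(logits.grad.shape)                      # torch.Size([8, 10, 1000])
```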
What is the problem?
• A real sentence is a sequence of one-hot vectors, e.g. (1, 0, 0, 0, 0, 0) for each word.
• A generated sequence is a sequence of soft distributions, e.g. (0.9, 0.1, 0, 0, 0, 0) or (0.1, 0.1, 0.7, 0.1, 0, 0), which can never be one-hot.
The discriminator can immediately find the difference.
A discriminator with constraints (e.g. WGAN) can be helpful.
203.
Three Categories of Solutions
Gumbel-softmax
• [Matt J. Kusner, et al., arXiv, 2016][Weili Nie, et al. ICLR, 2019]
Continuous Input for Discriminator
• [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen
Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML,
2017]
Reinforcement Learning
• [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv,
2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William
Fedus, et al., ICLR, 2018]
Tips for Sequence Generation GAN
RL is difficult to train; GAN is difficult to train; a sequence generation GAN (RL + GAN) combines both.
206.
Tips for Sequence Generation GAN
• Usually the generator is fine-tuned from a model learned by maximum likelihood.
• However, with enough hyperparameter tuning and tricks, ScratchGAN can be trained from scratch. [Cyprien de Masson d'Autume, et al., arXiv 2019]
207.
Tips for Sequence Generation GAN
• Typical: the discriminator gives a single score to the complete response, e.g. "You is good" → 0.1, so the generator does not know which part is wrong.
• Reward for every generation step: the discriminator scores every partial sequence, e.g. "You" → 0.9, "You is" → 0.1, "You is good" → 0.1.
208.
Tips for Sequence Generation GAN
• Reward for every generation step: "You" → 0.9, "You is" → 0.1, "You is good" → 0.1.
How to obtain step-wise rewards (a sketch follows below):
Method 1. Monte Carlo (MC) search [Yu, et al., AAAI, 2017]
Method 2. Discriminator for partially decoded sequences [Li, et al., EMNLP, 2017]
Method 3. Step-wise evaluation [Tuan, Lee, TASLP, 2019][Xu, et al., EMNLP, 2018][William Fedus, et al., ICLR, 2018]
209.
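A rough sketch of Method 1 (Monte Carlo search in the style of SeqGAN): the reward of a partial sequence is estimated by completing it several times and averaging the discriminator scores of the full sentences, and the per-step rewards then weight a REINFORCE-style loss. generator.rollout and discriminator are schematic placeholders, not a specific paper's API.

```python
import torch

def prefix_reward(prefix, generator, discriminator, n_rollouts=8):
    """Estimate the reward of a partial sequence by sampling n_rollouts
    completions and averaging the discriminator scores of the full sentences."""
    scores = []
    for _ in range(n_rollouts):
        full_sentence = generator.rollout(prefix)       # sample a completion
        scores.append(discriminator(full_sentence))     # scalar in [0, 1]
    return torch.stack(scores).mean()

def policy_gradient_loss(step_log_probs, step_rewards, baseline=0.5):
    """REINFORCE-style loss where every generation step t is weighted by its
    own estimated reward, so the generator knows which part was good or bad."""
    rewards = torch.stack(step_rewards).detach() - baseline   # no gradient into D
    return -(torch.stack(step_log_probs) * rewards).sum()
```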
Empirical Performance
• MLE frequently generates "I'm sorry", "I don't
know", etc. (corresponding to fuzzy images?)
• GAN generates longer and more complex responses.
• Find more comparisons in the survey papers.
• [Lu, et al., arXiv, 2018][Zhu, et al., arXiv, 2018]
• However, no strong evidence shows that GANs are
better than MLE.
• [Stanislau Semeniuta, et al., arXiv, 2018] [Guy Tevet, et al., arXiv, 2018]
[Massimo Caccia, et al., arXiv, 2018]
210.
More Applications
• Supervised machine translation [Wu, et al., arXiv
2017][Yang, et al., arXiv 2017]
• Supervised abstractive summarization [Liu, et al., AAAI
2018]
• Image/video caption generation [Rakshith Shetty, et al., ICCV
2017][Liang, et al., arXiv 2017]
• Data augmentation for code-switching ASR [Mon-P-
1-D] [Chang, et al., INTERSPEECH 2019]
If you are trying to generate some sequences,
you can consider GAN.
211.
Outline of Part IV
Sequence Generation by GAN
Unsupervised Conditional Sequence Generation
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
• Unsupervised Speech Recognition
Cycle-GAN
(Figure: G_X→Y maps domain X to domain Y and G_Y→X maps back, so the reconstruction is as close as possible to the input; the same cycle runs in the opposite direction starting from domain Y. D_Y outputs a scalar: belongs to domain Y or not; D_X outputs a scalar: belongs to domain X or not.)
214.
Cycle-GAN
(Figure: the same two cycles with domain X = negative sentences and domain Y = positive sentences; D_X asks "negative sentence?" and D_Y asks "positive sentence?".
Example cycles: "It is bad." → "It is good." → "It is bad."; "I love you." → "I hate you." → "I love you.")
Non-differentiable issue? You already know how to deal with it.
215.
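A schematic of the CycleGAN objective for text style transfer, written as if all modules operate on continuous sentence representations so the cycle is differentiable (see the tricks earlier in this part). G_xy, G_yx, D_x, D_y, the L1 reconstruction distance and the weight lam are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cycle_gan_generator_loss(x_neg, y_pos, G_xy, G_yx, D_x, D_y, lam=10.0):
    """Generator-side objective for negative (X) <-> positive (Y) transfer;
    discriminator training is omitted for brevity."""
    fake_y = G_xy(x_neg)                  # negative -> "positive"
    fake_x = G_yx(y_pos)                  # positive -> "negative"

    # Adversarial terms: fool the two domain discriminators.
    adv = -(torch.log(D_y(fake_y) + 1e-8).mean() +
            torch.log(D_x(fake_x) + 1e-8).mean())

    # Cycle-consistency terms: transfer and come back "as close as possible".
    cyc = F.l1_loss(G_yx(fake_y), x_neg) + F.l1_loss(G_xy(fake_x), y_pos)

    return adv + lam * cyc
```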
✘ Negative sentence to positive sentence:
it's a crappy day -> it's a great day
i wish you could be here -> you could be here
it's not a good idea -> it's good idea
i miss you -> i love you
i don't love you -> i love you
i can't do that -> i can do that
i feel so sad -> i happy
it's a bad day -> it's a good day
it's a dummy day -> it's a great day
sorry for doing such a horrible thing -> thanks for doing a
great thing
my doggy is sick -> my doggy is my doggy
my little doggy is sick -> my little doggy is my little doggy
Cycle GAN
(Thanks to Yau-Shian Wang for providing the experimental results.)
[Lee, et al.,
ICASSP, 2018]
216.
Shared Latent Space
(Figure: encoders EN_X and EN_Y map sentences from the two domains (positive and negative sentences) into a shared latent space; decoders DE_X and DE_Y map the latent code back to each domain; D_X and D_Y are the discriminators of the X and Y domains.)
Decoder hidden layer as discriminator input [Shen, et al., NIPS, 2017]
Domain discriminator: EN_X and EN_Y are trained to fool a discriminator that guesses whether a latent code came from EN_X or EN_Y. [Zhao, et al., arXiv, 2017][Fu, et al., AAAI, 2018]
Abstractive Summarization
• Now machines can do abstractive summarization by seq2seq (writing summaries in their own words).
(Figure: document → seq2seq → summary, trained on document–summary pairs.)
Supervised: we need lots of labelled training data.
219.
Unsupervised Abstractive Summarization
• Treat abstractive summarization as unsupervised conditional sequence generation: documents form one domain and summaries form the other, without paired data. [Wang, et al., EMNLP, 2018]
Unsupervised Abstractive Summarization
(Figure: a generator G (seq2seq) maps the document to a word sequence (the candidate summary), and a reconstructor R (seq2seq) maps that word sequence back to the document, minimizing the reconstruction error.)
This is a seq2seq2seq auto-encoder using a sequence of words as its latent representation; only a large collection of documents is needed to train the model.
Problem: the latent word sequence is not readable…
223.
Unsupervised Abstractive Summarization
(Figure: the same G–R seq2seq2seq auto-encoder, with an additional discriminator D trained on human-written summaries to judge whether the latent word sequence is real or not; G learns to make the discriminator consider its output real, so the summary becomes readable.)
224.
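The training signal of the seq2seq2seq auto-encoder plus discriminator can be summarized in a short sketch. The modules G, R, D, the reconstruction_loss helper and the weight lam are assumptions made for illustration, not the exact losses of [Wang, Lee, EMNLP 2018].

```python
import torch

def unsup_summarization_step(document, real_summaries, G, R, D, lam=1.0):
    """G: document -> word sequence (candidate summary), R: summary -> document,
    D: judges whether a word sequence looks like a human-written summary.
    G and R are assumed to expose differentiable outputs (e.g. via the
    continuous / Gumbel-softmax tricks from the sequence-generation part)."""
    summary = G(document)
    loss_rec = R.reconstruction_loss(summary, document)       # assumed helper
    loss_readable = -torch.log(D(summary) + 1e-8).mean()      # make D say "real"
    loss_d = -(torch.log(D(real_summaries) + 1e-8).mean() +
               torch.log(1.0 - D(summary.detach()) + 1e-8).mean())
    return loss_rec + lam * loss_readable, loss_d             # (generator loss, D loss)
```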
Experimental results
English Gigaword (document title as summary)

                                 ROUGE-1   ROUGE-2   ROUGE-L
Supervised                       33.2      14.2      30.5
Trivial                          21.9      7.7       20.5
Unsupervised (matched data)      28.1      10.0      25.4
Unsupervised (no matched data)   27.2      9.1       24.1

• Matched data: using the titles of English Gigaword to train the discriminator.
• No matched data: using the titles of CNN/Daily Mail to train the discriminator.
[Wang, Lee, EMNLP 2018]
225.
Semi-supervised Learning
(Figure: ROUGE-1 versus the number of document–summary pairs used (from 0 to 10k to 500k). The WGAN and Reinforce curves (the two approaches to the discrete-output issue) start from the unsupervised model and improve as labelled pairs are added (semi-supervised); the fully supervised model uses 3.8M pairs.)
[Wang, Lee, EMNLP 2018]
226.
More Unsupervised Summarization
• Unsupervised summarization with a language prior [Christos Baziotis, et al., NAACL 2019]
• Unsupervised multi-document summarization [Eric Chu, Peter Liu, ICML 2019]
227.
Dialogue Response Generation
(Figure: a chatbot generates a response to an input sentence from general dialogues; a reconstructor R maps the generated response back to the input sentence to minimize the reconstruction error, and a discriminator D, trained on what Trump has said (e.g. "Make the US great again", "I would build a great wall", "you are fired"), judges whether the generated response could have been said by Trump.)
[Su, et al., INTERSPEECH, 2019] (Thu-P-9-C)
Towards Unsupervised ASR
- Cycle GAN
(Figure: an ASR model G maps speech to text and a TTS model R maps that text back to speech, minimizing the reconstruction error (speech chain); a discriminator D trained on real text, e.g. "how are you", "good morning", "i am fine", judges whether the ASR output is real text.)
[Andros Tjandra, et al., ASRU 2017]
[Liu, et al., INTERSPEECH 2018]
[Yeh, et al., ICLR 2019]
[Chen, et al., INTERSPEECH 2019]
232.
Towards Unsupervised ASR
-Cycle GAN
• Unsupervised setting on TIMIT (text and audio are
unpaired; the text is not the transcription of the audio)
• 63.6% PER (oracle boundaries)
• 41.6% PER (automatic segmentation)
• 33.1% PER (automatic segmentation)
• Semi-supervised setting on Librispeech
[Liu, et al., INTERSPEECH 2018]
[Yeh, et al., ICLR 2019]
(Tue-P-4-B)[Chen, et al., INTERSPEECH 2019]
[Liu, et al., ICASSP 2019]
[Tomoki Hayashi, et al., SLT 2018]
[Takaaki Hori, et al., ICASSP 2019]
[Murali Karthick Baskar, et al., INTERSPEECH 2019]
233.
Towards Unsupervised ASR
-Shared Latent Space
(Figure: a text encoder and an audio encoder map into a shared latent space; an audio decoder and a text decoder map back out, so the sentence "this is text" can be reconstructed through either modality.)
Unsupervised setting on Librispeech: 76.3% WER
WSJ with 2.5 hours paired data: 64.6% WER
LJ speech with 20 mins paired data: 11.7% PER
[Chen, et al., SLT 2018]
Unsupervised speech translation is also possible!
[Chung, et al., NIPS 2018]
[Jennifer Drexler, et al., SLT 2018]
[Ren, et al., ICML 2019]
[Chung, et al., ICASSP 2019]
234.
Outline of Part IV
Sequence Generation by GAN
Unsupervised Conditional Sequence Generation
• Text Style Transfer
• Unsupervised Abstractive Summarization
• Unsupervised Translation
• Unsupervised Speech Recognition
235.
To Learn More…
https://www.youtube.com/playlist?list=PLJV_el3uVTsMd2G9ZjcpJn1YfnM9wVOBf
You can learn more from the YouTube Channel
(in Mandarin)
236.
Reference
• Sequence Generation
• Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan
Jurafsky, Deep Reinforcement Learning for Dialogue Generation, EMNLP,
2016
• Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky,
Adversarial Learning for Neural Dialogue Generation, EMNLP, 2017
• Matt J. Kusner, José Miguel Hernández-Lobato, GANS for Sequences of
Discrete Elements with the Gumbel-softmax Distribution, arXiv 2016
• Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu
Song, Yoshua Bengio, Maximum-Likelihood Augmented Discrete Generative
Adversarial Networks, arXiv 2017
• Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, SeqGAN: Sequence
Generative Adversarial Nets with Policy Gradient, AAAI 2017
• Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, Aaron
Courville, Adversarial Generation of Natural Language, arXiv, 2017
• Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, Lior Wolf, Language
Generation with Recurrent Generative Adversarial Networks without Pre-
training, ICML workshop, 2017
237.
Reference
• Sequence Generation
• Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, Xiaolong Wang,
Zhuoran Wang, Chao Qi , Neural Response Generation via GAN with an
Approximate Embedding Layer, EMNLP, 2017
• Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron
Courville, Yoshua Bengio, Professor Forcing: A New Algorithm for Training
Recurrent Networks, NIPS, 2016
• Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan
Shen, Lawrence Carin, Adversarial Feature Matching for Text Generation,
ICML, 2017
• Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, Jun Wang, Long Text
Generation via Adversarial Training with Leaked Information, AAAI, 2018
• Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, Ming-Ting Sun,
Adversarial Ranking for Language Generation, NIPS, 2017
• William Fedus, Ian Goodfellow, Andrew M. Dai, MaskGAN: Better Text
Generation via Filling in the______, ICLR, 2018
238.
Reference
• Sequence Generation
• Yi-Lin Tuan, Hung-Yi Lee, Improving Conditional Sequence Generative
Adversarial Networks by Stepwise Evaluation, TASLP, 2019
• Jingjing Xu, Xuancheng Ren, Junyang Lin, Xu Sun, Diversity-Promoting GAN:
A Cross-Entropy Based Generative Adversarial Network for Diversified Text
Generation, EMNLP, 2018
• Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, Yong Yu, Neural Text
Generation: Past, Present and Beyond, arXiv, 2018
• Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun
Wang, Yong Yu, Texygen: A Benchmarking Platform for Text Generation
Models, arXiv, 2018
• Stanislau Semeniuta, Aliaksei Severyn, Sylvain Gelly, On Accurate Evaluation
of GANs for Language Generation, arXiv, 2018
• Guy Tevet, Gavriel Habib, Vered Shwartz, Jonathan Berant, Evaluating Text
GANs as Language Models, arXiv, 2018
• Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle
Pineau, Laurent Charlin, Language GANs Falling Short, arXiv, 2018
239.
Reference
• Sequence Generation
• Zhen Yang, Wei Chen, Feng Wang, Bo Xu, Improving Neural Machine
Translation with Conditional Sequence Generative Adversarial Nets, NAACL,
2018
• Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, Tie-Yan Liu,
Adversarial Neural Machine Translation, arXiv 2017
• Linqing Liu, Yao Lu, Min Yang, Qiang Qu, Jia Zhu, Hongyan Li, Generative
Adversarial Network for Abstractive Text Summarization, AAAI 2018
• Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, Bernt
Schiele, Speaking the Same Language: Matching Machine to Human
Captions by Adversarial Training, ICCV 2017
• Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, Eric P. Xing, Recurrent
Topic-Transition GAN for Visual Paragraph Generation, arXiv 2017
• Weili Nie, Nina Narodytska, Ankit Patel, RelGAN: Relational Generative
Adversarial Networks for Text Generation, ICLR 2019
240.
Reference
• Sequence Generation
• Ching-Ting Chang, Shun-Po Chuang, Hung-Yi Lee, "Code-switching Sentence
Generation by Generative Adversarial Networks and its Application to Data
Augmentation", INTERSPEECH 2019
• Cyprien de Masson d'Autume, Mihaela Rosca, Jack Rae, Shakir Mohamed,
Training language GANs from Scratch, arXiv 2019
241.
Reference
• Text Style Transfer
• Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, Rui Yan, Style
Transfer in Text: Exploration and Evaluation, AAAI, 2018
• Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola, Style Transfer
from Non-Parallel Text by Cross-Alignment, NIPS 2017
• Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen, Hung-Yi Lee,
Lin-shan Lee, Scalable Sentiment for Sequence-to-sequence Chatbot
Response with Performance Analysis, ICASSP, 2018
• Junbo (Jake) Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, Yann LeCun,
Adversarially Regularized Autoencoders, arxiv, 2017
• Feng-Guang Su, Aliyah Hsu, Yi-Lin Tuan and Hung-yi Lee, "Personalized
Dialogue Response Generation Learned from Monologues", INTERSPEECH,
2019
242.
Reference
• Unsupervised Abstractive Summarization
• Yau-Shian Wang, Hung-Yi Lee, "Learning to Encode Text as Human-
Readable Summaries using Generative Adversarial Networks", EMNLP, 2018
• Eric Chu, Peter Liu, “MeanSum: A Neural Model for Unsupervised Multi-
Document Abstractive Summarization”, ICML, 2019
• Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, Alexandros
Potamianos, “SEQ^3: Differentiable Sequence-to-Sequence-to-Sequence
Autoencoder for Unsupervised Abstractive Sentence Compression”, NAACL
2019
Reference
• Unsupervised Speech Recognition
• Alexander H. Liu, Hung-yi Lee, Lin-shan Lee, Adversarial Training of End-to-
end Speech Recognition Using a Criticizing Language Model, ICASSP 2019
• Da-Rong Liu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee, Completely
Unsupervised Phoneme Recognition by Adversarially Learning Mapping
Relationships from Audio Embeddings, INTERSPEECH, 2018
• Kuan-yu Chen, Che-ping Tsai, Da-Rong Liu, Hung-yi Lee and Lin-shan Lee,
"Completely Unsupervised Phoneme Recognition By A Generative
Adversarial Network Harmonized With Iteratively Refined Hidden Markov
Models", INTERSPEECH, 2019
• Yi-Chen Chen, Sung-Feng Huang, Chia-Hao Shen, Hung-yi Lee, Lin-shan Lee,
"Phonetic-and-Semantic Embedding of Spoken Words with Applications in
Spoken Content Retrieval", SLT, 2018
• Chih-Kuan Yeh, Jianshu Chen, Chengzhu Yu, Dong Yu, Unsupervised Speech
Recognition via Segmental Empirical Output Distribution Matching, ICLR,
2019
245.
Reference
• Unsupervised Speech Recognition
• Takaaki Hori, Ramon Astudillo, Tomoki Hayashi, Yu Zhang, Shinji
Watanabe, Jonathan Le Roux, Cycle-consistency training for end-to-end
speech recognition, ICASSP 2019
• Murali Karthick Baskar, Shinji Watanabe, Ramon Astudillo, Takaaki
Hori, Lukáš Burget, Jan Černocký, Semi-supervised Sequence-to-sequence
ASR using Unpaired Speech and Text, INTERSPEECH 2019
• Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, Listening while Speaking:
Speech Chain by Deep Learning, ASRU 2017
• Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass, Unsupervised
Cross-Modal Alignment of Speech and Text Embedding Spaces, NIPS, 2018
• Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass, Towards
Unsupervised Speech-to-Text Translation, ICASSP 2019
• Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu, Almost
Unsupervised Text to Speech and Automatic Speech Recognition, ICML
2019
246.
Reference
• Unsupervised Speech Recognition
• Shigeki Karita , Shinji Watanabe, Tomoharu Iwata, Atsunori Ogawa, Marc
Delcroix, Semi-Supervised End-to-End Speech Recognition, INTERSPEECH,
2018
• Jennifer Drexler, James R. Glass, “Combining End-to-End and Adversarial
Training for Low-Resource Speech Recognition”, SLT 2018
• Tomoki Hayashi, Shinji Watanabe, Yu Zhang, Tomoki Toda, Takaaki
Hori, Ramon Astudillo, Kazuya Takeda, Back-Translation-Style Data
Augmentation for End-to-End ASR, SLT, 2018