DEEP LEARNING FOR SPEECH
RECOGNITION
Anantharaman Palacode Narayana Iyer
JNResearch
ananth@jnresearch.com
15 April 2016
AGENDA
 Types of Speech Recognition and applications
 Traditional implementation pipeline
 Deep Learning for Speech Recognition
 Future directions
SPEECH APPLICATIONS
 Speech recognition:
 Hands-free in a car
 Commands for personal assistants, e.g. Siri
 Gaming
 Conversational agents
 E.g. an agent for flight schedule enquiries, bookings, etc.
 Speaker identification
 E.g. forensics
 Extracting emotions and social meanings
 Text to speech
TYPES OF RECOGNITION TASKS
 Isolated word recognition
 Connected words recognition
 Continuous speech recognition (LVCSR)
 The above can be realized as:
 Speaker independent implementation
 Speaker dependent implementation
SPEECH RECOGNITION IS PROBABILISTIC
Steps:
 Train the system
 Cross-validate, fine-tune
 Test
 Deploy
[Diagram: speech signal → Speech Recognizer (ASR) → probabilistic match between the input and a set of words]
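Formally, the recognizer returns the word sequence that is most probable given the observed acoustics O; by Bayes' rule this factors into an acoustic model and a language model:

W* = argmax_W P(W | O) = argmax_W P(O | W) P(W)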
ISOLATED WORD RECOGNITION
 From the audio signal, generate features; MFCCs or filter banks are quite common
 Perform any additional pre-processing
 Using a code book of a given size, convert these features into discrete symbols. This is the vector quantization procedure, which can be implemented with k-means clustering
 Train HMMs using the Baum-Welch algorithm
 For each word in the vocabulary, instantiate an HMM
 Intuitively choose the number of states
 The set of symbols is the set of all valid code book values
 Use the HMMs to predict unseen input (see the sketch below)
[Diagram: observation sequence O is scored by HMM 1 … HMM n; the predicted word is the one whose model λ maximizes P(O|λ)]
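A minimal sketch of this whole-word pipeline, assuming librosa for MFCCs, scikit-learn for the k-means code book, and hmmlearn for the discrete HMMs (CategoricalHMM in recent releases, MultinomialHMM in older ones); the file names, vocabulary, and hyperparameters below are illustrative:

import numpy as np
import librosa
from sklearn.cluster import KMeans
from hmmlearn import hmm

N_SYMBOLS = 64   # code book size (illustrative)
N_STATES = 5     # number of HMM states, chosen intuitively per the slide

def mfcc_features(path):
    # Load audio and compute 13 MFCCs per frame; returns shape (T, 13)
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

# 1. Build the code book from all training frames (vector quantization)
train_sets = {"yes": ["yes1.wav", "yes2.wav"], "no": ["no1.wav", "no2.wav"]}
all_frames = np.vstack([mfcc_features(f) for fs in train_sets.values() for f in fs])
codebook = KMeans(n_clusters=N_SYMBOLS, n_init=10).fit(all_frames)

def quantize(path):
    # Map each MFCC frame to its nearest code word; column vector of symbol ids
    return codebook.predict(mfcc_features(path)).reshape(-1, 1)

# 2. One discrete HMM per vocabulary word, trained with Baum-Welch (EM)
models = {}
for word, files in train_sets.items():
    seqs = [quantize(f) for f in files]
    X = np.concatenate(seqs)
    lengths = [len(s) for s in seqs]
    m = hmm.CategoricalHMM(n_components=N_STATES, n_iter=50)
    m.fit(X, lengths)
    models[word] = m

# 3. Predict: pick the word whose HMM assigns the highest log-likelihood
def recognize(path):
    O = quantize(path)
    return max(models, key=lambda w: models[w].score(O))

hmmlearn's fit runs the Baum-Welch (EM) procedure, matching the training step above, and the argmax over per-word log-likelihoods implements the P(O|λ) decision rule from the diagram.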
CONTINUOUS SPEECH RECOGNITION
• ASR for continuous speech is
traditionally built using Gaussian
Mixture Models (GMMs)
• The emission probability table that
we used for discrete symbols is now
replaced by a GMM per state
• The parameters of this model are
learnt as part of training, using the
Baum-Welch procedure
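For continuous-density modelling, the same toolchain works without the quantization step; a short sketch assuming hmmlearn's GMMHMM over the MFCC helpers from the previous example (all values illustrative):

import numpy as np
from hmmlearn import hmm

# Each HMM state now emits from a Gaussian mixture, so real-valued
# MFCC frames are modelled directly instead of being quantized.
seqs = [mfcc_features(f) for f in train_sets["yes"]]   # helpers from the sketch above
X = np.concatenate(seqs)
lengths = [len(s) for s in seqs]

gmm_hmm = hmm.GMMHMM(n_components=5, n_mix=4, covariance_type="diag", n_iter=50)
gmm_hmm.fit(X, lengths)   # Baum-Welch (EM) learns mixture weights, means, transitions
print(gmm_hmm.score(X))   # total log-likelihood under the trained model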
KNOWLEDGE INTEGRATION FOR SPEECH
RECOGNITION
[Block diagram: speech → Feature Analysis → Unit Matching System → Lexical Hypothesis → Syntactic Hypothesis → Semantic Hypothesis → Utterance Verifier → recognized utterance. Knowledge sources: inventory of speech recognition units, word dictionary, grammar, task model]
SOME CHALLENGES
 We don’t know the number of words
 We don’t know the boundaries
 They are fuzzy and non-unique
 For V word reference patterns and L positions there are
exponentially many combinations, on the order of V^L: e.g. a
1,000-word vocabulary over 10 positions gives 10^30 candidate sequences
USING DEEP NETWORKS FOR ASR
 Replace the GMM with a
deep neural network (DNN) that
directly provides the
likelihood estimates
 Interface the DNN with an
HMM decoder
 Issues:
 We still need the HMM, with
its underlying assumptions,
for tractable computation
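A minimal sketch of the hybrid idea in PyTorch (names and sizes are illustrative): a DNN classifies each acoustic frame into HMM states, and its posteriors are converted to scaled likelihoods for the HMM decoder by dividing by the state priors, a standard trick in hybrid systems:

import math
import torch
import torch.nn as nn

N_FEATS, N_STATES = 39, 120   # e.g. MFCC+deltas in, HMM states out

# Frame classifier: produces posteriors P(state | frame)
dnn = nn.Sequential(
    nn.Linear(N_FEATS, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, N_STATES),
)

def scaled_log_likelihoods(frames, log_priors):
    # Hybrid trick: P(frame | state) ∝ P(state | frame) / P(state),
    # so the decoder gets log-posterior minus log-prior per frame.
    log_post = torch.log_softmax(dnn(frames), dim=-1)
    return log_post - log_priors   # feed these to the HMM/Viterbi decoder

# Usage with dummy data: 100 frames, uniform state priors for the demo
frames = torch.randn(100, N_FEATS)
log_priors = torch.full((N_STATES,), math.log(1.0 / N_STATES))
ll = scaled_log_likelihoods(frames, log_priors)   # shape (100, N_STATES)

In practice the priors are estimated from the state alignments in the training data rather than assumed uniform.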
EMERGING TRENDS
 HMM-free ASRs
 Avoid phoneme prediction and hence the need for a
phoneme database
 Active area of research
 The current state of the art adopted by industry uses DNN-HMM hybrids
 Future ASRs are likely to be fully neural-network based
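One prominent HMM-free direction (my example; the slide does not name a method) is connectionist temporal classification (CTC), which trains a network to emit character sequences directly, marginalizing over all frame alignments so no phoneme-level alignment is needed; a minimal PyTorch sketch with illustrative shapes:

import torch
import torch.nn as nn

T, N, C = 50, 1, 29   # frames, batch, characters (blank=0 plus 28 letters/space)

# Acoustic model: any network mapping frames to per-frame log-probs over characters
rnn = nn.LSTM(input_size=39, hidden_size=128)
proj = nn.Linear(128, C)

frames = torch.randn(T, N, 39)             # dummy utterance
hidden, _ = rnn(frames)
log_probs = proj(hidden).log_softmax(-1)   # shape (T, N, C)

# CTC sums over every alignment of the 5-character target to the T frames
target = torch.tensor([[8, 5, 12, 12, 15]])   # "hello" as 1-indexed letters
loss = nn.CTCLoss(blank=0)(log_probs, target, torch.tensor([T]), torch.tensor([5]))
loss.backward()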