Speech recognition using neural + fuzzy logic

Snehal Patel Soft Computing Research Paper En. No: 090330131025
Speech Recognition using
Neural + Fuzzy Logic
Developers & Professor:
Judith Justin
Professor, Department of Biomedical Instrumentation Engineering
Avinashilingam University, Coimbatore, India
E-mail: judithvjn@yahoo.co.in
Tel: +91-422-2658145; Fax: +91-422-2658997
Ila Vennila
Associate Professor, Department of Electrical and Electronics Engineering
P.S.G. College of Technology, Coimbatore, India
E-mail: iven@eee.psgtech.ac.in
Joe Tebelskis
May 1995
CMU-CS-95-142
School of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213-3890
Veera Ala-Keturi
Helsinki University of Technology
Veera.Ala-Keturi@hut.fi

ABSTRACT
This thesis examines how artificial neural networks can benefit a large vocabulary, speaker
independent, continuous speech recognition system. Currently, most speech recognition Systems
are based on hidden Markov models (HMMs), a statistical framework that supports both acoustic
and temporal modelling. Despite their state-of-the-art performance, HMMs Make a number of
suboptimal modelling assumptions that limit their potential effectiveness. Neural networks avoid
many of these assumptions, while they can also learn complex functions, generalize effectively,
tolerate noise, and support parallelism. While neural networks can readily be applied to acoustic
modelling, it is not yet clear how they can be used for temporal modeling. Therefore, we explore a
class of systems called NN-HMM hybrids, in which neural networks perform acoustic modeling,
and HMMs perform temporal modeling. We argue that a NN-HMM hybrid has several theoretical
advantages over a pure HMM system, including better acoustic modeling accuracy, better context
sensitivity, more natural discrimination and a more economical use of parameters. These
advantages are confirmed experimentally by a NN-HMM hybrid that we developed, based on
context-independent phoneme models, that achieved 90.5% word accuracy on the Resource
Management database, in contrast to only 86.0% accuracy achieved by a pure HMM under similar
conditions.
In this paper, we compare the performance of recognition of short sentences of speech using
Hidden Markov models (HMM) in Artificial Neural Networks (ANN) and Fuzzy Logic. The data sets
used are sentences from The DARPA TIMIT Acoustic- Phonetic Continuous Speech Corpus.
Currently, most speech recognition systems are based on Hidden Markov Models, a statistical
framework that supports both acoustic and temporal modeling. Despite their state-of-the-art
performance, HMMs make a number of suboptimal modeling assumptions that limit their
potential effectiveness. Neural networks avoid many of these assumptions, while they can also
learn complex functions, generalize effectively, tolerate noise, and support parallelism. The
recognition process consists of the Training phase and the Testing (Recognition) phase. The audio
files from the speech corpus are preprocessed and features like Short Time Average Zero Crossing
Rate, Pitch Period, Mel Frequency Cepstral Coefficients (MFCC), Formants and Modulation Index
are extracted. The model database is created from the feature vector using HMM and is trained
with Radial Basis Function Neural Network (RBFNN) algorithm. During recognition the test set
model is obtained which is compared with the database model. The same sets of audio files are
trained for the speech recognition using HMM/Fuzzy and the fuzzy knowledge base is created
using a fuzzy controller. During the recognition phase, the feature vector is compared with the
knowledge base and the recognition is made. From the recognized outputs, the recognition
accuracy (%) is compared and the best performing model is identified. Recognition accuracy (%)
using Radial Basis Function Neural Networks were found to be superior to recognition using Fuzzy.
Keywords: Speech Recognition, Hidden Markov Model, Radial Basis Function Neural
Network, Fuzzy, Sentence Recognition, Recognition Accuracy.

INTRODUCTION
Speech is a natural mode of communication for people. We learn all the relevant skills during early
childhood, without instruction, and we continue to rely on speech communication throughout our
lives. It comes so naturally to us that we don’t realize how complex a phenomenon speech is. The
human vocal tract and articulators are biological organs with nonlinear properties, whose
operation is not just under conscious control but also affected by factors ranging from gender to
upbringing to emotional state. As a result, vocalizations can vary widely in terms of their accent,
pronunciation, articulation, roughness, nasality, pitch, volume, and speed; moreover, during
transmission, our irregular speech patterns can be further distorted by background noise and
echoes, as well as electrical characteristics (if telephones or other electronic equipment are used).
All these sources of variability make speech recognition, even more than speech generation, a very
complex problem.
Yet people are so comfortable with speech that we would also like to interact with our computers
via speech, rather than having to resort to primitive interfaces such as keyboards and pointing
devices. A speech interface would support many valuable applications — for example, telephone
directory assistance, spoken database querying for novice users, “handsbusy” applications in
medicine or fieldwork, office dictation devices, or even automatic voice translation into foreign
languages. Such tantalizing applications have motivated research in automatic speech recognition
since the 1950’s. Yet computers are still nowhere near the level of human performance at speech
recognition, and it appears that further significant advances will require some new insights.
What makes people so good at recognizing speech? Intriguingly, the human brain is known to be
wired differently than a conventional computer; in fact it operates under a radically different
computational paradigm. While conventional computers use a very fast & complex central
processor with explicit program instructions and locally addressable memory, by contrast the
human brain uses a massively parallel collection of slow & simple processing elements (neurons),
densely connected by weights (synapses) whose strengths are modified with experience, directly
supporting the integration of multiple constraints, and providing a distributed form of associative
memory.
The brain’s impressive superiority at a wide range of cognitive skills, including speech recognition,
has motivated research into its novel computational paradigm since the 1940’s, on the assumption
that brain like models may ultimately lead to brain like performance on many complex tasks. This
fascinating research area is now known as connectionism, or the study of artificial neural
networks. The history of this field has been erratic (and laced with hyperbole), but by the mid-
1980’s, the field had matured to a point where it became realistic to begin applying connectionist
models to difficult tasks like speech recognition. By 1990 (when this thesis was proposed), many
researchers had demonstrated the value of neural networks for important subtasks like phoneme
recognition and spoken digit recognition, but it was still unclear whether connectionist techniques
would scale up to large speech recognition tasks.
This thesis demonstrates that neural networks can indeed form the basis for a general purpose
speech recognition system, and that neural networks offer some clear advantages over
conventional techniques.

Speech Recognition
What is the current state of the art in speech recognition? This is a complex question, because a
system’s accuracy depends on the conditions under which it is evaluated: under sufficiently
narrow conditions almost any system can attain human-like accuracy, but it’s much harder to
achieve good accuracy under general conditions. The conditions of evaluation — and hence the
accuracy of any system — can vary along the following dimensions:
 Vocabulary size and confusability. As a general rule, it is easy to discriminate among a
small set of words, but error rates naturally increase as the vocabulary size grows. For
example, the 10 digits “zero” to “nine” can be recognized essentially perfectly (Doddington
1989), but vocabulary sizes of 200, 5000, or 100000 may have error rates of 3%, 7%, or 45%
(Itakura 1975, Miyatake 1990, Kimura 1990). On the other hand, even a small vocabulary
can be hard to recognize if it contains confusable words. For example, the 26 letters of the
English alphabet (treated as 26 “words”) are very difficult to discriminate because they
contain so many confusable words (most notoriously, the E-set: “B, C, D, E, G, P, T, V, Z”);
an 8% error rate is considered good for this vocabulary (Hild & Waibel 1993).
 Speaker dependence vs. independence. By definition, a speaker dependent system is
intended for use by a single speaker, but a speaker independent system is intended for use
by any speaker. Speaker independence is difficult to achieve because a system’s
parameters become tuned to the speaker(s) that it was trained on, and these parameters
tend to be highly speaker-specific. Error rates are typically 3 to 5 times higher for speaker
independent systems than for speaker dependent ones (Lee 1988). Intermediate between
speaker dependent and independent systems, there are also multi-speaker systems
intended for use by a small group of people, and speaker-adaptive systems which tune
themselves to any speaker given a small amount of their speech as enrolment data.
 Isolated, discontinuous, or continuous speech. Isolated speech means single words;
discontinuous speech means full sentences in which words are artificially separated by
silence; and continuous speech means naturally spoken sentences. Isolated and
discontinuous speech recognition is relatively easy because word boundaries are
detectable and the words tend to be cleanly pronounced. Continuous speech is more
difficult, however, because word boundaries are unclear and their pronunciations are more
corrupted by articulation, or the slurring of speech sounds, which for example causes a
phrase like “could you” to sound like “could you”. In a typical evaluation, the word error
rates for isolated and continuous speech were 3% and 9%, respectively (Bahl et al 1981).
 Task and language constraints. Even with a fixed vocabulary, performance will vary with
the nature of constraints on the word sequences that are allowed during recognition. Some
constraints may be task-dependent (for example, an airlinequerying application may
dismiss the hypothesis “The apple is red”); other constraints may be semantic (rejecting
“The apple is angry”), or syntactic (rejecting “Red is apple the”). Constraints are often
represented by a grammar, which ideally filters out unreasonable sentences so that the
speech recognizer evaluates only plausible sentences. Grammars are usually rated by their
perplexity, a number that indicates the grammar’s average branching factor (i.e., the
number of words that can follow any given word). The difficulty of a task is more reliably
measured by its perplexity than by its vocabulary size.

 Read vs. spontaneous speech. Systems can be evaluated on speech that is either read
from prepared scripts, or speech that is uttered spontaneously. Spontaneous speech is
vastly more difficult, because it tends to be peppered with disfluencies like “uh” and “um”,
false starts, incomplete sentences, stuttering, coughing, and laughter; and moreover, the
vocabulary is essentially unlimited, so the system must be able to deal intelligently with
unknown words (e.g., detecting and flagging their presence, and adding them to the
vocabulary, which may require some interaction with the user).
 Adverse conditions. A system’s performance can also be degraded by a range of adverse
conditions (Furui 1993). These include environmental noise (e.g., noise in a car or a
factory); acoustical distortions (e.g, echoes, room acoustics); different microphones (e.g.,
close-speaking, omnidirectional, or telephone); limited frequency bandwidth (in telephone
transmission); and altered speaking manner (shouting, whining, speaking quickly, etc.).
In order to evaluate and compare different systems under well-defined conditions, a number of
standardized databases have been created with particular characteristics. For example, one
database that has been widely used is the DARPA Resource Management database — a large
vocabulary (1000 words), speaker-independent, continuous speech database, consisting of 4000
training sentences in the domain of naval resource management, read from a script and recorded
under benign environmental conditions; testing is usually performed using a grammar with a
perplexity of 60. Under these controlled conditions, state-of-the-art performance is about 97%
word recognition accuracy (or less for simpler systems). We used this database, as well as two
smaller ones, in our own research.
The central issue in speech recognition is dealing with variability. Currently, speech recognition
systems distinguish between two kinds of variability: acoustic and temporal. Acoustic variability
covers different accents, pronunciations, pitches, volumes, and so on, while temporal variability
covers different speaking rates. These two dimensions are not completely independent — when a
person speaks quickly, his acoustical patterns become distorted as well — but it’s a useful
simplification to treat them independently.
Of these two dimensions, temporal variability is easier to handle. An early approach to temporal
variability was to linearly stretch or shrink (“warp”) an unknown utterance to the duration of a
known template. Linear warping proved inadequate, however, because utterances can accelerate
or decelerate at any time; instead, nonlinear warping was obviously required. Soon an efficient
algorithm known as Dynamic Time Warping was proposed as a solution to this problem. This
algorithm (in some form) is now used in virtually every speech recognition system, and the
problem of temporal variability is considered to be largely solved1.
Acoustic variability is more difficult to model, partly because it is so heterogeneous in nature.
Consequently, research in speech recognition has largely focused on efforts to model acoustic
variability. Past approaches to speech recognition have fallen into three main categories:
 Template-based approaches, in which unknown speech is compared against a set of
prerecorded words (templates), in order to find the best match. This has the advantage of
using perfectly accurate word models; but it also has the disadvantage that the

prerecorded templates are fixed, so variations in speech can only be modeled by using
many templates per word, which eventually becomes impractical.
 Knowledge-based approaches, in which “expert” knowledge about variations in speech is
hand-coded into a system. This has the advantage of explicitly modeling variations in
speech; but unfortunately such expert knowledge is difficult to obtain and use successfully,
so this approach was judged to be impractical, and automatic learning procedures were
sought instead.
 Statistical-based approaches, in which variations in speech are modeled statistically (e.g.,
by Hidden Markov Models, or HMMs), using automatic learning procedures. This approach
represents the current state of the art. The main disadvantage of statistical models is that
they must make a priori modeling assumptions, which are liable to be inaccurate,
handicapping the system’s performance. We will see that neural networks help to avoid
this problem.

Review of Speech Recognition
In this chapter we will present a brief review of the field of speech recognition. After reviewing
some fundamental concepts, we will explain the standard Dynamic Time Warping algorithm, and
then discuss Hidden Markov Models in some detail, offering a summary of the algorithms,
variations, and limitations that are associated with this dominant technology.
Fundamentals of Speech Recognition
Speech recognition is a multileveled pattern recognition task, in which acoustical signals are
examined and structured into a hierarchy of subword units (e.g., phonemes), words, phrases, and
sentences. Each level may provide additional temporal constraints, e.g., known word
pronunciations or legal word sequences, which can compensate for errors or uncertainties at
lower levels. This hierarchy of constraints can best be exploited by combining decisions
probabilistically at all lower levels, and making discrete decisions only at the highest level.
The structure of a standard speech recognition system is illustrated in Figure 2.1. The elements are
as follows:
 Raw speech. Speech is typically sampled at a high frequency, e.g., 16 KHz over a
microphone or 8 KHz over a telephone. This yields a sequence of amplitude values over
time.
 Signal analysis. Raw speech should be initially transformed and compressed, in order to
simplify subsequent processing. Many signal analysis techniques are available which can
extract useful features and compress the data by a factor of ten without losing any
important information. Among the most popular:
o Fourier analysis (FFT) yields discrete frequencies over time, which can be
interpreted visually. Frequencies are often distributed using a Mel scale, which is
linear in the low range but logarithmic in the high range, corresponding to
physiological characteristics of the uman ear.
o Perceptual Linear Prediction (PLP) is also physiologically motivated, but yields
coefficients that cannot be interpreted visually.
o Linear Predictive Coding (LPC) yields coefficients of a linear equation that
approximate the recent history of the raw speech values.
o Cepstral analysis calculates the inverse Fourier transform of the logarithm of the
power spectrum of the signal.
In practice, it makes little difference which technique is used1. Afterwards, procedures such as
Linear Discriminant Analysis (LDA) may optionally be applied to further reduce the dimensionality
of any representation, and to decorrelate the coefficients.

 Speech frames. The result of signal analysis is a sequence of speech frames, typically at 10
msec intervals, with about 16 coefficients per frame. These frames may be augmented by
their own first and/or second derivatives, providing explicit information about speech
dynamics; this typically leads to improved performance. The speech frames are used for
acoustic analysis.

 Acoustic models. In order to analyze the speech frames for their acoustic content, we need
a set of acoustic models. There are many kinds of acoustic models, varying in their
representation, granularity, context dependence, and other properties.
Figure 2.3 shows two popular representations for acoustic models. The simplest is a template,
which is just a stored sample of the unit of speech to be modeled, e.g., a recording of a word. An
unknown word can be recognized by simply comparing it against all known templates, and finding
the closest match. Templates have two major drawbacks: (1) they cannot model acoustic
variabilities, except in a coarse way by assigning multiple templates to each word; and (2) in
practice they are limited to whole-word models, because it’s hard to record or segment a sample
shorter than a word — so templates are useful only in small systems which can afford the luxury of
using whole-word models. A more flexible representation, used in larger systems, is based on
trained acoustic models, or states. In this approach, every word is modeled by a sequence of
trainable states, and each state indicates the sounds that are likely to be heard in that segment of
the word, using a probability distribution over the acoustic space. Probability distributions can be
modelled parametrically, by assuming that they have a simple shape (e.g., a Gaussian distribution)
and then trying to find the parameters that describe it; or non-parametrically, by representing the
distribution directly (e.g., with a histogram over a quantization of the acoustic space, or, as we
shall see, with a neural network).

Acoustic models also vary widely in their granularity and context sensitivity. Figure 2.4 shows a
chart of some common types of acoustic models, and where they lie along these dimensions. As
can be seen, models with larger granularity (such as word or syllable models) tend to have greater
context sensitivity. Moreover, models with the greatest context sensitivity give the best word
recognition accuracy —if those models are well trained. Unfortunately, the larger the granularity
of a model, the poore it will be trained, because fewer samples will be available for training it. For
this reason, word and syllable models are rarely used in highperformance systems; much more
common are triphone or generalized triphone models. Many systems also use monophone models
(sometimes simply called phoneme models), because of their relative simplicity.
During training, the acoustic models are incrementally modified in order to optimize the overall
performance of the system. During testing, the acoustic models are left unchanged.
 Acoustic analysis and frame scores. Acoustic analysis is performed by applying each
acoustic model over each frame of speech, yielding a matrix of frame scores, as shown in
Figure 2.5. Scores are computed according to the type of acoustic model that is being used.
For template-based acoustic models, a score is typically the Euclidean distance between a
template’s frame and an unknown frame. For state-based acoustic models, a score
represents an emission probability, i.e., the likelihood of the current state generating the
current frame, as determined by the state’s parametric or non-parametric function.
Time alignment. Frame scores are converted to a word sequence by identifying a sequence of
acoustic models, representing a valid word sequence, which gives the best total score along an

alignment path through the matrix1, as illustrated in Figure 2.5. The process of searching for the
best alignment path is called time alignment.
An alignment path must obey certain sequential constraints which reflect the fact that speech
always goes forward, never backwards. These constraints are manifested both within and
between words. Within a word, sequential constraints are implied by the sequence of frames (for
template-based models), or by the sequence of states (for state-based models) that comprise the
word, as dictated by the phonetic pronunciations in a dictionary, for example. Between words,
sequential constraints are given by a grammar, indicating what words may follow what other
words.
Time alignment can be performed efficiently by dynamic programming, a general algorithm which
uses only local path constraints, and which has linear time and space requirements. (This general
algorithm has two main variants, known as Dynamic Time Warping (DTW) and Viterbi search,
which differ slightly in their local computations and in their optimality criteria.)
In a state-based system, the optimal alignment path induces a segmentation on the word
sequence, as it indicates which frames are associated with each state. This segmentation can be
used to generate labels for recursively training the acoustic models on corresponding frames.

Dynamic Time Warping
In this section we motivate and explain the Dynamic Time Warping algorithm, one of the oldest
and most important algorithms in speech recognition (Vintsyuk 1971, Itakura 1975, Sakoe and
Chiba 1978).
The simplest way to recognize an isolated word sample is to compare it against a number of stored
word templates and determine which is the “best match”. This goal is complicated by a number of
factors. First, different samples of a given word will have somewhat different durations. This
problem can be eliminated by simply normalizing the templates and the unknown speech so that
they all have an equal duration. However, another problem is that the rate of speech may not be
constant throughout the word; in other words, the optimal alignment between a template and the
speech sample may be nonlinear. Dynamic Time Warping (DTW) is an efficient method for finding
this optimal nonlinear alignment.
DTW is an instance of the general class of algorithms known as dynamic programming. Its time
and space complexity is merely linear in the duration of the speech sample and the vocabulary
size. The algorithm makes a single pass through a matrix of frame scores while computing locally
optimized segments of the global alignment path. (See Figure 2.6.) If D(x,y) is the Euclidean
distance between frame x of the speech sample and frame y of the reference template, and if
C(x,y) is the cumulative score along an optimal alignment path that leads to (x,y), then
The resulting alignment path may be visualized as a low valley of Euclidean distance scores,
meandering through the hilly landscape of the matrix, beginning at (0, 0) and ending at the final
point (X, Y). By keeping track of backpointers, the full alignment path can be recovered by tracing
backwards from (X, Y). An optimal alignment path is computed for each reference word template,
and the one with the lowest cumulative score is considered to be the best match for the unknown
speech sample.
There are many variations on the DTW algorithm. For example, it is common to vary the local path
constraints, e.g., by introducing transitions with slope 1/2 or 2, or weighting the transitions in
various ways, or applying other kinds of slope constraints (Sakoe and Chiba 1978). While the
reference word models are usually templates, they may be state-based models (as shown
previously in Figure 2.5). When using states, vertical transitions are often disallowed (since there
are fewer states than frames), and often the goal is to maximize the cumulative score, rather than
to minimize it.
A particularly important variation of DTW is an extension from isolated to continuous speech. This
extension is called the One Stage DTW algorithm (Ney 1984). Here the goal is to find the optimal
alignment between the speech sample and the best sequence of reference words (see Figure 2.5).
The complexity of the extended algorithm is still linear in the length of the sample and the
vocabulary size. The only modification to the basic DTW algorithm Is that at the beginning of each

reference word model (i.e., its first frame or state), the diagonal path is allowed to point back to
the end of all reference word models in the preceding frame.
Local backpointers must specify the reference word model of the preceding point, so that the
optimal word sequence can be recovered by tracing backwards from the final point ( W, X, Y) of
the word W with the best final score. Grammars can be imposed on continuous speech recognition
by restricting the allowed transitions at word boundaries.

Pre-processing and Feature Extraction
Human speech can be represented as an analog wave that varies over time. The height of the
wave represents intensity (loudness), and the shape of the wave represents frequency (pitch). The
properties of the speech signal changes relatively slowly with time. This allows examination of a
Short-time window of speech to extract parameters presumed to remain fixed for the duration of
the window. The signal must be divided into successive windows or analysis frames so that the
parameters can be calculated often enough to follow the relevant changes. The result of signal
analysis is a sequence of speech frames. To extract the features from the speech signal, the signal
must be preprocessed and divided into successive windows or analysis frames.
Each sentence was taken through different stages of preprocessing which included Preemphasis,
Frame Processing and Windowing [5, 8]. The higher frequencies of the speech signal are
generally weak. As a result there may not be high frequency energy present to extract features at
the upper end of the frequency range. Pre-emphasis is used to boost the energy of the high
frequency signals. Frame blocking is a process adopted to split the speech signal into frames
The speech samples are segmented into 32 ms frames with each frame having 50% overlap
with the adjacent frames. The next step in preprocessing is to window each individual frame so as
to minimize the signal discontinuities at the beginning and end of the frame. To minimize the
signal discontinuities Hamming window is used which has the form.
The purpose of feature extraction is to represent the speech signal by a finite number of
measures of the signal. It gives the invariant representations in the signal. The features selected
are the Short Time Average Zero Crossing Rate [7], Pitch Period Computation, Mel Frequency
Cepstral Coefficients (MFCC), Formants and Modulation Index. The more features we use, the
better the representation.

Hidden Markov Models
The most flexible and successful approach to speech recognition so far has been Hidden Markov
Models (HMMs). In this section we will present the basic concepts of HMMs, describe the
algorithms for training and using them, discuss some common variations, and review the problems
associated with HMMs.
A Hidden Markov Model is a collection of states connected by transitions, as illustrated in Figure
2.7. It begins in a designated initial state. In each discrete time step, a transition is taken into a
new state, and then one output symbol is generated in that state. The choice of transition and
output symbol are both random, governed by probability distributions. The HMM can be thought
of as a black box, where the sequence of output symbols generated over time is observable, but
the sequence of states visited over time is hidden from view. This is why it’s called a Hidden
Markov Model.
HMMs have a variety of applications. When an HMM is applied to speech recognition, the states
are interpreted as acoustic models, indicating what sounds are likely to be heard during their
corresponding segments of speech; while the transitions provide temporal constrints, indicating
how the states may follow each other in sequence. Because speech always goes forward in time,
transitions in a speech application always go forward (or make a self-loop, allowing a state to have
arbitrary duration). Figure 2.8 illustrates how states and transitions in an HMM can be structured
hierarchically, in order to represent phonemes, words, and sentences.
Formally, an HMM consists of the following elements:
{s} = A set of states.
{aij} = A set of transition probabilities, where aij is the probability of taking the transition from state
i to state j.
{bi(u)} = A set of emission probabilities, where bi is the probability distribution over the acoustic
space describing the likelihood of emitting1 each possible sound u while in state i.

Since a and b are both probabilities, they must satisfy the following properties:
In using this notation we implicitly confine our attention to First-Order HMMs, in which a and b
depend only on the current state, independent of the previous history of the state sequence. This
assumption, almost universally observed, limits the number of trainable parameters and makes
the training and testing algorithms very efficient, rendering HMMs useful for speech recognition.
Algorithms
There are three basic algorithms associated with Hidden Markov Models:
• the forward algorithm, useful for isolated word recognition;
• the Viterbi algorithm, useful for continuous speech recognition; and
• the forward-backward algorithm, useful for training an HMM.

The Forward Algorithm
In order to perform isolated word recognition, we must be able to evaluate the probability that a
given HMM word model produced a given observation sequence, so that we can compare the
scores for each word model and choose the one with the highest score. More formally: given an
HMM model M, consisting of {s}, {aij}, and {bi(u)}, we must compute the probability that it
generated the output sequence = (y1, y2, y3, ..., yT). Because every state i can generate each output
symbol u with probability bi(u), every state sequence of length T contributes something to the
total probability. A brute force algorithm would simply list all possible state sequences of length T,
and accumulate their probabilities of generating ; but this is clearly an exponential algorithm, and
is not practical.
A much more efficient solution is the Forward Algorithm, which is an instance of the class of
algorithms known as dynamic programming, requiring computation and storage that are only
linear in T. First, we define aj(t) as the probability of generating the partial sequence , ending up in
state j at time t. aj(t=0) is initialized to 1.0 in the initial state, and 0.0 in all other states. If we have
already computed ai(t-1) for all i in the previous time frame t-1, then aj(t) can be computed
recursively in terms of the incremental probability of entering state j from each i while generating
the output symbol yt (see Figure 2.9):
If F is the final state, then by induction we see that aF(T) is the probability that the HMM generated
the complete output sequence .
Figure 2.10 shows an example of this algorithm in operation, computing the probability that the
output sequence =(A,A,B) could have been generated by the simple HMM presented earlier. Each
cell at (t,j) shows the value of aj(t), using the given values of a and b. The computation proceeds
from the first state to the last state within a time frame, before proceeding to the next time frame.
In the final cell, we see that the probability that this particular HMM generates the sequence
(A,A,B) is .096.

The Viterbi Algorithm
While the Forward Algorithm is useful for isolated word recognition, it cannot be applied to
continuous speech recognition, because it is impractical to have a separate HMM for each possible
sentence. In order to perform continuous speech recognition, we should instead infer the actual
sequence of states that generated the given observation sequence; from the state sequence we
can easily recover the word sequence. Unfortunately the actual state sequence is hidden (by
definition), and cannot be uniquely identified; after all, any path could have produced this output
sequence, with some small probability. The best we can do is to find the one state sequence that
was most likely to have generated the observation sequence. As before, we could do this by
evaluating all possible state sequences and reporting the one with the highest probability, but this
would be an exponential and hence infeasible algorithm.
A much more efficient solution is the Viterbi Algorithm, which is again based on dynamic
programming. It is very similar to the Forward Algorithm, the main difference being that
instead of evaluating a summation at each cell, we evaluate the maximum:
This implicitly identifies the single best predecessor state for each cell in the matrix. If we explicitly
identify that best predecessor state, saving a single backpointer in each cell in the matrix, then by
the time we have evaluated vF(T) at the final state at the final time frame, we can retrace those
backpointers from the final cell to reconstruct the whole state sequence. Figure 2.11 illustrates
this process. Once we have the state sequence (i.e., an alignment path), we can trivially recover
the word sequence.

The Forward-Backward Algorithm
In order to train an HMM, we must optimize a and b with respect to the HMM’s likelihood of
generating all of the output sequences in the training set, because this will maximize the HMM’s
chances of also correctly recognizing new data. Unfortunately this is a difficult roblem; it has no
closed form solution. The best that can be done is to start with some initial values for a and b, and
then to iteratively modify a and b by reestimating and improving them, until some stopping
criterion is reached. This general method is called Estimation- Maximization (EM). A popular
instance of this general method is the Forward-Backward Algorithm (also known as the Baum-
Welch Algorithm), which we now describe.
Previously we defined aj(t) as the probability of generating the partial sequence and ending up in
state j at time t. Now we define its mirror image, bj(t), as the probability of generating the
remainder of the sequence , starting from state j at time t. aj(t) is called the forward term, while
bj(t) is called the backward term. Like aj(t), bj(t) can be computed recursively, but this time in a
backward direction (see Figure 2.12):
This recursion is initialized at time T by setting bk(T) to 1.0 for the final state, and 0.0 for all other
states.
Now we define gij(t) as the probability of transitioning from state i to state j at time t, given

that the whole output sequence has been generated by the current HMM:
The numerator in the final equality can be understood by consulting Figure 2.13. The denominator
reflects the fact that the probability of generating equals the probability of generating while
ending up in any of k final states.
Now let us define N(ij) as the expected number of times that the transition from state
i to state j is taken, from time 1 to T:

Limitations of HMMs
Despite their state-of-the-art performance, HMMs are handicapped by several well-known
weaknesses, namely:
• The First-Order Assumption — which says that all probabilities depend solely on the current
state — is false for speech applications. One consequence is that HMMs have difficulty modeling
coarticulation, because acoustic distributions are in fact strongly affected by recent state history.
Another consequence is that durations are modeled inaccurately by an exponentially decaying
distribution, rather than by a more accurate Poisson or other bell-shaped distribution.
• The Independence Assumption — which says that there is no correlation between adjacent input
frames — is also false for speech applications. In accordance with this assumption, HMMs examine
only one frame of speech at a time. In order to benefit from the context of neighboring frames,
HMMs must absorb those frames into the current frame (e.g., by introducing multiple streams of
data in order to exploit delta coefficients, or using LDA to transform these streams into a single
stream).
• The HMM probability density models (discrete, continuous, and semi-continuous) have
suboptimal modeling accuracy. Specifically, discrete density HMMs suffer from quantization
errors, while continuous or semi-continuous density HMMs suffer from model mismatch, i.e., a
poor match between their a priori choice of statistical model (e.g., a mixture of K Gaussians) and
the true density of acoustic space.
• The Maximum Likelihood training criterion leads to poor discrimination between the acoustic
models (given limited training data and correspondingly limited models). Discrimination can be
improved using the Maximum Mutual Information training criterion, but this is more complex and
difficult to implement properly.
Because HMMs suffer from all these weaknesses, they can obtain good performance only by
relying on context dependent phone models, which have so many parameters that they must be
extensively shared — and this, in turn, calls for elaborate mechanisms such as senones and
decision trees (Hwang et al, 1993b). We will argue that neural networks mitigate each of the
above weaknesses (except the First Order Assumption), while they require relatively few
parameters, so that a neural network based speech recognition system can get equivalent or
better performance with less complexity.

Conclusions
This dissertation has addressed the question of whether neural networks can serve as a useful
foundation for a large vocabulary, speaker independent, continuous speech recognition system.
We succeeded in showing that indeed they can, when the neural networks are used carefully and
thoughtfully.
Neural Networks as Acoustic Models
A speech recognition system requires solutions to the problems of both acoustic modeling and
temporal modeling. The prevailing speech recognition technology, Hidden Markov Models, offers
solutions to both of these problems: acoustic modeling is provided by discrete, continuous, or
semicontinuous density models; and temporal modeling is provided by states connected by
transitions, arranged into a strict hierarchy of phonemes, words, and sentences.
While an HMM’s solutions are effective, they suffer from a number of drawbacks. Specifically, the
acoustic models suffer from quantization errors and/or poor parametric modeling assumptions;
the standard Maximum Likelihood training criterion leads to poor discrimination between the
acoustic models; the Independence Assumption makes it hard to exploit multiple input frames;
and the First-Order Assumption makes it hard to model coarticulation and duration. Given that
HMMs have so many drawbacks, it makes sense to consider alternative solutions.
Neural networks — well known for their ability to learn complex functions, generalize effectively,
tolerate noise, and support parallelism — offer a promising alternative. However, while today’s
neural networks can readily be applied to static or temporally localized pattern recognition tasks,
we do not yet clearly understand how to apply them to dynamic, temporally extended pattern
recognition tasks. Therefore, in a speech recognition system, it currently makes sense to use
neural networks for acoustic modeling, but not for temporal modeling. Based on these
considerations, we have investigated hybrid NN-HMM systems, in which neural networks are
responsible for acoustic modeling, and HMMs are responsible for temporal modeling.

Summary of Experiments
We explored two different ways to use neural networks for acoustic modeling. The first was a
novel technique based on prediction (Linked Predictive Neural Networks, or LPNN), in which each
phoneme class was modeled by a separate neural network, and each network tried to predict the
next frame of speech given some recent frames of speech; the prediction errors were used to
perform a Viterbi search for the best state sequence, as in an HMM. We found that this approach
suffered from a lack of discrimination between the phoneme classes, as all of the networks
learned to perform a similar quasi-identity mapping between the quasi-stationary frames of their
respective phoneme classes.
The second approach was based on classification, in which a single neural network tried
to classify a segment of speech into its correct class. This approach proved much more successful,
as it naturally supports discrimination between phoneme classes. Within this framework,
we explored many variations of the network architecture, input representation, speech
model, training procedure, and testing procedure.

Advantages of NN-HMM hybrids
Finally, NN-HMM hybrids offer several theoretical advantages over standard HMM speech
recognizers. Specifically:
• Modeling accuracy. Discrete density HMMs suffer from quantization errors in their input space,
while continuous or semi-continuous density HMMs suffer from model mismatch, i.e., a poor
match between the a priori choice of statistical model (e.g., a mixture of K Gaussians) and the true
density of acoustic space. By contrast, neural networks are nonparametric models that neither
suffer from quantization error nor make detailed assumptions about the form of the distribution
to be modeled. Thus a neural network can form more accurate acoustic models than an HMM.
• Context sensitivity. HMMs assume that speech frames are independent of each other, so they
examine only one frame at a time. In order to take advantage of contextual information in
neighboring frames, HMMs must artificially absorb those frames into the current frame (e.g., by
introducing multiple streams of data in order to exploit delta coefficients, or using LDA to
transform these streams into a single stream). By contrast, neural networks can naturally
accommodate any size input window, because the number of weights required in a network
simply grows linearly with the number of inputs. Thus a neural network is naturally more context
sensitive than an HMM.
• Discrimination. The standard HMM training criterion, Maximum Likelihood, does not explicitly
discriminate between acoustic models, hence the models are not optimized for the essentially
discriminative task of word recognition. It is possible to improve discrimination in an HMM by
using the Maximum Mutual Information criterion, but this is more complex and difficult to
implement properly. By contrast, discrimination is a natural property of neural networks when
they are trained to perform classification. Thus a neural network can discriminate more naturally
than an HMM.
• Economy. An HMM uses its parameters to model the surface of the density function in acoustic
space, in terms of the likelihoods P(input|class). By contrast, a neural network uses its parameters
to model the boundaries between acoustic classes, in terms of the posteriors P(class|input). Either
surfaces or boundaries can be used for classifying speech, but boundaries require fewer
parameters and thus can make better use of limited training data. For example, we have achieved
125,000 parameters (Renals et al 1992). Thus a neural network is more economical than an HMM.
HMMs are also known to be handicapped by their First-Order Assumption, i.e., the
assumption that all probabilities depend solely on the current state, independent of previous
history; this limits the HMM’s ability to model coarticulatory effects, or to model durations
accurately. Unfortunately, NN-HMM hybrids share this handicap, because the First-Order
Assumption is a property of the HMM temporal model, not of the NN acoustic model. We believe
that further research into connectionism could eventually lead to new and powerful techniques
for temporal pattern recognition based on neural networks. If and when that happens, it may
become possible to design systems that are based entirely on neural networks, potentially further
advancing the state of the art in speech recognition.

Speech recognition using neural + fuzzy logic

More Related Content

What's hot

Viewers also liked

Similar to Speech recognition using neural + fuzzy logic

Recently uploaded

Speech recognition using neural + fuzzy logic