Contents
- How we speak
- Speech sounds
- Continuous normal speech
- Electronic capture of speech
- Extracting acoustic information from speech
- Speech recognition with dynamic time-warping
- Speech recognition with hidden Markov modeling
- Neural networks
- References
How we speak
We
speak by using combinations of our lungs, windpipe,
larynx, and oral and nasal tracts. Vocal cords
are two folds of skin on the larynx
that blow apart and come together as we force air
through the slit (glottis) between
them. The oral tract is an irregular
tube terminated at the front by lips and at the back
by the larynx. Its cross-sectional area is varied
by muscular control of the lips, tongue, jaw and velum.
The nasal tract is a non-uniform acoustic
tube of fixed volume and length terminated at the
front by nostrils and the rear by a movable flap of
skin (velum). The velum controls acoustic
coupling between the oral and nasal tracts. Air expelled
from the lungs and forced along the trachea and through
the glottis can be controlled in different ways to
produce voiced, unvoiced,
and plosive sounds.
Voiced
sounds such as aah or oh
are produced when the vocal cords are tensed together
and vibrate as the air pressure builds up, forcing
the glottis open, and then subsides as the air passes
through. This vibration has a frequency spectrum rich
in harmonics at multiples of the fundamental frequency
(pitch). Speakers vary pitch with air pressure in
the lungs and tension on the vocal cords.
Unvoiced
sounds may be fricative or aspirated. Vocal
cords do not vibrate for either. Fricative sounds
such as s or sh
are generated at a point of constriction in the vocal tract. As
air is forced past the constriction, turbulence occurs, causing
random noise. Since the points of constriction tend
to occur near the front of the mouth, the resonances
of the vocal tract have little effect on the sound being
produced. In aspirated sounds, such
as the h of hello,
turbulent airflow occurs at the glottis as the vocal
cords are held slightly apart. Resonances of the vocal
tract modulate the spectrum of the random noise as
heard in whispered speech.
Plosive
sounds, such as the puh
sound at the beginning of the word pin
or the duh sound at the
beginning of din, are created when the
vocal tract is closed at some point, allowing air
pressure to build up before it is suddenly released.
This transient excitation may occur with or without
vocal cord vibration.
Resonances
produced by the tubular shape of the vocal tract are
called formants. The vocal tract may
assume many different shapes giving rise to different
resonant or formant frequency values (sounds). Formant
frequencies are constantly changing in continuous
speech.
Speech sounds
Speech
technologists consider speech (in some languages)
as being a sequence of basic sound units called phonemes.
Phonemes, however, may not be directly observed in
a speech signal. A phoneme may have many different
sounds or allophones depending on its
surrounding phonemes. Different individuals may produce
the same string of phonemes to convey the same information,
yet they sound different as a result of variations
in dialect, accent and physiology. Phonemes correspond
directly with articulatory positions and movement
(articulatory gestures). Articulatory
gestures may be either static or dynamic events. Phonemes
or articulatory gestures in the English language are
classified into vowels, semivowels, liquids, diphthongs,
nasals, etc.
Vowel
gestures are produced with static articulators. Sound
radiates from the mouth with no nasal coupling. Tongue
shape remains fairly fixed and each vowel is characterized
by the forward/backward and raised/lowered positions
of the tongue. Vowels may be classified as front,
as in feet, did,
red, and mat, middle,
as in heard, cut,
the, or back, as in card,
cod, board, wood,
and rude, depending on the forward-backward
positioning of the tongue during articulation. Acoustically,
each vowel is characterized by the values of the first
three or four resonances (formants) of the vocal tract.
Semivowels consist of glides and liquids.
Glides, e.g. the initial sounds of went and
you, are, like diphthongs, dynamic
sounds, except that the articulators move much more rapidly
from one static vowel position to another. Liquids,
e.g. the initial sounds of let and ran, are
static gestures with the oral tract partially closed
at some point.
Diphthongs
are a combination of two vowel sounds. Although similar
to vowels, the gesture is created when the articulators
move slowly from one static vowel position to another.
Nasals
such as found in man, now
and sing are produced by vocal-cord
excitation with the vocal tract totally constricted
at some point along the oral passageway and with the
velum lowered so that sound is radiated from the nostrils.
Fricatives
are sounds produced when turbulent air-flow occurs at
a point of constriction in the vocal tract. The point
of constriction occurs near the front of the mouth and
its exact location characterizes the particular fricative
sound produced. Sound is radiated from the lips via
the front cavity. The back cavity traps energy at certain
frequencies and so introduces anti-resonances into the
perceived sound. Unvoiced fricatives,
as in fat, thin, sit,
and ship, are produced without vocal-cord
vibration whereas in their voiced counterparts, van,
this, zoo, and azure,
the vocal cords are vibrating. The phoneme /h/, as in hat,
may be regarded as an unvoiced fricative even though
it does not have a voiced counterpart. It is produced
with turbulent excitation at the glottis and with the
articulatory position of the vowel which succeeds it.
Plosives are generated by forming a complete
closure of the vocal tract, allowing air pressure to
build up, and then suddenly releasing it. The presence or
absence of vocal-cord vibration distinguishes the voiced
stops, bad, din, and
gone, from their unvoiced counterparts,
pin, ton, and kill.
The point of closure in the vocal tract determines the
voiced/unvoiced plosive produced. Plosives are characterized
by transient bursts of energy and as a result their
properties are highly influenced by the sounds which
precede or succeed them.
Affricates
are either unvoiced as in the word church
or voiced as in the word judge.
These sounds are produced when a stop and fricative
consonant are both shortened and combined.
Continuous normal speech
Target
articulatory positions for many of the gestures will
never be actually reached. As a particular gesture
is being produced, the next is already being anticipated
and this modifies the way in which the articulators
move. To a lesser extent, an articulatory gesture
may also depend on the preceding one. This phenomenon
is known as co-articulation and results
in a smearing of sounds into one another. It also
gives rise to an effect called allophonic variation,
that is, each phoneme may have many different allophones
depending on its surrounding phonemes. For example,
the l sound in the word leap
and the word lawn are the same phoneme,
yet are different allophones. Co-articulation is mainly
responsible for making speech sound natural rather
than stilted. Superimposed on the basic sequence
of gestures are variations in intonation or pitch,
rhythm or timing, and intensity or loudness. These
variations in an utterance are collectively known as
prosody and can greatly influence its
meaning.
Intonation
and rhythm can be combined to emphasize certain gestures
and produce an effect called stress.
For example, in the sentence I don't have a
black pen but I have a blue one
stress would be put on the words black
and blue. Rhythm on its own can affect
the syntax of a word and helps to distinguish nouns
from verbs, such as extract vs. extract,
and adjectives from verbs, such as separate
vs. separate. Intonation can have an effect
on meaning too. Pitch very often rises towards the
end of a question, the extent and abruptness of the
rise depending on whether actual information or confirmation
is required in the answer. For example, contrast the
utterances It's a nice day, isn't it?
and This is the right train, isn't it?.
In the first case the pitch does not rise so significantly
at the end of the sentence whereas in the second case,
where information is required, the pitch rise is much
greater and more abrupt. Very often in normal speech
the prosodic pattern is used to convey emotional state
and attitude of the speaker. Through our tone
of voice we can indicate sarcasm, anger, joy,
sadness, fear etc.
Electronic capture of speech
Speech
is captured by a sound-responsive element in a microphone
that converts variable sound pressure into equivalent
variations of an electrical signal, i.e. current or
voltage. This analog signal is then sampled and quantized
into a digital bit stream (format). Sampling is the
process of obtaining values of the analog signal at
discrete instants of time separated by the sample period T, and quantization
refers to the conversion of the amplitude at each
sampling instant into a discrete binary number with
a specified bit-length. This two-stage process is
sometimes referred to as pulse code modulation (PCM).
The number of samples per second (sampling frequency) S
in Hz is equal to the reciprocal of the sample period,
i.e. S = 1/T. The sampling theorem, usually attributed to Nyquist
(Kotelnikov and Shannon also contributed), states
that the sampling frequency must be at least twice
the highest frequency component present in the signal.
If fewer samples are used, a phenomenon known as aliasing
occurs, in which a high-frequency component appears as a
spurious lower-frequency signal upon reconstruction.
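The aliasing effect described above can be illustrated with a minimal sketch. The 8 kHz sampling rate and the pure sine-wave test tones are illustrative assumptions, not values prescribed by the text, and the function names are hypothetical. A 7 kHz tone sampled at 8 kHz produces, apart from a sign inversion, the same samples as a 1 kHz tone, so the two cannot be told apart after sampling.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Sample sin(2*pi*f*t) at instants t = n*T, with T = 1/S."""
    period = 1.0 / sample_rate_hz
    return [math.sin(2.0 * math.pi * freq_hz * n * period)
            for n in range(n_samples)]

def alias_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency of a sampled tone, folded into the range 0..S/2."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

SAMPLE_RATE = 8000.0  # illustrative sampling rate (assumption)
tone_7k = sample_sine(7000.0, SAMPLE_RATE, 16)
tone_1k = sample_sine(1000.0, SAMPLE_RATE, 16)

# The 7 kHz samples equal the 1 kHz samples with inverted sign,
# so their sum is zero up to floating-point rounding.
max_mismatch = max(abs(a + b) for a, b in zip(tone_7k, tone_1k))
```

This is why an anti-aliasing filter has to remove components above half the sampling frequency before the signal is sampled.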
Speech
generally contains frequency components with significant
energies up to about 10 kHz. Spectra of the majority
of speech sounds however only have significant spectral
energy content up to about 5 kHz. Only the fricative
and aspirated sounds exhibit significant spectral
energy above this value. Speech limited to 5 kHz is
perfectly intelligible and suffers no significant
degradation. Telephone speech is limited to 3.3 kHz
and although the transmission of speech through the
telephone network degrades its quality significantly,
it still remains largely intelligible, since the information-bearing
formants are concentrated in the region below
3.3 kHz. The sampling rate for speech will normally lie
in the range 6-20 kHz. In all such situations, a pre-sampling or anti-aliasing
filter is required in order to remove frequency components
above the Nyquist frequency.
Quantization
involves converting the amplitude of the sample values
into digital form using a finite number of binary
digits (bits). The number of bits used affects the
speech quality as well as the number of bits per second
(bit rate) required to store or transmit the digital
signal. Uniform and logarithmic are two types of quantization
used.
For
uniform quantization the amplitude range
is divided into $2^N$ distinct levels, where N
is the number of bits used to represent each digitized
sample. If two's complement arithmetic is used the
maximum and minimum signal values are
$\frac{2^{N-1} - 1}{2^{N-1}} V_{ref}$ and $-V_{ref}$,
where $V_{ref}$ is the d.c. reference voltage for the quantizer.
The quantizer step-size is equal to $2V_{ref}/2^N$.
For a full-scale sinusoidal input, the signal-to-noise ratio (SNR) in decibels* (dB)
is equal to $1.76 + 6.02N$.
*The
bel, named in honor of Alexander Graham Bell,
is defined as the common logarithm of the ratio of
two powers, P1 and P2. The decibel is one tenth of a bel, so the
ratio in decibels is given by $10 \log_{10}(P_2/P_1)$.
Hence
each additional bit in the quantizer contributes approximately
6 dB improvement in dynamic range. Speech exhibits
a dynamic range of some 50-60 dB and therefore in
theory 8- or 9-bit quantization should provide the
necessary signal-to-noise ratio for good-quality speech.
The mean-square-value of a typical speech signal with
peak values approaching that of the maximum allowable
value will be much less than a maximum amplitude sine
wave. Consequently the signal-to-noise ratio is reduced.
Different speakers produce different loudness levels
and even the same speaker may produce constantly fluctuating
levels. It is virtually impossible to optimally use
the available dynamic range. For these reasons, at
least 11- or 12-bit quantization is generally used
in high-quality speech-processing applications.
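The rule of thumb of about 6 dB per bit can be checked numerically with a minimal sketch. It assumes a reference voltage of 1, a near-full-scale sine-wave input and a rounding (mid-tread) quantizer; all function names are hypothetical.

```python
import math

def quantize_uniform(x, n_bits, v_ref=1.0):
    """Round x to the nearest of 2**n_bits two's-complement levels."""
    step = 2.0 * v_ref / (2 ** n_bits)
    code = round(x / step)
    # Clip to the two's-complement code range.
    code = max(-(2 ** (n_bits - 1)), min(2 ** (n_bits - 1) - 1, code))
    return code * step

def snr_db(signal, quantized):
    """Ratio of signal power to quantization-noise power, in dB."""
    signal_power = sum(s * s for s in signal)
    noise_power = sum((s - q) ** 2 for s, q in zip(signal, quantized))
    return 10.0 * math.log10(signal_power / noise_power)

def sine_snr_db(n_bits, n_samples=20000):
    """Measured SNR for a near-full-scale sine quantized to n_bits."""
    amplitude = 1.0 - 2.0 ** (-n_bits)  # keeps clipping error within half a step
    # 0.1234567 rad/sample is an arbitrary frequency chosen so that the
    # samples do not repeat on the quantizer grid.
    x = [amplitude * math.sin(0.1234567 * n) for n in range(n_samples)]
    q = [quantize_uniform(v, n_bits) for v in x]
    return snr_db(x, q)

snr_8 = sine_snr_db(8)
snr_12 = sine_snr_db(12)
```

Under these assumptions the measured values should lie close to 1.76 + 6.02N, and the four extra bits should buy roughly 24 dB. A real speech signal, with its much lower mean-square value relative to its peaks, would score noticeably worse, as the paragraph above explains.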
In
uniform (linear) quantization, the step-size must be large enough to accommodate
the relatively few peaks in the signal. Low-level
signals, such as fricatives, will then have a large
quantization error relative to their amplitude. A solution to this problem is
to make the quantizer vary step-size in accordance
with the signal amplitude, i.e. the step-size is increased
as signal level increases. Non-uniform quantization
can provide an improved signal-to-noise ratio with
fewer bits. This results in greater
efficiency in computer storage and data transmission.
The speech signal ultimately has to be re-converted
to its original form and an exact inverse of the quantizer's
characteristics is required. The process of non-linearly
quantizing (compressing) a signal and reconstructing
(expanding) it is known as companding,
a contraction of compressing-expanding.
Companding devices are available on a single chip
and are called codecs, a contraction
of coder-decoder. They normally use
an 8 kHz sampling rate at 8 bits per sample, which
corresponds to a bit-rate of 64 kbits/s.
Two
widely used companding types are the A-law
and the µ-law. The A-law is used largely
in Europe and the µ-law in the United States. Both
the A-law and the µ-law are based on a logarithmic
rather than a linear quantizer characteristic.
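A minimal sketch of logarithmic companding follows. It uses the continuous µ-law characteristic with µ = 255, a value assumed here for illustration; practical telephony codecs use a segmented approximation of this curve, which is not reproduced. Function names are hypothetical and samples are assumed to be normalized to the range -1 to 1.

```python
import math

MU = 255.0  # assumed value of the compression parameter

def mu_law_compress(x, mu=MU):
    """Map x in [-1, 1] onto [-1, 1] with a logarithmic characteristic."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Exact inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def quantize_uniform(x, n_bits):
    """Uniform quantizer over [-1, 1) with 2**n_bits levels."""
    step = 2.0 / (2 ** n_bits)
    code = max(-(2 ** (n_bits - 1)),
               min(2 ** (n_bits - 1) - 1, round(x / step)))
    return code * step

def compand(x, n_bits=8):
    """Compress, quantize uniformly, then expand."""
    return mu_law_expand(quantize_uniform(mu_law_compress(x), n_bits))

low_level = 0.001  # a low-amplitude sample, e.g. from a weak fricative
error_uniform = abs(quantize_uniform(low_level, 8) - low_level)
error_companded = abs(compand(low_level, 8) - low_level)
```

For the low-level sample the companded error is far smaller than the uniform 8-bit error, which is the effect the non-uniform step-size is meant to achieve.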
Log-PCM
codecs are widely used in digital telephony channels.
A sampling-rate of 8 kHz is employed and with 8 bits
allocated to each sample, the corresponding transmission
or storage rate is 64 kbits/s.
The
major limitation of the quantizers described thus
far is that they cannot handle amplitude variations
in the speech signal between different speakers, or
between voiced and unvoiced segments of a signal from
the same speaker. This can be ameliorated by dynamically
adjusting the step-size of the quantizer to adapt
to the changing signal characteristics. This technique
is referred to as adaptive pulse code modulation (APCM).
The adaptation can be performed at every sample (instantaneous
adaptation) or every few samples, i.e. every
10-20 ms (syllabic adaptation). APCM
uses 1 bit per sample less than log-PCM for
a given signal-to-noise ratio in telephone-quality
speech. Equivalently, for the same number of bits per sample, APCM provides an improvement of approximately
6 dB in signal-to-noise ratio over log-PCM.
Speech
signals may show considerable similarity between adjacent
samples, especially in regions of voiced speech. Consequently,
the difference signal formed by subtracting adjacent
samples has a lower variance and dynamic range than
the speech signal itself. For a given signal-to-noise
ratio it can be encoded using fewer bits. This feature
is exploited in differential PCM (DPCM)
by estimating the current speech sample, subtracting
it from the actual current sample and quantizing the
difference signal. The re-constructed signal is obtained
by adding the quantized difference signal to the signal
estimate. The more accurate the signal estimate, the smaller
the difference signal and its quantization
error, and the better the signal-to-noise
ratio. DPCM can obtain an improvement of about 5-6
dB in signal-to-noise ratio or a saving of about 1
bit per sample for a given signal-to-noise ratio.
The saving is not as large as might be expected, however,
since in actual practice the dynamic range of the
error signal can approach that of the signal itself.
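The DPCM loop described above can be sketched as follows. This is a minimal illustration assuming a fixed first-order predictor (coefficient 0.9) and an unbounded uniform quantizer for the difference signal; both are assumptions chosen for brevity, and the function names are hypothetical.

```python
import math

def dpcm_encode(samples, step, predictor_coeff=0.9):
    """Quantize the difference between each sample and its estimate."""
    codes, estimate = [], 0.0
    for x in samples:
        code = round((x - estimate) / step)      # quantized difference
        codes.append(code)
        reconstructed = estimate + code * step   # what the decoder will see
        estimate = predictor_coeff * reconstructed
    return codes

def dpcm_decode(codes, step, predictor_coeff=0.9):
    """Add each quantized difference to the running signal estimate."""
    out, estimate = [], 0.0
    for code in codes:
        reconstructed = estimate + code * step
        out.append(reconstructed)
        estimate = predictor_coeff * reconstructed
    return out

STEP = 0.01
speech_like = [math.sin(0.05 * n) for n in range(500)]  # slowly varying test signal
codes = dpcm_encode(speech_like, STEP)
decoded = dpcm_decode(codes, STEP)

largest_code = max(abs(c) for c in codes)
largest_pcm_code = max(abs(round(x / STEP)) for x in speech_like)
max_error = max(abs(d - x) for d, x in zip(decoded, speech_like))
```

Because the encoder predicts from its own reconstructed output, the decoder stays in step with it and the reconstruction error never exceeds half a quantizer step. The difference codes span a much smaller range than direct PCM codes with the same step-size, which is where the bit saving comes from.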
Delta
modulation (DM) is a special case of
DPCM in which the difference signal is quantized with a single bit. The bit-rate of a DM system is therefore equal to
the sampling rate, but the sampling rate must be several
times that of a conventional PCM system. If the sampling
rate is too low, a condition known as slope
overload may occur, where the system is incapable
of tracking a fast-changing signal. The overload can
be reduced by using either a higher sampling rate
or a larger step-size. A larger step-size, however, increases
the granular noise caused by the output alternating between
1s and 0s, which is particularly noticeable when no
input is present. At low signal-to-noise ratios, the
bit-rate for DM is slightly lower than that of log-PCM.
However, for signal-to-noise ratio approaching or
greater than telephone-quality, the required bit-rate
is higher.
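Slope overload and granular noise can both be seen in a minimal sketch of a fixed step-size delta modulator. The test signals (a steep ramp and a silent input) and the step-sizes are illustrative assumptions, and the function name is hypothetical.

```python
def delta_modulate(samples, step):
    """One bit per sample: 1 moves the staircase up, 0 moves it down."""
    bits, track = [], []
    approximation = 0.0
    for x in samples:
        bit = 1 if x >= approximation else 0
        approximation += step if bit else -step
        bits.append(bit)
        track.append(approximation)
    return bits, track

ramp = [0.5 * n for n in range(20)]  # rises by 0.5 per sample

# Step-size too small for the slope: the staircase falls further and
# further behind the input (slope overload).
_, slow_track = delta_modulate(ramp, step=0.1)
overload_error = ramp[-1] - slow_track[-1]

# Step-size larger than the slope: the staircase keeps up with the ramp.
_, fast_track = delta_modulate(ramp, step=0.6)
tracking_error = max(abs(x - t) for x, t in zip(ramp, fast_track))

# With no input, the output alternates between 1 and 0 (granular noise).
idle_bits, _ = delta_modulate([0.0] * 8, step=0.6)
```

The larger step-size removes the overload but leaves an idle pattern whose amplitude equals the step-size, which is the trade-off described above.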
A
substantial improvement can be obtained by adding
an adaptive quantizer and/or an adaptive predictor.
The resulting systems are all generally referred to
as adaptive differential PCM (ADPCM)
systems. Synchronization of the transmitter and receiver
is achieved by having essentially a replica of the
receiver embedded in the transmitter and using only
the transmitted difference signal in determining step-size
adaptation, in both the quantizer and the inverse
quantizer, and in predicting the next signal estimate.
Step-size adaptation is typically a mixture of both
instantaneous and syllabic adaptation and is achieved
by making the adaptation signal dependent on the magnitude
of the difference signal as well as its rate of change.
In this way, the system can adapt better to both stationary
and non-stationary speech. The predictor coefficients
may be calculated and updated every 10-20 ms, by solving
a set of linear equations. Alternatively, they may
be updated every sample, using an optimization algorithm,
which uses a hill-climbing technique to minimize the
error between the signal estimate and the actual signal.
The
complexity of an ADPCM system is directly related
to the complexity of the predictor algorithm. Low-to-medium
complexity systems use an adaptive quantizer and a
fixed predictor and are capable of reproducing speech
with slightly less than telephone-quality at 3 or
4 bits/sample at an 8 kHz sampling rate. High-complexity
systems, using an adaptive quantizer and an adaptive
predictor and operating at 32 kbits/s, can improve
the signal-to-noise ratio further and can reproduce
speech with better than telephone-quality.
Improvement
in delta modulation (DM), described above, can be
obtained by adapting the step-size of the quantizer.
In a rapidly changing signal condition, the step-size
can increase quickly up to its maximum value reducing
the possibility of overload. During silence, the step-size
is decreased to its minimum value, which determines
the level of granular noise in the idle condition.
Adaptive DM (ADM) systems using very simple algorithms
are capable of reproducing speech of very good quality
at bit-rates in the range of 32 to 48 kbits/s.
The
main drawback with adaptive delta modulation is that
since step-size can vary instantaneously with sudden
changes in the input signal, the system can take a
long time to recover from transmission errors, causing
degradation of the speech quality. A solution to this
problem is to make the step-size variation slower
than the instantaneous variation in the speech signal.
This has the effect of increasing the likelihood of
slope-overload distortion, but leads to a reduction
in the level of granular noise. The step-size control
signal is effectively generated by low-pass filtering
the step-size changes indicated by the
most recent (3-5) bit outputs of the quantizer. The
time-constant of the low-pass filter is typically
of the order of 5-10 ms. Systems employing these principles
are generally known as continuously variable
slope delta modulation (CVSD)
systems. CVSD systems can produce log-PCM quality
speech at about 40 kbits/s and slightly less than
telephone-quality at 32 kbits/s. This does not represent
any improvement over ADM. However, CVSD can prove
useful for communications-quality speech at bit-rates
of 16 kbits/s and below. Added attractions of CVSD in
this area of application are its simplicity and its
robustness to transmission errors when used on a noisy
channel.
Extracting acoustic information from speech
The
source-filter model is the basis for many
of the techniques used to extract acoustic information
from the speech signal. It consists of a uniform tube
or pipe of length L, with a sound source
at one end to represent vocal cords and open at the
other to represent lips. It has resonances at odd multiples
of a lowest resonant frequency, $f_0, 3f_0, 5f_0, \ldots$, where
$f_0 = c/4L$ and c is the velocity of sound
in air, i.e. approximately 340 meters/second. Resonant
frequency values of 500 Hz, 1500 Hz, 2500 Hz, etc.
are produced by a vocal tract (pipe) 17 cm
long. Resonances produced in the vocal tract are called
formants. In this model, the excitation
source is assumed to be linearly separable from the
transmission characteristics of the vocal tract, which
are represented by a quasi-time-invariant filter.
The speech waveform itself is then assumed to be the
output of this filter in response to the excitation
source, which is either a quasi-periodic pulse generator
(voiced sounds), a random-noise generator (unvoiced
sounds) or a mixture of both (voiced fricatives).
Speech analysis is primarily the process
of estimating the relatively slowly time-varying parameters
which specify the filter, from a speech signal that
is assumed to be the output of that filter. Other
goals include voiced/unvoiced classification and pitch-period
estimation for voiced speech.
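The resonances of the uniform-tube model can be computed directly. A tube closed at the source end and open at the lips resonates at odd multiples of c/4L; the sketch below assumes c = 340 m/s as quoted above, and the function name is hypothetical.

```python
def tube_resonances(length_m, n_resonances=3, sound_velocity=340.0):
    """Resonant frequencies (Hz) of a uniform tube closed at one end.

    The k-th resonance is (2k - 1) * c / (4 * L).
    """
    lowest = sound_velocity / (4.0 * length_m)
    return [(2 * k - 1) * lowest for k in range(1, n_resonances + 1)]

formants_17cm = tube_resonances(0.17)
formants_short = tube_resonances(0.14)  # a shorter tract, for comparison
```

For a 17 cm tube this gives approximately 500, 1500 and 2500 Hz. A shorter tube shifts every resonance upwards in proportion, which is one reason why formant frequencies differ between speakers.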
According
to the source-filter model there is an overall
-6
dB/octave trend as frequency increases. This is a
combination of a -12 dB/octave trend due to the voiced
excitation source and +6 dB/octave trend due to radiation
from the mouth. This means that, for each doubling
in frequency, the signal amplitude, and hence the
measured vocal tract response, is reduced by a factor
of 2. It is therefore desirable to compensate for
the -6 dB/octave roll-off by pre-processing the speech
signal to give a +6 dB/octave lift in the appropriate
range so that the measured spectrum has a similar
dynamic range across the entire frequency band. This
is referred to as pre-emphasis. It is
unnecessary to apply pre-emphasis in the case of unvoiced
speech since there is no spectral trend to remove.
However, pre-emphasis is usually applied to unvoiced speech as well, for simplicity
of implementation.
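Pre-emphasis is commonly implemented as a first-order difference filter. The sketch below assumes the form y[n] = x[n] - a·x[n-1] with a = 0.95 and a 10 kHz sampling rate; the filter form, the coefficient and the function names are illustrative assumptions rather than details given in the text.

```python
import math

def pre_emphasize(samples, coeff=0.95):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1]."""
    out, previous = [], 0.0
    for x in samples:
        out.append(x - coeff * previous)
        previous = x
    return out

def pre_emphasis_gain_db(freq_hz, sample_rate_hz, coeff=0.95):
    """Gain of the filter at a given frequency, in dB."""
    w = 2.0 * math.pi * freq_hz / sample_rate_hz
    # |1 - coeff * exp(-jw)|^2 = 1 - 2*coeff*cos(w) + coeff^2
    magnitude_squared = 1.0 - 2.0 * coeff * math.cos(w) + coeff * coeff
    return 10.0 * math.log10(magnitude_squared)

SAMPLE_RATE = 10000.0  # assumed sampling rate
lift_per_octave = (pre_emphasis_gain_db(2000.0, SAMPLE_RATE)
                   - pre_emphasis_gain_db(1000.0, SAMPLE_RATE))
flattened = pre_emphasize([1.0, 1.0, 1.0])
```

Between 1 kHz and 2 kHz this filter lifts the spectrum by a little under 6 dB, close to the compensation required. A constant input is almost cancelled after the first sample, showing the attenuation of low frequencies.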
Speech
analysis may be performed in either the frequency-
or time-domain. The major goal is to estimate the
frequency response of the vocal tract. Techniques
involving a bank of bandpass filters (spectral
segmentation), discrete Fourier transformation (DFT)
and homomorphic or cepstral processing may be used.
Time-domain measurements such as autocorrelation,
zero-crossing rate and signal energy can also be
used to extract limited but useful information about
the speech signal. Since the rate at which the speech
spectrum changes is limited by physiological constraints
of the lips, tongue, jaw, etc., most speech analysis
systems operate on a time-varying basis, using short
segments of speech of 10 to 30 ms duration selected at
uniformly spaced time intervals.
Banks
of filters containing from four to one hundred filters have
been used in systems to cover the typical speech frequency
band of 0-5 kHz. Very often the spacing and bandwidth
of the analyzing filters progressively increase with
frequency in order to mimic the decreasing frequency
resolution of the human ear.
The
Fourier transform provides a mathematical basis for
determining the frequency spectrum of a continuous
time-domain signal. The discrete Fourier transform
(DFT) is an adaptation of this technique.
The speech signal is segmented (windowed) as in the
banks of filters approach. However, the
windows used in the DFT approach may be overlapped.
The duration of the analyzing window must span a few
pitch-periods of the speech signal in order to obtain
good frequency resolution. There is considerable computation
required with the DFT approach: N samples of
signal involve $N^2$ calculations, each requiring
a complex multiplication and addition. The fast Fourier
transform (FFT) exploits the inherent
redundancy in the DFT to reduce the number of calculations
to the order of $N \log_2 N$, while achieving identical results.
For voiced speech, the spanning of a few pitch-periods
results in a frequency spectrum in which the discrete
line spectrum of the periodic excitation is multiplied
by the vocal tract spectral envelope. In order to
extract the vocal tract spectral envelope, a technique
known as cepstral truncation is used.
It removes pitch ripple from high-resolution
spectra leaving only the vocal tract transfer function
information. It trades off temporal resolution for
increased spectral resolution. If pitch-period is
to be estimated from the cepstrum, the analysis-window
should span at least four pitch-periods. Consequently,
if both pitch-period and spectral-envelope information
are to be computed using this method, two analyzing
windows are normally employed. Using an analyzing-window
that spans a few pitch-periods results in spectral
properties and pitch-periods being averaged.
The averaging effect is small and insignificant relative
to changes encountered in normal speech.
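The operation counts quoted above for the DFT and the FFT can be made concrete with a minimal sketch. It contrasts a direct DFT, which evaluates N terms for each of N output bins, with a recursive radix-2 FFT. The radix-2 form is one common FFT variant, assumed here for illustration, and it requires the window length to be a power of two.

```python
import cmath
import math

def dft(x):
    """Direct DFT: N output bins, each a sum of N complex products."""
    n_points = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points)]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n_points = len(x)
    if n_points == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    half = n_points // 2
    twiddled = [cmath.exp(-2j * cmath.pi * k / n_points) * odd[k]
                for k in range(half)]
    return ([even[k] + twiddled[k] for k in range(half)]
            + [even[k] - twiddled[k] for k in range(half)])

N = 64
# Test window containing exactly 4 cycles of a sine wave.
window = [math.sin(2.0 * math.pi * 4 * n / N) for n in range(N)]
slow = dft(window)
fast = fft(window)

largest_difference = max(abs(a - b) for a, b in zip(slow, fast))
peak_bin = max(range(N // 2), key=lambda k: abs(fast[k]))
```

Both routines return the same spectrum to within rounding error, with the peak in the bin corresponding to the number of cycles in the window.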
The
autocorrelation function (ACF)
essentially compares a signal with a delayed copy
of itself. A short-time autocorrelation function
is obtained by isolating successive segments of the speech
signal. The short-time autocorrelation function exhibits
peaks at time-shifts corresponding to multiples of
the pitch-period. At these points the speech signal
is in phase with the delayed version of itself, giving
high correlation values. From this it appears that
the short-time autocorrelation function should be
a powerful technique for estimating the pitch-period
of voiced speech. However, there are situations when
it is no easier to automatically detect the peaks
in the short-time autocorrelation function than in
the time waveform.
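A minimal sketch of pitch-period estimation from the short-time autocorrelation function follows. The synthetic voiced frame (three harmonics of a 100 Hz fundamental at an assumed 8 kHz sampling rate) and the lag search range are illustrative assumptions; real speech is far less well behaved, as noted above.

```python
import math

def short_time_autocorrelation(frame, max_lag):
    """ACF of one frame for lags 0..max_lag."""
    n_samples = len(frame)
    return [sum(frame[n] * frame[n + lag] for n in range(n_samples - lag))
            for lag in range(max_lag + 1)]

def estimate_pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) of the largest ACF peak within the search range."""
    acf = short_time_autocorrelation(frame, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda lag: acf[lag])

SAMPLE_RATE = 8000  # assumed sampling rate
PERIOD = 80         # 100 Hz fundamental at 8 kHz
frame = [math.sin(2 * math.pi * n / PERIOD)
         + 0.5 * math.sin(4 * math.pi * n / PERIOD)
         + 0.25 * math.sin(6 * math.pi * n / PERIOD)
         for n in range(400)]  # 50 ms frame spanning five pitch-periods

# Search lags corresponding to pitches between 400 Hz and 50 Hz.
period_estimate = estimate_pitch_period(frame, min_lag=20, max_lag=160)
pitch_hz = SAMPLE_RATE / period_estimate
```

Because fewer products contribute at longer lags, the peak at one pitch-period is larger than the peak at two, so the shortest multiple is selected for this clean signal.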
Sample
values of speech at instant n can be approximated
as a linear combination of the previous p speech
samples, i.e.
$\hat{x}[n] = a_1 x[n-1] + a_2 x[n-2] + \cdots + a_p x[n-p]$,
where
p = 12 is normally sufficient for both voiced
and unvoiced speech, and $a_1, a_2, \ldots, a_p$ are predictor coefficients. Prediction
errors, however, occur in each sample. By minimizing
the mean squared error between the actual speech samples
and the linearly-predicted ones, the predictor coefficients
can be determined by solving a set of linear equations.
A set of predictor coefficients can predict the speech
signal reasonably accurately over stationary portions.
In order to match the time-varying properties of the
speech signal, a new set of predictor coefficients
is calculated every 10-30 ms. Two methods for determining
the value of the coefficients are known as the
autocorrelation method and the covariance
method. The main attraction of linear predictive
analysis is that it offers great accuracy and speed
of computation. In addition, the theory underlying
the method has been the subject of intensive research
in recent years and, as a result, is highly developed
and well understood. Based on this theory, a large
variety and range of applications of linear predictive
analysis to speech processing have evolved. Schemes
have been devised for estimating all the basic speech
parameters from linear predictive analysis, such as
spectrum and formant estimation, pitch detection and
glottal pulse shape estimation.
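The autocorrelation method mentioned above can be sketched as follows. The linear equations are solved here with the Levinson-Durbin recursion, a standard choice for this method that is assumed for illustration and not named in the text; function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Short-time autocorrelation of one frame for lags 0..max_lag."""
    n_samples = len(frame)
    return [sum(frame[n] * frame[n + k] for n in range(n_samples - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for a_1..a_p.

    Returns (coefficients, residual_error), where the prediction is
    x_hat[n] = a_1*x[n-1] + ... + a_p*x[n-p].
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error                    # reflection coefficient
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= (1.0 - k * k)
    return a[1:], error

def lpc(frame, order):
    """Predictor coefficients of a frame by the autocorrelation method."""
    return levinson_durbin(autocorrelation(frame, order), order)

# Test signal: impulse response of a known two-pole filter,
# x[n] = 1.3*x[n-1] - 0.4*x[n-2], excited by a unit impulse.
signal = [1.0, 1.3]
for _ in range(198):
    signal.append(1.3 * signal[-1] - 0.4 * signal[-2])

coefficients, residual = lpc(signal, order=2)
```

For this synthetic signal the recovered coefficients match the generating filter closely. For speech, the same routine would be applied to each windowed frame with p around 12.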
An
averaging process occurs, particularly where the spectrum
is changing rapidly, with consequent blurring
of rapidly varying spectral properties when the speech
analysis is performed over a few tens of milliseconds
or a few pitch-periods. In this type of pitch-asynchronous
analysis, the computed spectrum is the product of
the excitation spectrum and the vocal tract spectrum
and the two are not readily separated. The effect
is to distort the vocal tract spectral
envelope. These problems can be overcome by a pitch-synchronous
analysis where single pitch-periods of the
speech signal are identified and analyzed in isolation.
To eliminate the effects of glottal pulse shape the
analysis should be carried out over closed-glottis
regions where the vocal tract is in force-free oscillation.
The main drawback with any type of pitch-synchronous
analysis is that it is difficult to identify accurately
and automatically the start and finish of each pitch-period
and the closed-glottis region within each pitch-period.
Zero-crossing
rate is a measure of the number of times in
a given time interval that the amplitude of the speech
signal passes through a value of zero. Because of
the noise-like nature of unvoiced speech, its zero-crossing rate
is greater than that of voiced speech. Zero-crossing
rate is an important parameter for voiced/unvoiced
classification and for endpoint detection. It is often
used as part of front-end processing in automatic
speech recognition systems.
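A minimal sketch of the zero-crossing rate measurement follows. The test frames are synthetic stand-ins chosen for illustration: a 100 Hz sine wave for voiced speech and uniform random noise for unvoiced speech, both at an assumed 8 kHz sampling rate.

```python
import math
import random

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

SAMPLE_RATE = 8000  # assumed sampling rate
rng = random.Random(1)  # seeded so the example is repeatable

voiced_like = [math.sin(2 * math.pi * 100 * n / SAMPLE_RATE + 0.3)
               for n in range(400)]
unvoiced_like = [rng.uniform(-1.0, 1.0) for _ in range(400)]

zcr_voiced = zero_crossing_rate(voiced_like)
zcr_unvoiced = zero_crossing_rate(unvoiced_like)
```

The noise-like frame crosses zero on roughly half of its samples, while the low-frequency periodic frame crosses only twice per cycle, which is the contrast exploited in voiced/unvoiced classification.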
Detecting
when a speech utterance begins and ends is a basic
problem in speech processing. This is often referred
to as endpoint detection. Endpoint detection
is difficult if the speech is uttered in a noisy environment.
This is especially true when unvoiced sounds occur
at the beginning or end of an utterance.
Many
endpoint detection algorithms are based on measurement
of the short-time signal energy and zero-crossing
rate and attempt to detect as accurately as possible
the changes that these quantities undergo at the beginning
and end of an utterance. The basic operation of a
simple algorithm is as follows. A small sample of
background noise is taken during a silence
interval just prior to commencement of the speech
signal. The short-time energy function of the entire
utterance is then computed. A speech threshold is
determined which takes into account silence energy
and peak energy. Initially, endpoints are assumed
to occur where the signal energy crosses this threshold.
Corrections to these initial estimates are then made
by computing zero crossing rate in the vicinity of
endpoints and by comparing it with that of silence.
If detectable changes in zero-crossing rate occur
outside the initial thresholds, endpoints are re-designated
to points at which the changes take place.
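The simple endpoint algorithm outlined above can be sketched as follows. The text does not give the threshold formulas, so the ones used here (a fixed fraction of the range between silence and peak energy, and a fixed margin above the silence zero-crossing rate) are assumptions, as are the parameter values, the synthetic test signal and all function names.

```python
import math
import random

def frame_energy(frame):
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def detect_endpoints(samples, frame_len, n_silence_frames=5,
                     energy_fraction=0.05, zcr_margin=0.1, max_extend=10):
    """Return (first_frame, last_frame) of the utterance, or None."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energy = [frame_energy(f) for f in frames]
    zcr = [zero_crossing_rate(f) for f in frames]
    # Background statistics from the leading silence interval.
    silence_energy = sum(energy[:n_silence_frames]) / n_silence_frames
    silence_zcr = sum(zcr[:n_silence_frames]) / n_silence_frames
    # Assumed threshold: silence energy plus a fraction of the peak range.
    threshold = silence_energy + energy_fraction * (max(energy) - silence_energy)
    active = [i for i, e in enumerate(energy) if e > threshold]
    if not active:
        return None
    start, end = active[0], active[-1]
    # Move the endpoints outwards while the zero-crossing rate stays
    # clearly above that of silence (e.g. a weak fricative).
    zcr_threshold = silence_zcr + zcr_margin
    steps = 0
    while start > 0 and steps < max_extend and zcr[start - 1] > zcr_threshold:
        start -= 1
        steps += 1
    steps = 0
    while (end < len(frames) - 1 and steps < max_extend
           and zcr[end + 1] > zcr_threshold):
        end += 1
        steps += 1
    return start, end

RATE, FRAME = 8000, 80  # assumed 8 kHz sampling, 10 ms frames
rng = random.Random(0)

def hum(n_samples):
    """Low-level 50 Hz background, standing in for silence."""
    return [0.001 * math.sin(2 * math.pi * 50 * i / RATE)
            for i in range(n_samples)]

fricative = [rng.uniform(-0.05, 0.05) for _ in range(4 * FRAME)]
vowel = [math.sin(2 * math.pi * 200 * i / RATE) for i in range(20 * FRAME)]
utterance = hum(10 * FRAME) + fricative + vowel + hum(10 * FRAME)

energy_only = detect_endpoints(utterance, FRAME, max_extend=0)
refined = detect_endpoints(utterance, FRAME)
```

In this synthetic utterance the energy threshold alone finds only the vowel, while the zero-crossing correction moves the start point back to include the weak fricative that precedes it.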
Speech
is analyzed in a sequence of time frames in many parametric
speech analysis techniques. Each frame is represented
by a set of k numerical values, for example
16 filter bank outputs, 12 LPC coefficients etc. In
other words, each frame is represented by a k-dimensional
vector in a k-dimensional space. Usually each
parameter in the vector is quantized separately using
a specific number of bits. This is referred to as
scalar quantization which implies that
speech frames are expected to occur uniformly throughout
vector space. A more appropriate approach is to use
vector quantization where vector space
is divided into a number of non-uniform regions (bins)
with each region being represented by a single vector
giving the centroid of the region. The collection
of vector centroids is called a code book.
Each element of the code book is given a unique label
(address).
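Encoding a frame with a code book amounts to finding the nearest centroid and transmitting its label. The sketch below assumes a squared Euclidean distance measure and a tiny hand-made two-dimensional code book; both are illustrative assumptions, and the text does not describe how the centroids are trained.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def vq_encode(vector, code_book):
    """Label (index) of the code book centroid nearest to the vector."""
    return min(range(len(code_book)),
               key=lambda i: squared_distance(vector, code_book[i]))

def vq_decode(label, code_book):
    """Centroid that stands in for every vector in its region."""
    return code_book[label]

# Hypothetical 2-dimensional code book with three centroids.
code_book = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]

label_a = vq_encode((0.9, 1.2), code_book)
label_b = vq_encode((3.0, 0.2), code_book)
approximation = vq_decode(label_b, code_book)
```

With three centroids a label needs only two bits, regardless of how many parameters each frame vector contains.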
Formant
frequencies, amplitudes, and bandwidths of the vocal
tract characterize individual speech sounds. A number
of algorithms have been developed to track formants.
One of the simplest attempts to identify three formants by locating
spectral peaks within the frequency band of 0-3 kHz.
If three peaks are found they are assigned to the
formants. Tests have shown that with male speech this
occurs approximately 85-90 percent of the time. A
nearest-neighbor criterion is used
for other cases, i.e. one, two, or more peaks. When
one peak is found, its formant label is assigned as
the nearest formant neighbor from the previous frame
and the two vacant formant slots are filled with their
values in the previous frame. When two peaks are found,
their formant labels are assigned in accordance with
the nearest-neighbor criterion as before and the missing
formant assumes its value in the previous frame. In
the case of four or more peaks, the three peaks closest
to the formant values of the previous frame are identified
and labeled and the remainder are discarded. Additional
processing is used in some algorithms to resolve closely-spaced
formants and place restrictions on the degree to which
formants can move between frames. Dynamic programming
and neural networks may be used to obtain smooth formant
tracks.
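The nearest-neighbor rules above can be sketched roughly as follows. This is only an illustration under stated assumptions: peak picking has already been done, frequencies are in Hz, ties and conflicts are resolved greedily by smallest distance, and a frame with no peaks simply keeps the previous values (a case the text does not describe). The function name track_formants is hypothetical.

```python
def track_formants(peaks, previous):
    """Assign spectral peaks (Hz) to F1-F3 given the previous frame's formants."""
    peaks = sorted(p for p in peaks if 0.0 <= p <= 3000.0)
    if len(peaks) == 3:
        return list(peaks)            # three peaks: assign directly
    current = list(previous)          # vacant slots keep previous values
    # All (distance, formant slot, peak index) pairs, nearest first.
    pairs = sorted(
        (abs(p - f), slot, i)
        for slot, f in enumerate(previous)
        for i, p in enumerate(peaks)
    )
    used_slots, used_peaks = set(), set()
    for _, slot, i in pairs:
        if slot in used_slots or i in used_peaks:
            continue                  # slot or peak already assigned
        current[slot] = peaks[i]
        used_slots.add(slot)
        used_peaks.add(i)
    return current                    # surplus peaks are discarded

PREVIOUS = [500.0, 1500.0, 2500.0]
print(track_formants([1600.0], PREVIOUS))
print(track_formants([450.0, 2400.0], PREVIOUS))
print(track_formants([480.0, 1000.0, 1550.0, 2600.0], PREVIOUS))
```

The sketch places no limit on how far a formant may move between frames; the additional processing mentioned above would be needed for that.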
Phonetic
analysis attempts to derive the phonemic structure
of an utterance directly from the speech signal. The
sequence of phonemes corresponds to a sequence of articulatory
gestures, which have certain, well-defined acoustic
equivalents. There is no simple one-to-one relationship
between a phoneme and its acoustic equivalents. Co-articulation
effects cause overlap between phonemic boundaries
and the acoustic correlates of a phoneme are modified
by its neighbors. Speakers can distort the acoustic
characteristics of speech sounds to such an extent
that these sounds no longer possess the characteristics
normally associated with them. For example, the vowel
in destroy can become unvoiced, so that
it no longer resembles a vowel.
Phonetic
analysis may be performed by segmenting the speech
signal into phoneme-like units and assigning an appropriate
label to each unit. This process, known as segmentation
and labeling, involves an analysis of the time-varying
acoustic features of the speech signal. Features used
may include pitch, zero-crossing rate, energy profiles,
spectral shape and formant frequencies and trajectories.
Speech is analyzed in 10-20 ms time-frames and the
acoustic features are extracted. Then the signal is
segmented into phonemic-size units. The acoustic feature
set is analyzed and boundaries are placed where values
exceed pre-determined thresholds. Features used include
energy profiles, zero-crossing rate, and spectral
rate of change. In some systems, the apparent phonetic
context is employed to determine the features and
thresholds to use, instead of using the same features
for each boundary decision. For example, if a time
region is thought to be one of the glides /w, r, l/
then the boundary between the glide and the adjacent
vowel is best located by examining the formant frequency
trajectories. This type of algorithm is based on acoustic-phonetic
knowledge which has to be formalized and stored as
rules in the system. Unfortunately, there is limited
knowledge in this area.
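Threshold-based boundary placement can be sketched as below. This is a simplified illustration, assuming each 10-20 ms frame is already a feature vector and that spectral rate of change is approximated by the Euclidean distance between consecutive frames. The function name, the toy frames and the threshold value are hypothetical.

```python
import math

def find_boundaries(frames, threshold):
    """Return frame indices where the frame-to-frame change exceeds the threshold."""
    boundaries = []
    for t in range(1, len(frames)):
        change = math.dist(frames[t - 1], frames[t])  # rate-of-change proxy
        if change > threshold:
            boundaries.append(t)
    return boundaries

# Three steady regions separated by two abrupt changes.
FRAMES = [(1.0, 0.0), (1.1, 0.0), (1.0, 0.1),
          (5.0, 3.0), (5.1, 3.0),
          (0.0, 0.0), (0.1, 0.0)]
BOUNDARIES = find_boundaries(FRAMES, threshold=1.0)
print(BOUNDARIES)
```

A context-dependent system of the kind described above would pick both the feature and the threshold according to the suspected phonetic context instead of using one fixed pair.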
After
boundaries have been located, labels have to be assigned
to each phonemic unit. This is done by comparing the
features of the unit to be labeled with a set of prototypical
features for each phoneme, which are stored in the
system. For speaker-independent analysis, these prototypical
features should be based on data from a wide variety
of speakers. Individual vowels normally can be identified
by the steady-state values of the first three formant
frequency values extracted from the center of the
vowel. Diphthongs are characterized by the values
of the formant frequencies in the initial and final
vowel targets, as well as the rate of change of the
formant trajectories. Nasals and glides always occur
adjacent to a vowel and can be characterized by formant
transitions into and out of the sound. Fricatives
can be detected through the presence or absence of
turbulent noise and often can be identified by their
overall spectral shape. Plosives are characterized
by a period of silence followed by an abrupt increase
in signal level at the point of release, followed
by a burst of frication noise. Plosive types can be
identified by measurements which include the frequency
spectrum of the burst, the formant transitions in
adjacent vowels, and voice onset time.
Phonetic
analysis using segmentation and labeling is extremely
error-prone. Certain segments may be incorrectly identified.
More than one phoneme may be grouped together in a
single segment and a single phoneme may be split into
more than one segment. One way of accounting for the
possibility of errors is to present the output of
the phonetic analyzer in the form of a phonetic
lattice. This involves listing a number of
candidate phonemes for each time-unit together with
a confidence measure. Ambiguities are then resolved
by higher levels of linguistic processing, which normally
involves matching the possible phonemic sequences
against entries in the system lexicon, in order to
postulate possible utterances and select the most
likely.
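A phonetic lattice and its matching against a lexicon might look like the following sketch. It assumes exactly one phoneme per time-unit and scores a word as the product of the confidence measures of its phonemes; real systems must also handle merged and split segments. The phoneme labels, confidence values and the four-word lexicon are invented for illustration.

```python
# One dictionary per time-unit: candidate phoneme -> confidence measure.
LATTICE = [
    {"b": 0.6, "p": 0.4},
    {"ae": 0.7, "eh": 0.3},
    {"t": 0.6, "d": 0.4},
]

# Hypothetical system lexicon: word -> phoneme sequence.
LEXICON = {
    "bat": ["b", "ae", "t"],
    "bad": ["b", "ae", "d"],
    "pet": ["p", "eh", "t"],
    "bed": ["b", "eh", "d"],
}

def score(phonemes, lattice):
    """Product of confidences along the lattice, 0.0 if the word cannot fit."""
    if len(phonemes) != len(lattice):
        return 0.0
    total = 1.0
    for phoneme, candidates in zip(phonemes, lattice):
        total *= candidates.get(phoneme, 0.0)
    return total

def rank_words(lattice, lexicon):
    """Return (score, word) pairs, most likely word first."""
    scored = [(score(phonemes, lattice), word) for word, phonemes in lexicon.items()]
    return sorted(scored, reverse=True)

RANKED = rank_words(LATTICE, LEXICON)
print(RANKED[0][1])
```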
Another
approach is to propose sequences of phonemes, use
a speech synthesis by rule algorithm to generate the
acoustic templates, and then compare these with the
input templates. Alternative sequences can be evaluated
and one that provides the closest match is selected.
Speech
recognition with dynamic time-warping
Speech
recognition systems based on acoustic pattern matching
depend on a technique called dynamic time-warping
(DTW) to accommodate time-scale variations.
In isolated word recognition systems the acoustic
pattern or template of each word in the vocabulary
is stored as a time sequence of features (frames),
derived using one of the speech analysis techniques
described above. Recognition is performed by comparing
the acoustic pattern of the word to be recognized
with the stored patterns and choosing the word which
it matches best as the recognized word. In a speaker-independent
system, an average set of patterns is
previously stored in the system and no training is
required of a speaker. The function of the pattern
matching block of the isolated-word speech recognition
system in the figure below is to determine the similarity
between the input word pattern and the stored word
patterns. This involves not only distance computation
but also time-alignment of the input and reference
patterns because a word spoken on different occasions,
even by the same speaker, will exhibit both local
and global variation in its time-scale. The simplest
method of time-aligning two patterns of unequal length
is to map the time-axis of one onto the time-axis
of the other in a linear fashion. This method has
drawbacks since it does not guarantee that the internal
parts of the patterns will be properly aligned. It
does give proper alignment of the beginning and end
of the patterns. An alignment function that properly
matches the internal features of the pattern is required.
Much of the computational effort in speech pattern
matching is in deriving a near optimal alignment function.
This can be achieved by the technique called dynamic
time-warping (DTW).
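A basic form of the alignment can be sketched as follows. This is a textbook-style illustration, not necessarily the variant used in any particular system: it assumes Euclidean frame distance, the simplest symmetric step pattern (horizontal, vertical, diagonal), and no adjustment window or slope constraints. The word templates are made-up one-dimensional feature sequences.

```python
import math

def dtw_distance(pattern_a, pattern_b):
    """Minimum accumulated frame distance over all monotonic alignments."""
    n, m = len(pattern_a), len(pattern_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(pattern_a[i - 1], pattern_b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = local + min(cost[i - 1][j],      # stretch pattern_b
                                     cost[i][j - 1],      # stretch pattern_a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def recognize(input_pattern, templates):
    """Choose the stored word whose template matches the input best."""
    return min(templates, key=lambda word: dtw_distance(input_pattern, templates[word]))

TEMPLATES = {
    "one": [(0.0,), (1.0,), (2.0,), (1.0,), (0.0,)],
    "two": [(2.0,), (2.0,), (0.0,), (0.0,), (2.0,)],
}
# The word "one" spoken more slowly, with locally stretched frames.
SLOW_ONE = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,), (1.0,), (0.0,)]
print(recognize(SLOW_ONE, TEMPLATES))
```

The end points are forced to align, as with linear mapping, but the internal frames are free to stretch or compress locally.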

Basic
isolated-word recognition in DTW systems
In
speaker-independent speech recognition
systems there is no training of the system to recognize
a particular speaker and so the stored word patterns
must be representative of the collection of speakers
expected to use the system. The word templates are
derived by first obtaining a large number of sample
patterns from a cross-section of talkers of different
sex, age-group and dialect, and then clustering these
to form a representative pattern for each word. A
representative pattern can be created by averaging
all the patterns in a word cluster. A dynamic time-warping
algorithm would normally be employed to compute a
time-alignment function which takes into account the
different time-scales. Another approach is to select
a representative pattern from the middle of each cluster.
Because of the great variability in speech, it is
generally impossible to represent each word cluster
with a single pattern. So each cluster is sub-divided
into sub-clusters and a number of tokens for
each word, e.g. up to twelve, are stored in the system.
All tokens of each word are matched against the input
word.
Choosing
which stored pattern most closely matches the input
pattern (decision rule) is performed in the last stage
in the speech recognition system in the figure above.
An algorithm known as the nearest-neighbor
(NN) rule is used. In speaker-independent
systems a modified version known as the K-nearest-neighbor
(KNN) rule may be preferred.
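The two decision rules can be contrasted with a short sketch. It assumes the pattern-matching stage has already produced one distance per stored token, and it uses one common formulation of the K-nearest-neighbor rule in which the K smallest distances for each word are averaged; other formulations exist. The distance values are invented.

```python
def nn_rule(distances):
    """Pick the word owning the single closest token."""
    return min(distances, key=lambda word: min(distances[word]))

def knn_rule(distances, k=2):
    """Pick the word whose K closest tokens have the smallest average distance."""
    def average_of_k_best(word):
        best = sorted(distances[word])[:k]
        return sum(best) / len(best)
    return min(distances, key=average_of_k_best)

# Distances between the input word and three stored tokens of each word.
DISTANCES = {
    "yes": [0.9, 3.0, 3.2],   # one lucky close token, the rest far away
    "no":  [1.0, 1.1, 1.3],   # consistently close
}
print(nn_rule(DISTANCES), knn_rule(DISTANCES, k=2))
```

With several tokens per word, averaging over K tokens makes the decision less sensitive to a single outlying token.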
DTW-based
algorithms have been devised to recognize connected
groups of words. One of the first was a two-level
dynamic programming algorithm consisting of
a word-level matching stage followed by a phrase-level
matching stage. In the word-level stage, each stored
word pattern is matched against all possible regions
in the connected-word input pattern. An adjustment
window is used to define a region in which each word
in the connected-word pattern may start and end. This
gives rise to a matrix of partial distances. In the
phrase-level stage, dynamic programming is performed
on the partial distances to obtain the sequence of
words which gives rise to the minimum total distance.
The
level-building algorithm is more efficient.
In it, dynamic time-warping is applied at a number
of levels, up to the maximum number of words anticipated
in the connected-word string.
In
the one-stage or Bridle
algorithm each word pattern is matched against a first
portion of the input pattern using a dynamic time-warping
algorithm. The few word patterns with the best scores, together
with their corresponding ending positions in the input
pattern, are recorded. Then each word pattern is matched
against the second portion of the input pattern starting
at the points where the last word matches ended. This
process is repeated until the end of the input pattern
is reached, and generates what is called a word-decision
tree which grows as the input is processed.
This type of approach is often referred to as a beam
search. It avoids considering unlikely interpretations
of the input pattern, but keeps the options open in
case of ambiguities.
In
some specialized recognition applications such as
airline booking there is some knowledge of the order
in which the vocabulary words will be spoken. This
makes it possible to represent allowable strings
of words using a directed graph or syntax tree structure.
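Such a directed graph can be sketched as a table of permitted successors. The vocabulary and sentence structure below are a made-up airline-booking fragment, and the markers "<start>" and "<end>" are hypothetical conventions for this example only.

```python
# Each word maps to the words allowed to follow it.
GRAMMAR = {
    "<start>": ["book", "cancel"],
    "book": ["a"],
    "cancel": ["my"],
    "a": ["flight"],
    "my": ["flight"],
    "flight": ["to"],
    "to": ["boston", "denver"],
    "boston": ["<end>"],
    "denver": ["<end>"],
}

def allowed_next(word):
    """Words the recognizer needs to consider after the given word."""
    return GRAMMAR.get(word, [])

def is_allowed(words):
    """True if the word string follows a path from <start> to <end>."""
    state = "<start>"
    for word in words:
        if word not in allowed_next(state):
            return False
        state = word
    return "<end>" in allowed_next(state)

print(allowed_next("to"))
print(is_allowed("book a flight to boston".split()))
```

During recognition only the templates returned by allowed_next need to be matched at each position, which reduces both computation and the chance of confusion.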
Speech
recognition with hidden Markov modeling
Hidden
Markov [A. A. Markov (1856-1922)] modeling (HMM)
is a probabilistic pattern-matching technique that
models a time-sequence of speech patterns as the output
of a random process. A way to envision this is to
assume that the occurrence of a given symbol depends
on some number m of immediately preceding symbols.
Thus the information source can be considered to produce
an mth-order Markov chain and is called an
mth-order Markov source. For an mth-order Markov
source, the m symbols preceding a given symbol position define the
state sj of the source at that symbol position.
It is best to illustrate this with the following example.
Example of a hidden Markov Model
Each
of the six circles represents a state of the model
at a discrete instant in time t, corresponding
to the frame time. The model is in one of these states
and outputs a certain speech pattern or observation.
At time instant t + 1, the model moves to a
new state, or stays in the same state, and emits another
pattern. This process is repeated until the complete
sequence of patterns has been produced. Whether it
stays in the same state or moves to another state
is determined by probabilities, {aij}, associated
with each transition where {aij} denotes the
probability of moving from state i at time
t to j at time t + 1. Note that
in any state the sum of the probabilities of staying
in that state or moving to another state is 1.0. In
any state, the pattern produced by the model is
drawn from a finite set of M patterns, obtained using
the technique of vector quantization (described above).
The choice of pattern is also governed by a set of probabilities {bjk}
that denote the probability of producing pattern k
when the model is in state j. The starting state
of each model is also uncertain and this is represented
by a probability {pj} which denotes the probability
of the model being in state j at time t
= 0. The sum of the p probabilities across all
the states should equal 1, e.g. p = (0.5, 0.5,
0, 0, 0, 0). For the above example the starting state
is either the first or second with an equal probability
of each.
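The three sets of probabilities can be made concrete with a small sketch. To keep it short it assumes a three-state left-to-right model with M = 2 code-book patterns rather than the six-state model of the example; all numerical values are invented, and the names A, B, PI, is_stochastic and generate are hypothetical.

```python
import random

# A[i][j]: probability of moving from state i at time t to state j at t + 1.
A = [[0.6, 0.4, 0.0],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
# B[j][k]: probability of producing pattern k while in state j.
B = [[0.8, 0.2],
     [0.3, 0.7],
     [0.5, 0.5]]
# PI[j]: probability of being in state j at time t = 0.
PI = [0.5, 0.5, 0.0]

def is_stochastic(rows, tol=1e-9):
    """Check that every row of probabilities sums to 1."""
    return all(abs(sum(row) - 1.0) < tol for row in rows)

def generate(length, rng):
    """Let the model emit a sequence of pattern labels."""
    state = rng.choices(range(len(PI)), weights=PI)[0]
    observations = []
    for _ in range(length):
        pattern = rng.choices(range(len(B[state])), weights=B[state])[0]
        observations.append(pattern)
        state = rng.choices(range(len(A)), weights=A[state])[0]
    return observations

print(is_stochastic(A), is_stochastic(B), is_stochastic([PI]))
print(generate(10, random.Random(0)))
```

Only the emitted patterns are visible to an observer; the state sequence that produced them is hidden, which is where the model gets its name.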

Speech
is a sequence of different sounds, produced by the
speech articulators taking up a sequence of different
positions. With the articulatory positions corresponding
to static sounds as states, speech may be thought
of as the result of a sequence of articulatory states
of different and varying duration. Hence the transitions
between states can be represented by probabilities,
{aij}, and the overall Markov chain represents
the temporal structure of the word. The acoustic patterns
or observations produced in each state correspond
to the sound being articulated at that time. Because
of variations in the shape of the vocal apparatus,
pronunciation etc., the production of these patterns
may also be represented by probabilistic functions
{bjk}.
The
two probability functions are iteratively adjusted
to maximize the likelihood that the training sequence
of patterns could be produced by that model. Large
amounts of training data are generated to produce
good word models. Several repetitions of each word
spoken by several speakers are required for speaker-independent
systems. The speech recognition phase involves computing
the likelihood of generating the unknown input pattern
with each word model and selecting the word model
that gives the greatest likelihood as the recognized
word. This is known as maximum likelihood classification.
The amount of computation involved in recognition
is substantially less than that in training.
The
recognition accuracy of hidden Markov modeling
is generally better than that of an equivalent system based
on dynamic time-warping. Its data storage and computation
requirements are approximately an order of magnitude
less than those of DTW. Speaker variability is easier to capture and
model with HMM, although it requires
substantial training computation. HMM can be applied
to sub-word units, such as syllables, demi-syllables,
phones, diphones and phonemes, and has the potential
for implementing large-vocabulary, speaker-independent
systems.
A
complete isolated-word recognition system based on
hidden Markov modeling is illustrated in what follows.
HMM-based isolated speech-word recognition system
The
word to be recognized is suitably isolated (end points
located), split into a time sequence of T frames
and analyzed using some speech analysis procedure,
such as filter bank, fast Fourier transform (FFT),
linear predictive analysis (LPA) etc. This produces
a sequence of observations Ot, t = 1,
2, . . . ,T, which are vector-quantized using
a code book containing a representative set of M
speech patterns Pk, k = 1, 2, . . .
, M. Then the likelihood of producing the unknown
input word pattern with each of the W word models
is computed. The input word is recognized as that
model which produces the greatest likelihood. One
might consider all possible state sequences that could
have produced the observation sequence and then determine
that sequence which gives the highest probability.
However, this is
unrealistic because of the very large number of sequences
involved. Fortunately there are two recursive procedures
which reduce the amount of computation to manageable
proportions. They are the forward-backward procedure used in the Baum-Welch
algorithm and the Viterbi algorithm.
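The saving offered by the recursive procedures can be illustrated with a sketch. It shows the forward recursion (the forward half of the forward-backward procedure on which Baum-Welch training relies) and the Viterbi recursion, and compares both against the exhaustive enumeration of state sequences described above. The two-state model and the observation sequence are invented; in practice log probabilities would be used to avoid underflow.

```python
from itertools import product

def forward_likelihood(obs, pi, a, b):
    """Likelihood of obs summed over all state sequences (forward recursion)."""
    n = len(pi)
    alpha = [pi[j] * b[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(n)) * b[j][o]
                 for j in range(n)]
    return sum(alpha)

def viterbi(obs, pi, a, b):
    """Probability and states of the single most likely state sequence."""
    n = len(pi)
    delta = [pi[j] * b[j][obs[0]] for j in range(n)]
    back = []
    for o in obs[1:]:
        prev = delta
        # Best predecessor of each state j at this time step.
        pointers = [max(range(n), key=lambda i: prev[i] * a[i][j]) for j in range(n)]
        delta = [prev[pointers[j]] * a[pointers[j]][j] * b[j][o] for j in range(n)]
        back.append(pointers)
    state = max(range(n), key=lambda j: delta[j])
    best = delta[state]
    path = [state]
    for pointers in reversed(back):   # trace the best path backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return best, path

def brute_force(obs, pi, a, b):
    """Enumerate every state sequence; feasible only for tiny examples."""
    total, best, best_path = 0.0, 0.0, None
    for path in product(range(len(pi)), repeat=len(obs)):
        p = pi[path[0]] * b[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= a[path[t - 1]][path[t]] * b[path[t]][obs[t]]
        total += p
        if p > best:
            best, best_path = p, list(path)
    return total, best, best_path

PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
OBS = [0, 0, 1]   # vector-quantized observation labels
print(forward_likelihood(OBS, PI, A, B), viterbi(OBS, PI, A, B))
```

For N states and T frames the enumeration visits N to the power T sequences, whereas each recursion needs on the order of N squared times T operations. Recognition would repeat either recursion for each of the W word models and select the model with the greatest score.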
HMM-based
systems require computationally intensive training,
but relatively simple recognition procedures in comparison
to DTW. HMM also requires substantially less memory
storage than DTW.
The
performance of HMM-based systems has been improved
in a number of ways. A mixture of continuous probability
density functions (PDFs), each characterized by its
mean and variance, can be used to replace the discrete
output symbol PDF {bjk}. Such a system is known
as a semi-continuous hidden Markov model
(SCHMM). This gives the added advantage of fewer parameter
values to be estimated from the training data. A continuous
density method (CDHMM) tries to describe the
spectral density of each state for each sub-word unit
in terms of a mixture of distributions. This usually
separates different sub-word units, especially if
the spectral characteristics of the units are different.
It is interesting to note that standard HMMs cannot
model word duration variations. They have trouble with
such words as bad and bat, which differ largely in vowel duration.
The incorporation of duration into HMMs to produce
what are known as hidden semi-Markov models
(HSMMs) provides a slight improvement. Increased recognition
performance with HMM has been obtained through the
use of multiple code books. Instead of producing a
standard output feature symbol at each time-frame,
the model might output multiple symbols, including
dynamic features. HMM has also been extended to model sub-word
units. In particular, the use of triphone units to
model both intra-word and inter-word co-articulation
effects indicates that, by using more detailed sub-word
models, which utilize existing phonological knowledge,
it may be possible to build large-vocabulary continuous
speech recognition systems.
Neural
networks
Neural
networks are pattern-matching devices with processing
architectures based on the neural structure of the
human brain. They consist of simple interconnected
processing units (neurons). The strengths of the interconnections
between units (weights) are variable. Of the many
possible architectures one of the more popular is
shown in the next figure, the multi-layer perceptron
(MLP). The processing units are arranged in layers:
an input layer, some hidden layers, and an output layer.
Weighted interconnections
connect each unit in a given layer to every
unit in an adjacent layer. The network is said to
be feedforward, in that there are no
interconnections between units within a layer and
no connections from outer layers back towards the
input.
Multi-layer Perceptron (MLP)
The
output of each processing unit, Yj, is some non-linear
function of the weighted sum of the outputs from the
previous layer, i.e.
$$Y_j = f\left(W_{j0}X_0 + W_{j1}X_1 + \cdots + W_{j,N-1}X_{N-1} + a_j\right), \qquad j = 0, 1, 2, \ldots, N-1$$
where
$a_j$ is a bias value added to the sum of the
weighted inputs and $f$ represents a non-linear
function. Adjusting the weights $W_{ji}$ may be used to represent
complex non-linear mappings between a pattern vector
presented to the input units and classification patterns
appearing on the output units. In pattern-matching
applications, the network is trained by presenting
a pattern vector at the input layer and by computing
the outputs. The output is then compared with the desired set
of output unit values, which identifies the input
pattern. Identification is normally carried out on the basis
of a particular set of output units being above a
certain value. The error between the actual output
and the desired output is computed and back-propagated
through the network to each unit. The input weights
of each unit are then adjusted to minimize this error.
This process is repeated until the actual output matches
the desired output to within some pre-defined error
limit. Many pairs of training input/output patterns
are presented to the network and the process is repeated
for each pair. Training a neural network requires
large amounts of training data and very long training
times, which can sometimes be several hours. Pattern
recognition involves presenting the unknown pattern
to the input nodes of the trained network and computing
the values of the output nodes which identify the
pattern.
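The forward computation and one form of the weight adjustment can be sketched as follows. This is a minimal illustration under stated assumptions: the non-linear function is taken to be the logistic sigmoid, the error is half the squared difference, there is one hidden layer, and the weights, learning rate and single training pair are invented. Real training presents many input/output pairs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights, biases):
    """Non-linear function of the weighted input sum plus bias, per unit."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + a)
            for row, a in zip(weights, biases)]

def forward(x, net):
    hidden = layer_output(x, net["w_hidden"], net["a_hidden"])
    output = layer_output(hidden, net["w_out"], net["a_out"])
    return hidden, output

def train_step(x, target, net, rate=0.5):
    """One back-propagation update; returns the error before the update."""
    hidden, output = forward(x, net)
    # Error terms at the output units, then propagated back to hidden units.
    d_out = [(y - t) * y * (1.0 - y) for y, t in zip(output, target)]
    d_hid = [h * (1.0 - h) * sum(d_out[k] * net["w_out"][k][j]
                                 for k in range(len(d_out)))
             for j, h in enumerate(hidden)]
    for k, d in enumerate(d_out):
        net["w_out"][k] = [w - rate * d * h for w, h in zip(net["w_out"][k], hidden)]
        net["a_out"][k] -= rate * d
    for j, d in enumerate(d_hid):
        net["w_hidden"][j] = [w - rate * d * xi for w, xi in zip(net["w_hidden"][j], x)]
        net["a_hidden"][j] -= rate * d
    return sum((y - t) ** 2 for y, t in zip(output, target)) / 2.0

# Two inputs, two hidden units, one output unit (arbitrary starting weights).
NET = {"w_hidden": [[0.5, -0.4], [0.3, 0.8]], "a_hidden": [0.0, 0.0],
       "w_out": [[0.7, -0.6]], "a_out": [0.0]}
ERRORS = [train_step([1.0, 0.0], [1.0], NET) for _ in range(200)]
print(ERRORS[0], ERRORS[-1])
```

Recognition with the trained network uses only forward: the unknown pattern is presented and the output units above a chosen value identify it.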
"Stay,
you imperfect speakers, tell me more."
Shakespeare's
Macbeth
References
Atal,
Bishnu S. and Rabiner, Lawrence R. 'Speech Research
Directions', AT&T Technical Journal
Bridle,
J. S. and Dodd, L. (1987), 'Formal Grammars and Markov
Models', Royal Signals and Radar Establishment,
Memo. 4051
Crochiere,
Ronald E. and Flanagan, James L. (1986), 'Speech Processing
and Evolving Technology'. AT&T Technical Journal
Vol. 65, Issue 5.
Fink,
Donald G. and Christiansen, Editors (1989), 'Information,
Communication, Noise, and Interference', Section 4-6,
and 'Mathematics: Formulas, Definitions, and Theorems
used in Electronics Engineering', Section 2-7, Electronics
Engineers' Handbook, McGraw-Hill Book Company
Gorin,
Allen L. and Roe, David B. (1988), 'Parallel level-building
on a tree machine'. ICASSP
Lee,
Chin-Hui, Rabiner, Lawrence R. and Pieraccini, Roberto
(1991) 'Speaker Independent Continuous Speech Recognition
Using Continuous Density Hidden Markov Models'. NATO
ASI Series, Vol. F75.
Owens,
F. J. (1993), 'Signal Processing of Speech', McGraw-Hill,
Inc.
Martin,
E. A. (1987), 'A Two-stage Isolated-word Recognition
System using Discriminant Analysis', MIT Lincoln
Laboratory, Technical Report 773
Zünkler,
Klaus (1991), 'An ISDN speech server based on speaker
independent continuous Hidden Markov Models', NATO
ASI Series, Vol F 75
Of
Historical Significance
Rice,
S. O. (1944-45), 'Mathematical Analysis of Random
Noise.' Bell System Technical Journal, July
1944, Vol. 23, pp. 282-332, January 1945, Vol. 24,
pp. 46-156.
Shannon,
C. E. (1948), 'A Mathematical Theory of Communication',
Bell System Technical Journal, July 1948, Vol.
27, pp. 379-423, 623-656
Wiener,
N. (1949), 'Extrapolation, Interpolation and Smoothing
of Stationary Time Series with Engineering Applications',
MIT Press, Cambridge, Mass.