• How we speak
  • Speech sounds
  • Continuous normal speech
  • Electronic capture of speech
  • Extracting acoustic information from speech
  • Speech recognition with dynamic time-warping
  • Speech recognition with hidden Markov modeling
  • Neural networks
  • References

How we speak

We speak by using combinations of our lungs, windpipe, larynx, and oral and nasal tracts. Vocal cords are two folds of skin on the larynx that blow apart and come together as we force air through the slit (glottis) between them. The oral tract is an irregular tube terminated at the front by lips and at the back by the larynx. Its cross-sectional area is varied by muscular control of the lips, tongue, jaw and velum. The nasal tract is a non-uniform acoustic tube of fixed volume and length terminated at the front by nostrils and the rear by a movable flap of skin (velum). The velum controls acoustic coupling between the oral and nasal tracts. Air expelled from the lungs and forced along the trachea and through the glottis can be controlled in different ways to produce ‘voiced’, ‘unvoiced’, and ‘plosive’ sounds.

Voiced sounds such as ‘aah’ or ‘oh’ are produced when the vocal cords are tensed together and vibrate as the air pressure builds up, forcing the glottis open, and then subsides as the air passes through. This vibration has a frequency spectrum rich in harmonics at multiples of the fundamental frequency (pitch). Speakers vary pitch with air pressure in the lungs and tension on the vocal cords.

Unvoiced sounds may be fricative or aspirated. The vocal cords do not vibrate for either. Fricative sounds such as ‘s’ or ‘sh’ are generated by constricting the vocal tract at some point. As air is forced past the constriction, turbulence occurs, causing a random noise. Since the points of constriction tend to occur near the front of the mouth, the resonances of the vocal tract have little effect on the sound being produced. In aspirated sounds, such as the ‘h’ of ‘hello’, turbulent airflow occurs at the glottis as the vocal cords are held slightly apart. Resonances of the vocal tract modulate the spectrum of the random noise, as heard in whispered speech.

Plosive sounds, such as the ‘puh’ sound at the beginning of the word ‘pin’ or the ‘duh’ sound at the beginning of ‘din’, are created when the vocal tract is closed at some point, allowing air pressure to build up before it is suddenly released. This transient excitation may occur with or without vocal cord vibration.

Resonances produced by the tubular shape of the vocal tract are called formants. The vocal tract may assume many different shapes giving rise to different resonant or formant frequency values (sounds). Formant frequencies are constantly changing in continuous speech.

Speech sounds

Speech technologists consider speech (in some languages) as being a sequence of basic sound units called phonemes. Phonemes, however, may not be directly observed in a speech signal. A phoneme may have many different sounds, or allophones, depending on its surrounding phonemes. Different individuals may produce the same string of phonemes to convey the same information, yet they sound different as a result of variations in dialect, accent and physiology. Phonemes correspond directly with ‘articulatory positions and movement’ (articulatory gestures). Articulatory gestures may be either static or dynamic events. Phonemes or articulatory gestures in the English language are classified into vowels, semivowels, liquids, diphthongs, nasals, etc.

Vowel gestures are produced with static articulators. Sound radiates from the mouth with no nasal coupling. Tongue shape remains fairly fixed and each vowel is characterized by the forward/backward and raised/lowered positions of the tongue. Vowels may be classified as front, as in ‘feet, did, red, and mat’, middle, as in ‘heard, cut, the’, or back, as in ‘card, cod, board, wood, and rude’, depending on the forward-backward positioning of the tongue during articulation. Acoustically, each vowel is characterized by the values of the first three or four resonances (formants) of the vocal tract. Semivowels consist of glides and liquids. Glides, as in ‘went’ and ‘ran’, are, like diphthongs, dynamic sounds, except that the articulators move much more rapidly from one static vowel position to another. Liquids, as in ‘let’ and ‘you’, are static gestures with the oral tract partially closed at some point.

Diphthongs are a combination of two vowel sounds. Although similar to vowels, the gesture is created when the articulators move slowly from one static vowel position to another.

Nasals such as found in ‘man, now and sing’ are produced by vocal-cord excitation with the vocal tract totally constricted at some point along the oral passageway and with the velum lowered so that sound is radiated from the nostrils.

Fricatives are sounds produced when turbulent air-flow occurs at a point of constriction in the vocal tract. The point of constriction occurs near the front of the mouth and its exact location characterizes the particular fricative sound produced. Sound is radiated from the lips via the front cavity. The back cavity traps energy at certain frequencies and so introduces anti-resonances into the perceived sound. Unvoiced fricatives, as in ‘fat, thin, sit, and ship’, are produced without vocal-cord vibration whereas in their voiced counterparts, ‘van, this, zoo, and azure’, the vocal cords are vibrating. The phoneme /h/, as in ‘hat’, may be regarded as an unvoiced fricative even though it does not have a voiced counterpart. It is produced with turbulent excitation at the glottis and with the articulatory position of the vowel which succeeds it.

Plosives are generated by forming a complete closure of the vocal tract, allowing air pressure to build up and then suddenly releasing it. The presence or absence of vocal-cord vibration distinguishes the voiced stops, ‘bad, din, and gone’, from their unvoiced counterparts, ‘pin, ton, and kill’. The point of closure in the vocal tract determines which voiced/unvoiced plosive is produced. Plosives are characterized by transient bursts of energy and as a result their properties are highly influenced by the sounds which precede or succeed them.

Affricates are either unvoiced as in the word church or voiced as in the word judge. These sounds are produced when a stop and fricative consonant are both shortened and combined.

Continuous normal speech

Target articulatory positions for many of the gestures will never actually be reached. As a particular gesture is being produced, the next is already being anticipated and this modifies the way in which the articulators move. To a lesser extent, an articulatory gesture may also depend on the preceding one. This phenomenon is known as co-articulation and results in a smearing of sounds into one another. It also gives rise to an effect called allophonic variation, that is, each phoneme may have many different allophones depending on its surrounding phonemes. For example, the ‘l’ sounds in the word ‘leap’ and the word ‘lawn’ are the same phoneme, yet are different allophones. Co-articulation is mainly responsible for making speech sound natural rather than stilted. Superimposed on the basic sequence of gestures are variations in intonation or pitch, rhythm or timing, and intensity or loudness. These variations in an utterance are collectively known as prosody and can greatly influence its meaning.

Intonation and rhythm can be combined to emphasize certain gestures and produce an effect called stress. For example, in the sentence ‘I don't have a black pen but I have a blue one’ stress would be put on the words ‘black’ and ‘blue’. Rhythm on its own can affect the syntax of a word and helps to distinguish nouns from verbs, such as extract vs. extract, and adjectives from verbs, such as separate vs. separate. Intonation can have an effect on meaning too. Pitch very often rises towards the end of a question, the extent and abruptness of the rise depending on whether actual information or confirmation is required in the answer. For example, contrast the utterances ‘It's a nice day, isn't it?’ and ‘This is the right train, isn't it?’. In the first case the pitch does not rise so significantly at the end of the sentence whereas in the second case, where information is required, the pitch rise is much greater and more abrupt. Very often in normal speech the prosodic pattern is used to convey emotional state and attitude of the speaker. Through our ‘tone of voice’ we can indicate sarcasm, anger, joy, sadness, fear etc.

Electronic capture of speech

Speech is captured by a sound-responsive element in a microphone that converts variable sound pressure into equivalent variations of an electrical signal, i.e. current or voltage. This analog signal is then sampled and quantized into a digital bit stream (format). Sampling is the process of obtaining values of the analog signal at discrete instants of time separated by the sample period T, and quantization refers to the conversion of the amplitude at each sampling instant into a discrete binary number with a specified bit-length. This two-stage process is sometimes referred to as pulse code modulation (PCM). The number of samples per second (sampling frequency) ƒS in Hz is equal to the reciprocal of the sample period, i.e. ƒS = 1/T. The sampling theorem, usually credited to Nyquist (Kotelnikov and Shannon also contributed), states that the sampling frequency must be at least twice the highest frequency component present in the signal. If fewer samples are used, a phenomenon known as aliasing occurs, in which a spurious lower-frequency signal may appear upon re-construction.
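The aliasing effect can be illustrated numerically. The following is a minimal sketch and not part of the original text: the function names and the example frequencies (a 7 kHz tone sampled at ƒS = 10 kHz) are assumptions chosen for illustration. It shows that the samples of the 7 kHz tone are indistinguishable from those of a 3 kHz tone.

```python
import math

def alias_frequency(f_hz, fs_hz):
    """Apparent frequency (Hz) of a tone at f_hz after sampling at fs_hz."""
    # Fold the tone frequency back into the band 0 .. fs/2.
    return abs(f_hz - fs_hz * round(f_hz / fs_hz))

def sample_cosine(f_hz, fs_hz, n_samples):
    """Sample cos(2*pi*f*t) at the instants t = n*T, where T = 1/fs."""
    period = 1.0 / fs_hz
    return [math.cos(2.0 * math.pi * f_hz * n * period) for n in range(n_samples)]

fs = 10_000                                   # sampling frequency in Hz (illustrative)
above_nyquist = sample_cosine(7_000, fs, 50)  # tone above fs/2 = 5 kHz
alias = sample_cosine(alias_frequency(7_000, fs), fs, 50)

# The two sample sequences agree to within rounding error.
largest_difference = max(abs(a - b) for a, b in zip(above_nyquist, alias))
```

This is why an anti-aliasing filter has to act before sampling: once the samples are taken, the two tones can no longer be told apart.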

Speech generally contains frequency components with significant energies up to about 10 kHz. Spectra of the majority of speech sounds however only have significant spectral energy content up to about 5 kHz. Only the fricative and aspirated sounds exhibit significant spectral energy above this value. Speech limited to 5 kHz is perfectly intelligible and suffers no significant degradation. Telephone speech is limited to 3.3 kHz and although the transmission of speech through the telephone network degrades its quality significantly, it still remains largely intelligible, since the information-bearing formants are concentrated in the region below 3.3 kHz. The sampling rate for speech will normally lie in the range 6-20 kHz. In all such situations, a pre-sampling or anti-aliasing filter is required in order to remove frequency components above the Nyquist frequency.

Quantization involves converting the amplitude of the sample values into digital form using a finite number of binary digits (bits). The number of bits used affects the speech quality as well as the number of bits per second (bit rate) required to store or transmit the digital signal. Uniform and logarithmic are two types of quantization used.

For uniform quantization the amplitude range is divided into $2^N$ distinct levels, where $N$ is the number of bits used to represent each digitized sample. If two's complement arithmetic is used the maximum and minimum signal values are

$\frac{2^{N-1} - 1}{2^{N-1}} V_{ref}$ and $-V_{ref}$,

where $V_{ref}$ is the DC reference voltage for the quantizer. The quantizer step-size is equal to $2V_{ref}/2^N$. The signal-to-noise ratio (SNR) in decibels* (dB) is equal to $1.76 + 6.02N$.

*The bel, named in honor of Alexander Graham Bell, is defined as the common logarithm of the ratio of two powers, P1 and P2. The decibel is given by 10 log (P2/P1).

Hence each additional bit in the quantizer contributes approximately 6 dB improvement in dynamic range. Speech exhibits a dynamic range of some 50-60 dB and therefore in theory 8- or 9-bit quantization should provide the necessary signal-to-noise ratio for good-quality speech. In practice, the mean-square value of a typical speech signal with peak values approaching the maximum allowable value will be much less than that of a maximum-amplitude sine wave. Consequently the signal-to-noise ratio is reduced. Different speakers produce different loudness levels and even the same speaker may produce constantly fluctuating levels. It is virtually impossible to use the available dynamic range optimally. For these reasons, at least 11- or 12-bit quantization is generally used in high-quality speech-processing applications.
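The uniform-quantizer relations can be tabulated with a few lines of code. This is a minimal sketch under stated assumptions: the function names, the dictionary layout and the rounding rule in `quantize` are illustrative choices, not taken from the original text. Only the number of levels, the signal limits, the step-size and the SNR expression come from the description above.

```python
def uniform_quantizer(n_bits, v_ref=1.0):
    """Parameters of an N-bit two's-complement uniform quantizer."""
    half = 2 ** (n_bits - 1)
    return {
        "levels": 2 ** n_bits,                 # 2^N distinct levels
        "step": 2.0 * v_ref / 2 ** n_bits,     # 2*Vref / 2^N
        "v_max": (half - 1) / half * v_ref,    # largest representable value
        "v_min": -v_ref,                       # smallest representable value
        "snr_db": 1.76 + 6.02 * n_bits,        # SNR expression from the text
    }

def quantize(x, n_bits, v_ref=1.0):
    """Round x to the nearest level, clipping to the two's-complement range."""
    half = 2 ** (n_bits - 1)
    step = 2.0 * v_ref / 2 ** n_bits
    code = max(-half, min(half - 1, round(x / step)))  # assumed rounding rule
    return code * step

# Each extra bit adds about 6 dB, e.g. going from 8 to 12 bits.
gain_db = uniform_quantizer(12)["snr_db"] - uniform_quantizer(8)["snr_db"]
```

Values inside the range are reproduced to within half a step, while values outside it are clipped, which is one reason headroom has to be left for signal peaks.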

In a linear (uniform) quantizer the step-size is fixed and has to be chosen to accommodate the relatively few peaks in the signal. Low-level signals, such as fricatives, will then have a large quantization error. A solution to this problem is to make the quantizer vary its step-size in accordance with the signal amplitude, i.e. the step-size is increased as signal level increases. Non-uniform quantization can provide an improved signal-to-noise ratio with fewer bits. This results in greater efficiency in computer storage and data transmission. The speech signal ultimately has to be re-converted to its original form and an exact inverse of the quantizer's characteristics is required. The process of non-linearly quantizing (compressing) a signal and re-constructing (expanding) it is known as companding, a contraction of compressing-expanding. Companding devices are available on a single chip and are called codecs, a contraction of coder-decoder. They normally use an 8 kHz sampling rate at 8 bits per sample, which corresponds to a bit-rate of 64 kbits/s.

Two widely used companding types are the A-law and the µ-law. The A-law is used largely in Europe and the µ-law in the United States. Both the A-law and the µ-law are based on a logarithmic rather than a linear quantizer characteristic.
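A logarithmic characteristic of this kind can be sketched as follows. The code uses the continuous µ-law curve with µ = 255, which is an assumption here because the text does not state a value, and the function names are illustrative. It is not a bit-exact telephone codec, which uses a segmented approximation of this curve.

```python
import math

MU = 255.0  # assumed compression parameter; not given in the text

def mu_law_compress(x, mu=MU):
    """Compress x in [-1, 1] with a logarithmic characteristic."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Exact inverse of mu_law_compress."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)

def encode_8bit(x):
    """Compress, then quantize uniformly to a signed code in -127 .. 127."""
    x = max(-1.0, min(1.0, x))
    return round(mu_law_compress(x) * 127)

def decode_8bit(code):
    """Expand a code produced by encode_8bit back to an amplitude."""
    return mu_law_expand(code / 127.0)

# A low-level sample survives companding with a small error.
low_level = 0.01
companded_error = abs(decode_8bit(encode_8bit(low_level)) - low_level)
```

The small steps near zero are what give low-level sounds such as fricatives a better signal-to-noise ratio than a uniform 8-bit quantizer would.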

Log-PCM codecs are widely used in digital telephony channels. A sampling rate of 8 kHz is employed and, with 8 bits allocated to each sample, the corresponding transmission or storage rate is 8 kHz x 8 bits = 64 kbits/s.

The major limitation of the quantizers described thus far is that they cannot handle amplitude variations in the speech signal between different speakers, or between voiced and unvoiced segments of a signal from the same speaker. This can be ameliorated by dynamically adjusting the step-size of the quantizer to adapt to the changing signal characteristics. This technique is referred to as adaptive pulse code modulation (APCM). The adaptation can be performed at every sample (instantaneous adaptation) or every few samples, i.e. every 10-20 ms (syllabic adaptation). APCM uses 1 bit per sample less than that of log-PCM for a given signal-to-noise ratio in telephone-quality speech. APCM provides an improvement of approximately 6 dB in signal-to-noise ratio over log-PCM.

Speech signals may show considerable similarity between adjacent samples, especially in regions of voiced speech. Consequently, the difference signal formed by subtracting adjacent samples has a lower variance and dynamic range than the speech signal itself. For a given signal-to-noise ratio it can be encoded using fewer bits. This feature is exploited in differential PCM (DPCM) by estimating the current speech sample, subtracting it from the actual current sample and quantizing the difference signal. The re-constructed signal is obtained by adding the quantized difference signal to the signal estimate. The better the prediction, the more accurate the signal value estimate, i.e. the smaller the quantization error in the difference signal the better the signal-to-noise ratio. DPCM can obtain an improvement of about 5-6 dB in signal-to-noise ratio or a saving of about 1 bit per sample for a given signal-to-noise ratio. The saving is not as large as might be expected, however, since in actual practice the dynamic range of the error signal can approach that of the signal itself.
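The DPCM loop can be sketched in a few lines. This is a minimal illustration under stated assumptions: the predictor is simply the previous re-constructed sample, the quantizer is uniform with a fixed step, and the function names and sample values are hypothetical.

```python
def dpcm_encode(samples, step):
    """Quantize the difference between each sample and its estimate."""
    codes = []
    estimate = 0.0  # assumed predictor: the previous re-constructed sample
    for x in samples:
        code = round((x - estimate) / step)  # quantized difference signal
        codes.append(code)
        estimate += code * step              # same update the decoder performs
    return codes

def dpcm_decode(codes, step):
    """Add each quantized difference to the running signal estimate."""
    output = []
    estimate = 0.0
    for code in codes:
        estimate += code * step
        output.append(estimate)
    return output

samples = [0.0, 0.12, 0.25, 0.33, 0.30, 0.18]  # slowly varying, like voiced speech
step = 0.05
codes = dpcm_encode(samples, step)
decoded = dpcm_decode(codes, step)
```

Because the encoder predicts from its own re-constructed output rather than from the original samples, quantization errors do not accumulate at the decoder. The difference codes also span a smaller range than the samples would if quantized directly with the same step.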

Delta modulation (DM) is a special case of DPCM in which the difference signal is quantized with a single bit per sample. The bit-rate of a DM system is therefore equal to the sampling rate, but the sampling rate must be several times that of a conventional PCM system. If the sampling rate is too low, a condition known as slope overload may occur, where the system is incapable of tracking a fast-changing signal. The overload can be reduced by using either a higher sampling rate or a larger step-size. A larger step-size increases the ‘granular noise’, due to the alternate 1s and 0s, which is particularly noticeable when no input is present. At low signal-to-noise ratios, the bit-rate for DM is slightly lower than that of log-PCM. However, for signal-to-noise ratios approaching or greater than telephone-quality, the required bit-rate is higher.
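Both failure modes can be seen in a small simulation. The following sketch assumes a fixed step-size and a staircase approximation starting at zero; the function names and test signals are illustrative.

```python
def dm_encode(samples, step):
    """One bit per sample: 1 if the input is at or above the staircase, else 0."""
    bits = []
    approx = 0.0
    for x in samples:
        bit = 1 if x >= approx else 0
        approx += step if bit else -step  # move the staircase one step
        bits.append(bit)
    return bits

def dm_decode(bits, step):
    """Rebuild the staircase approximation from the bit stream."""
    output = []
    approx = 0.0
    for bit in bits:
        approx += step if bit else -step
        output.append(approx)
    return output

# Granular noise: a silent input produces alternating 1s and 0s.
idle_bits = dm_encode([0.0] * 6, step=0.1)

# Slope overload: the input rises by 1.0 per sample, the staircase by only 0.1.
ramp = [1.0, 2.0, 3.0, 4.0]
ramp_bits = dm_encode(ramp, step=0.1)
ramp_decoded = dm_decode(ramp_bits, step=0.1)
```

With the ramp input every bit is a 1 and the decoded signal falls further behind at each sample, which is the overload condition described above.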

A substantial improvement can be obtained by adding an adaptive quantizer and/or an adaptive predictor. The resulting systems are all generally referred to as adaptive differential PCM (ADPCM) systems. Synchronization of the transmitter and receiver is achieved by having essentially a replica of the receiver embedded in the transmitter and using only the transmitted difference signal in determining step-size adaptation, in both the quantizer and the inverse quantizer, and in predicting the next signal estimate. Step-size adaptation is typically a mixture of both instantaneous and syllabic adaptation and is achieved by making the adaptation signal dependent on the magnitude of the difference signal as well as its rate of change. In this way, the system can adapt better to both stationary and non-stationary speech. The predictor coefficients may be calculated and updated every 10-20 ms, by solving a set of linear equations. Alternatively, they may be updated every sample, using an optimization algorithm, which uses a hill-climbing technique to minimize the error between the signal estimate and the actual signal.

The complexity of an ADPCM system is directly related to the complexity of the predictor algorithm. Low-to-medium complexity systems use an adaptive quantizer and a fixed predictor and are capable of reproducing speech with slightly less than telephone-quality at 3 or 4 bits/sample at an 8 kHz sampling rate. High-complexity systems, using an adaptive quantizer and an adaptive predictor and operating at 32 kbits/s, can improve the signal-to-noise ratio further and can reproduce speech with better than telephone-quality.

Improvement in delta modulation (DM), described above, can be obtained by adapting the step-size of the quantizer. In a rapidly changing signal condition, the step-size can increase quickly up to its maximum value, reducing the possibility of overload. During silence, the step-size is decreased to its minimum value, which determines the level of granular noise in the idle condition. Adaptive DM (ADM) systems using very simple algorithms are capable of reproducing speech of very good quality at bit-rates in the range of 32 to 48 kbits/s.
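One very simple adaptation rule is sketched below. It is an assumption, not the algorithm of any particular system: the step-size is multiplied by a constant factor when two successive bits agree and divided by it when they differ, within fixed limits. The names and parameter values are illustrative.

```python
def adm_track(samples, step_min=0.1, step_max=5.0, factor=1.5):
    """Return (bits, staircase) for a simple adaptive delta modulator."""
    bits, output = [], []
    approx, step, prev_bit = 0.0, step_min, None
    for x in samples:
        bit = 1 if x >= approx else 0
        if prev_bit is not None:
            if bit == prev_bit:
                step = min(step * factor, step_max)  # same direction: grow
            else:
                step = max(step / factor, step_min)  # reversal: shrink
        approx += step if bit else -step
        bits.append(bit)
        output.append(approx)
        prev_bit = bit
    return bits, output

ramp = [float(n) for n in range(1, 21)]  # fast-rising input, 1.0 per sample
_, adaptive = adm_track(ramp)
_, fixed = adm_track(ramp, step_min=0.1, step_max=0.1)  # limits equal: plain DM
```

Since the adaptation depends only on the transmitted bits, a decoder can repeat the same step-size updates without any side information. On the ramp, the adaptive staircase catches up with the input while the fixed-step version remains in slope overload.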

The main drawback with adaptive delta modulation is that since step-size can vary instantaneously with sudden changes in the input signal, the system can take a long time to recover from transmission errors, causing degradation of the speech quality. A solution to this problem is to make the step-size variation slower than the instantaneous variation in the speech signal. This has the effect of increasing the likelihood of slope-overload distortion, but leads to a reduction in the level of granular noise. The step-size control signal is effectively generated by low-pass filtering the step-size changes indicated by the most recent (3-5) bit outputs of the quantizer. The time-constant of the low-pass filter is typically of the order of 5-10 ms. Systems employing these principles are generally known as continuously variable slope delta modulation (CVSD) systems. CVSD systems can produce log-PCM quality speech at about 40 kbits/s and slightly less than telephone-quality at 32 kbits/s. This does not represent any improvement over ADM. However, CVSD can prove useful for communications-quality speech at bit-rates of 16 kbits/s and below. Added attractions of CVSD in this area of application are its simplicity and its robustness to transmission errors when used on a noisy channel.

Extracting acoustic information from speech

The ‘source-filter model’ is the basis for many of the techniques used to extract acoustic information from the speech signal. In its simplest form it treats the vocal tract as a uniform tube or pipe of length L, with a sound source at one end to represent the vocal cords and open at the other to represent the lips. Such a tube has resonances at the odd multiples ƒ0, 3ƒ0, 5ƒ0, . . . etc., where ƒ0 = c/4L and c is the velocity of sound in air, i.e. approximately 340 meters/second. Resonant frequency values of 500 Hz, 1500 Hz, 2500 Hz, . . . etc. are produced by a vocal tract (pipe) 17 cm long. Resonances produced in the vocal tract are called formants. In this model, the excitation source is assumed to be linearly separable from the transmission characteristics of the vocal tract, which are represented by a quasi-time-invariant filter. The speech waveform itself is then assumed to be the output of this filter in response to the excitation source, which is either a quasi-periodic pulse generator (voiced sounds), a random-noise generator (unvoiced sounds) or a mixture of both (voiced fricatives). ‘Speech analysis’ is primarily the process of estimating the relatively slowly time-varying parameters which specify the filter, from a speech signal that is assumed to be the output of that filter. Other goals include voiced/unvoiced classification and pitch-period estimation for voiced speech.
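The tube resonances are easy to check numerically. This short sketch assumes the quarter-wavelength relation for a tube closed at one end and open at the other, ƒ0 = c/(4L), which reproduces the 500 Hz figure quoted for a 17 cm tract; the function name is illustrative.

```python
def tube_resonances(length_m, count=3, c=340.0):
    """First `count` resonances (Hz) of a uniform tube closed at one end."""
    f0 = c / (4.0 * length_m)  # quarter-wavelength resonance
    return [(2 * k - 1) * f0 for k in range(1, count + 1)]  # odd multiples only

adult_tract = tube_resonances(0.17)    # about 500, 1500 and 2500 Hz
shorter_tract = tube_resonances(0.14)  # a shorter tube resonates higher
```

The second call illustrates why shorter vocal tracts tend to have higher formant frequencies.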

According to the source-filter model there is an overall -6 dB/octave trend as frequency increases. This is a combination of a -12 dB/octave trend due to the voiced excitation source and a +6 dB/octave trend due to radiation from the mouth. This means that, for each doubling in frequency, the signal amplitude, and hence the measured vocal tract response, is reduced by a factor of 2. It is therefore desirable to compensate for the -6 dB/octave roll-off by pre-processing the speech signal to give a +6 dB/octave lift in the appropriate range so that the measured spectrum has a similar dynamic range across the entire frequency band. This is referred to as pre-emphasis. It is unnecessary to apply pre-emphasis in the case of unvoiced speech since there is no spectral trend to remove. However, unvoiced speech is included for simplicity of implementation.
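A common way to obtain a lift of roughly +6 dB/octave is a first-order difference filter, y[n] = x[n] - a x[n-1]. The sketch below assumes this filter form and a coefficient a = 0.95; neither is specified in the text, and the function names are illustrative.

```python
import math

def pre_emphasize(samples, a=0.95):
    """First-order pre-emphasis: y[n] = x[n] - a * x[n-1], with x[-1] taken as 0."""
    output = []
    previous = 0.0
    for x in samples:
        output.append(x - a * previous)
        previous = x
    return output

def pre_emphasis_gain_db(f_hz, fs_hz, a=0.95):
    """Gain of the filter at f_hz, from |1 - a * exp(-j*w)|."""
    w = 2.0 * math.pi * f_hz / fs_hz
    magnitude = math.sqrt(1.0 + a * a - 2.0 * a * math.cos(w))
    return 20.0 * math.log10(magnitude)

# Lift over one octave (1 kHz to 2 kHz) at a 10 kHz sampling rate.
octave_lift_db = pre_emphasis_gain_db(2000, 10000) - pre_emphasis_gain_db(1000, 10000)
```

A constant (zero-frequency) input is almost removed by the filter, while rapidly alternating input is boosted, which is the desired high-frequency emphasis.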

Speech analysis may be performed in either the frequency- or time-domain. The major goal is to estimate the frequency response of the vocal tract. Techniques involving ‘a bank of bandpass filters’ (spectral segmentation), discrete Fourier transformation (DFT) and homomorphic or cepstral processing may be used. Time-domain measurements such as autocorrelation, zero-crossing rate and signal energy can also be used to extract limited but useful information about the speech signal. Since the rate at which the speech spectrum changes is limited by physiological constraints of the lips, tongue, jaw, etc., most speech analysis systems operate on a time-varying basis, using short segments of speech selected at uniformly spaced time intervals of 10 to 30 ms duration.

Filter banks containing from four to one hundred filters have been used in systems to cover the typical speech frequency band of 0-5 kHz. Very often the spacing and bandwidth of the analyzing filters progressively increase with frequency in order to mimic the decreasing spectral resolution of the human ear.

The Fourier transform provides a mathematical basis for determining the frequency spectrum of a continuous time-domain signal. The discrete Fourier transform (DFT) is an adaptation of this technique. The speech signal is segmented (windowed) as in the ‘banks of filters’ approach. However, the windows used in the DFT approach may be overlapped. The duration of the analyzing window must span a few pitch-periods of the speech signal in order to obtain good frequency resolution. There is considerable computation required with the DFT approach. N samples of signal involve $N^2$ calculations, each requiring a complex multiplication and addition. The fast Fourier transform (FFT) exploits the inherent redundancy in the DFT to reduce the number of calculations to $N \log_2(N)$, while achieving identical results.

For voiced speech, the spanning of a few pitch-periods results in a frequency spectrum in which the discrete line spectrum of the periodic excitation is multiplied by the vocal tract spectral envelope. In order to extract the vocal tract spectral envelope, a technique known as cepstral truncation is used. It removes ‘pitch ripple’ from high-resolution spectra, leaving only the vocal tract transfer function information. It trades off temporal resolution for increased spectral resolution. If the pitch-period is to be estimated from the cepstrum, the analysis-window should span at least four pitch-periods. Consequently, if both pitch-period and spectral-envelope information are to be computed using this method, two analyzing windows are normally employed. Using an analyzing-window that spans a few pitch-periods results in spectral properties and pitch-periods being ‘averaged’. The averaging effect is small and insignificant relative to changes encountered in normal speech.
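The cost of the direct DFT can be made concrete with a small sketch. The code below evaluates the DFT definition directly, with one complex multiplication and addition per term, and compares the operation counts quoted above. The function names and the test frame are illustrative assumptions, and no FFT is implemented here.

```python
import cmath
import math

def dft(frame):
    """Direct evaluation of the DFT: N output bins, N terms each."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def direct_dft_operations(n):
    """Number of complex multiply-add operations in the direct DFT."""
    return n * n

def fft_operations(n):
    """Order-of-magnitude count for an FFT, N*log2(N), for N a power of two."""
    return n * math.log2(n)

# A frame holding exactly four cycles of a cosine puts its energy in bin 4.
frame = [math.cos(2.0 * math.pi * 4 * t / 32) for t in range(32)]
magnitudes = [abs(value) for value in dft(frame)]
peak_bin = max(range(17), key=lambda k: magnitudes[k])  # search bins 0 .. N/2
```

For a 512-sample frame the direct method needs 262,144 operations against roughly 4,608 for the FFT, a reduction of more than fifty times.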

The autocorrelation function (ACF) essentially compares a signal with a delayed copy of itself. A short-time autocorrelation function is used by isolating successive segments of the speech signal. The short-time autocorrelation function exhibits peaks at time-shifts corresponding to multiples of the pitch-period. At these points the speech signal is in phase with the delayed version of itself, giving high correlation values. From this it appears that the short-time autocorrelation function should be a powerful technique for estimating the pitch-period of voiced speech. However, there are situations when it is no easier to automatically detect the peaks in the short-time autocorrelation function than in the time waveform.
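The peak-picking idea can be sketched as follows. This is a minimal illustration under stated assumptions: the search range of 60-400 Hz, the frame length, the synthetic two-harmonic test signal and the function names are all chosen for the example, and real speech is considerably less well behaved.

```python
import math

def short_time_acf(frame, lag):
    """Correlate the frame with a copy of itself delayed by `lag` samples."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

def estimate_pitch_hz(frame, fs_hz, f_min=60.0, f_max=400.0):
    """Pick the lag with the largest ACF value inside the assumed pitch range."""
    min_lag = int(fs_hz / f_max)
    max_lag = int(fs_hz / f_min)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: short_time_acf(frame, lag))
    return fs_hz / best_lag

# Synthetic 'voiced' frame: 100 Hz fundamental plus its second harmonic,
# sampled at 8 kHz, so the pitch-period is 80 samples.
fs = 8000
frame = [math.sin(2.0 * math.pi * n / 80) + 0.5 * math.sin(4.0 * math.pi * n / 80)
         for n in range(400)]
pitch = estimate_pitch_hz(frame, fs)
```

Restricting the lag range keeps the search away from the large peak at zero lag and from lags longer than any plausible pitch-period.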

Sample values of speech at instant n can be approximated as a linear combination of the previous p speech samples, i.e.

$a_1 x[n-1] + a_2 x[n-2] + \ldots + a_p x[n-p]$,

where p = 12 is normally sufficient for both voiced and unvoiced speech, and $a_1, a_2, \ldots, a_p$ are predictor coefficients. Prediction errors, however, occur in each sample. By minimizing the mean squared error between the actual speech samples and the linearly-predicted ones, the predictor coefficients can be determined by solving a set of linear equations. A set of predictor coefficients can predict the speech signal reasonably accurately over stationary portions. In order to match the time-varying properties of the speech signal, a new set of predictor coefficients is calculated every 10-30 ms. Two methods for determining the value of the coefficients used are known as the autocorrelation method and the covariance method. The main attraction of linear predictive analysis is that it offers great accuracy and speed of computation. In addition, the theory underlying the method has been the subject of intensive research in recent years and, as a result, is highly developed and well understood. Based on this theory, a large variety and range of applications of linear predictive analysis to speech processing have evolved. Schemes have been devised for estimating all the basic speech parameters from linear predictive analysis, such as spectrum and formant estimation, pitch detection and glottal pulse shape estimation.
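The autocorrelation method can be sketched with the Levinson-Durbin recursion, a standard way of solving the resulting linear equations. The recursion is a choice made for this example, since the text does not say which solver is used, and the function names and the synthetic test frame are illustrative.

```python
def autocorrelation(frame, max_lag):
    """Short-time autocorrelation values R[0] .. R[max_lag]."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for predictor coefficients a_1 .. a_p from autocorrelations r."""
    a = [0.0] * (order + 1)  # a[1..order] hold the coefficients; a[0] is unused
    error = r[0]             # prediction error energy for order 0
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error      # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= (1.0 - k * k)
    return a[1:], error

def predict(frame, coeffs, n):
    """Linear prediction of frame[n] from the previous len(coeffs) samples."""
    return sum(c * frame[n - 1 - idx] for idx, c in enumerate(coeffs))

# Test frame: a decaying exponential, for which x[n] = 0.9 * x[n-1] exactly.
frame = [0.9 ** n for n in range(200)]
coeffs, residual_energy = levinson_durbin(autocorrelation(frame, 2), 2)
```

For this frame the first coefficient comes out close to 0.9 and the second close to zero, so the predictor recovers the rule that generated the signal. In a speech analyzer the same calculation would be repeated on each new frame.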

An averaging process occurs, particularly where the spectrum is changing rapidly, with consequent ‘blurring’ of rapidly varying spectral properties when the speech analysis is performed over a few tens of milliseconds or a few pitch-periods. In this type of pitch-asynchronous analysis, the computed spectrum is the product of the excitation spectrum and the vocal tract spectrum and the two are not readily separated. The effect is to ‘distort’ the vocal tract spectral envelope. These problems can be overcome by a pitch-synchronous analysis where single pitch-periods of the speech signal are identified and analyzed in isolation. To eliminate the effects of glottal pulse shape the analysis should be carried out over closed-glottis regions where the vocal tract is in force-free oscillation. The main drawback with any type of pitch-synchronous analysis is that it is difficult to identify accurately and automatically the start and finish of each pitch-period and the closed-glottis region within each pitch-period.

Zero-crossing rate is a measure of the number of times in a given time interval that the amplitude of the speech signal passes through a value of zero. Because unvoiced speech is noise-like, its zero-crossing rate is greater than that of voiced speech. Zero-crossing rate is an important parameter for voiced/unvoiced classification and for endpoint detection. It is often used as part of front-end processing in automatic speech recognition systems.
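Counting zero crossings takes only a few lines. In this sketch a crossing is counted whenever successive samples differ in sign, with zero treated as positive; that convention, the function names and the two sine-wave test signals are assumptions made for illustration.

```python
import math

def zero_crossing_count(frame):
    """Number of sign changes between successive samples."""
    crossings = 0
    for previous, current in zip(frame, frame[1:]):
        if (previous >= 0) != (current >= 0):
            crossings += 1
    return crossings

def zero_crossing_rate(frame, fs_hz):
    """Zero crossings per second for a frame sampled at fs_hz."""
    return zero_crossing_count(frame) * fs_hz / len(frame)

fs = 8000
# Stand-ins for voiced and unvoiced energy: a 200 Hz and a 3 kHz sine wave.
low = [math.sin(2.0 * math.pi * 200 * n / fs + 0.1) for n in range(400)]
high = [math.sin(2.0 * math.pi * 3000 * n / fs + 0.1) for n in range(400)]
low_rate = zero_crossing_rate(low, fs)
high_rate = zero_crossing_rate(high, fs)
```

A voiced/unvoiced classifier would compare such a rate against a threshold, usually together with the frame energy.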

Detecting when a speech utterance begins and ends is a basic problem in speech processing. This is often referred to as endpoint detection. Endpoint detection is difficult if the speech is uttered in a noisy environment. This is especially true when unvoiced sounds occur at the beginning or end of an utterance.

Many endpoint detection algorithms are based on measurement of the short-time signal energy and zero-crossing rate and attempt to detect as accurately as possible the changes that these quantities undergo at the beginning and end of an utterance. The basic operation of a simple algorithm is as follows. A small sample of background noise is taken during a ‘silence’ interval just prior to commencement of the speech signal. The short-time energy function of the entire utterance is then computed. A speech threshold is determined which takes into account silence energy and peak energy. Initially, endpoints are assumed to occur where the signal energy crosses this threshold. Corrections to these initial estimates are then made by computing zero-crossing rate in the vicinity of endpoints and by comparing it with that of ‘silence’. If detectable changes in zero-crossing rate occur outside the initial thresholds, endpoints are re-designated to points at which the changes take place.
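The energy-threshold stage of such an algorithm can be sketched as below. The sketch covers only the initial endpoint estimate, not the zero-crossing refinement, and the threshold rule (silence energy plus a fixed fraction of the range up to the peak energy) is an assumption, as are the function names and the synthetic signal.

```python
def frame_energies(signal, frame_len):
    """Short-time energy of consecutive, non-overlapping frames."""
    return [sum(s * s for s in signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def detect_endpoints(signal, frame_len, noise_frames=2, fraction=0.1):
    """Return (start, end) sample indices of the utterance, or None."""
    energies = frame_energies(signal, frame_len)
    silence = sum(energies[:noise_frames]) / noise_frames  # background estimate
    peak = max(energies)
    threshold = silence + fraction * (peak - silence)      # assumed rule
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len

# Low-level 'silence', a louder 'utterance', then 'silence' again.
signal = [0.01] * 40 + [1.0] * 60 + [0.01] * 40
endpoints = detect_endpoints(signal, frame_len=10)
```

A fuller implementation would then examine the zero-crossing rate just outside these estimates, so that weak unvoiced sounds at either end are not cut off.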

Speech is analyzed in a sequence of time frames in many parametric speech analysis techniques. Each frame is represented by a set of k numerical values, for example 16 filter bank outputs, 12 LPC coefficients etc. In other words, each frame is represented by a k-dimensional vector in a k-dimensional space. Usually each parameter in the vector is quantized separately using a specific number of bits. This is referred to as scalar quantization which implies that speech frames are expected to occur uniformly throughout vector space. A more appropriate approach is to use vector quantization where vector space is divided into a number of non-uniform regions (bins) with each region being represented by a single vector giving the centroid of the region. The collection of vector centroids is called a code book. Each element of the code book is given a unique label (address).
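Encoding a frame with a code book amounts to a nearest-centroid search. The sketch below assumes squared Euclidean distance as the distortion measure and uses a tiny hypothetical two-dimensional code book; how the code book is trained is outside its scope.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize_vector(frame, codebook):
    """Return the label (index) of the code book entry nearest to the frame."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(frame, codebook[i]))

def centroid(vectors):
    """Mean of a set of vectors: the representative of one region (bin)."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]

# Hypothetical code book with three entries in a 2-dimensional space.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
label = quantize_vector([0.9, 1.2], codebook)
```

Only the label needs to be stored or transmitted for each frame, since the receiver can look up the corresponding centroid in its own copy of the code book.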

Formant frequencies, amplitudes, and bandwidths of the vocal tract characterize individual speech sounds. A number of algorithms have been developed to track formants. One of the simplest searches for spectral peaks in the frequency band 0-3 kHz and attempts to identify three formants from them. If three peaks are found they are assigned to the formants. Tests have shown that with male speech this occurs approximately 85-90 percent of the time. A ‘nearest-neighbor’ criterion is used for other cases, i.e. one, two, or more than three peaks. When one peak is found, its formant label is assigned as the nearest formant neighbor from the previous frame and the two vacant formant slots are filled with their values in the previous frame. When two peaks are found, their formant labels are assigned in accordance with the nearest-neighbor criterion as before and the missing formant assumes its value in the previous frame. In the case of four or more peaks, the three peaks closest to the formant values of the previous frame are identified and labeled and the remainder are discarded. Additional processing is used in some algorithms to resolve closely-spaced formants and place restrictions on the degree to which formants can move between frames. Dynamic programming and neural networks may be used to obtain smooth formant tracks.

Phonetic analysis attempts to derive the phonemic structure of an utterance directly from the speech signal. The sequence of phonemes corresponds to a sequence of articulatory gestures, which have certain well-defined acoustic equivalents. There is, however, no simple one-to-one relationship between a phoneme and its acoustic equivalents. Co-articulation effects cause overlap between phonemic boundaries, and the acoustic correlates of a phoneme are modified by its neighbors. Speakers can distort the acoustic characteristics of speech sounds to such an extent that these sounds no longer possess the characteristics normally associated with them. For example, the first vowel in ‘destroy’ can become unvoiced, so that it no longer resembles a vowel.

Phonetic analysis may be performed by segmenting the speech signal into phoneme-like units and assigning an appropriate label to each unit. This process, known as segmentation and labeling, involves an analysis of the time-varying acoustic features of the speech signal. Features used may include pitch, zero-crossing rate, energy profiles, spectral shape, and formant frequencies and trajectories. Speech is analyzed in 10-20 ms time-frames and the acoustic features are extracted. Then the signal is segmented into phoneme-sized units: the acoustic feature set is analyzed and boundaries are placed where values exceed pre-determined thresholds. Features used for boundary placement include energy profiles, zero-crossing rate, and spectral rate of change. In some systems, the apparent phonetic context is employed to determine the features and thresholds to use, instead of using the same features for each boundary decision. For example, if a time region is thought to be one of the glides /w, r, l/ then the boundary between the glide and the adjacent vowel is best located by examining the formant frequency trajectories. This type of algorithm is based on acoustic-phonetic knowledge which has to be formalized and stored as rules in the system. Unfortunately, there is limited knowledge in this area.
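
A minimal sketch of threshold-based segmentation follows, using only frame energy and zero-crossing rate. The synthetic signal, the 8 kHz sampling rate, the 10 ms frame length, and the threshold values are assumptions chosen to make the example readable; they are not taken from any particular system.

```python
import math

def frame_features(samples, frame_len):
    """Return (energy, zero-crossing rate) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a >= 0) != (b >= 0))
        feats.append((energy, crossings / (frame_len - 1)))
    return feats

def find_boundaries(feats, energy_jump, zcr_jump):
    """Place a boundary wherever a feature changes by more than its threshold."""
    boundaries = []
    for i in range(1, len(feats)):
        d_energy = abs(feats[i][0] - feats[i - 1][0])
        d_zcr = abs(feats[i][1] - feats[i - 1][1])
        if d_energy > energy_jump or d_zcr > zcr_jump:
            boundaries.append(i)
    return boundaries

# Synthetic test signal at an assumed 8 kHz sampling rate:
# silence, then a 100 Hz 'voiced' tone, then a rapidly alternating 'fricative'.
silence = [0.0] * 320
voiced = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(320)]
fricative = [0.3 if n % 2 == 0 else -0.3 for n in range(320)]
signal = silence + voiced + fricative

feats = frame_features(signal, frame_len=80)  # 80 samples = 10 ms frames
boundaries = find_boundaries(feats, energy_jump=0.2, zcr_jump=0.2)
print(boundaries)  # frame indices at which a new segment starts
```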

After boundaries have been located, labels have to be assigned to each phonemic unit. This is done by comparing the features of the unit to be labeled with a set of prototypical features for each phoneme, which are stored in the system. For speaker-independent analysis, these prototypical features should be based on data from a wide variety of speakers. Individual vowels can normally be identified by the steady-state values of the first three formant frequencies extracted from the center of the vowel. Diphthongs are characterized by the values of the formant frequencies in the initial and final vowel targets, as well as the rate of change of the formant trajectories. Nasals and glides always occur adjacent to a vowel and can be characterized by formant transitions into and out of the sound. Fricatives can be detected through the presence or absence of turbulent noise and can often be identified by their overall spectral shape. Plosives are characterized by a period of silence followed by an abrupt increase in signal level at the point of release, followed by a burst of frication noise. Plosive types can be identified by measurements which include the frequency spectrum of the burst, the formant transitions in adjacent vowels, and voice onset time.

Phonetic analysis using segmentation and labeling is extremely error-prone. Certain segments may be incorrectly identified. More than one phoneme may be grouped together in a single segment and a single phoneme may be split into more than one segment. One way of accounting for the possibility of errors is to present the output of the phonetic analyzer in the form of a phonetic lattice. This involves listing a number of candidate phonemes for each time-unit together with a confidence measure. Ambiguities are then resolved by higher levels of linguistic processing, which normally involves matching the possible phonemic sequences against entries in the system lexicon, in order to postulate possible utterances and select the most likely.

Another approach is to propose sequences of phonemes, use a speech synthesis by rule algorithm to generate the acoustic templates, and then compare these with the input templates. Alternative sequences can be evaluated and one that provides the closest match is selected.

Speech recognition with dynamic time-warping

Speech recognition systems based on acoustic pattern matching depend on a technique called dynamic time-warping (DTW) to accommodate time-scale variations. In isolated-word recognition systems the acoustic pattern or template of each word in the vocabulary is stored as a time sequence of features (frames), derived using one of the speech analysis techniques described above. Recognition is performed by comparing the acoustic pattern of the word to be recognized with the stored patterns and choosing the word which it matches best as the recognized word. In a speaker-independent system, an ‘average’ set of patterns is previously stored in the system and no training is required of a speaker. The function of the pattern matching block of the isolated-word speech recognition system in the figure below is to determine the similarity between the input word pattern and the stored word patterns. This involves not only distance computation but also time-alignment of the input and reference patterns, because a word spoken on different occasions, even by the same speaker, will exhibit both local and global variation in its time-scale. The simplest method of time-aligning two patterns of unequal length is to map the time-axis of one onto the time-axis of the other in a linear fashion. This method aligns the beginning and end of the patterns properly, but it does not guarantee that the internal parts of the patterns will be properly aligned. An alignment function that properly matches the internal features of the patterns is required. Much of the computational effort in speech pattern matching is in deriving a near-optimal alignment function. This can be achieved by dynamic time-warping.
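
A minimal sketch of dynamic time-warping for isolated-word matching is given below. It assumes Euclidean distance between frames and the simplest symmetric local path constraint (diagonal, horizontal, or vertical steps); the one-dimensional 'frames' and the template names are hypothetical, and practical systems add slope constraints and an adjustment window.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature frames (an assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(template, utterance):
    """Accumulated distance along the best time-alignment of two patterns."""
    n, m = len(template), len(utterance)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(template[i - 1], utterance[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j],      # template frame repeated
                                 cost[i][j - 1],      # input frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def recognize(templates, utterance):
    """Return the vocabulary word whose template matches the input best."""
    return min(templates, key=lambda w: dtw_distance(templates[w], utterance))

# Hypothetical one-dimensional 'frames' to keep the example readable.
templates = {
    "one": [(1.0,), (2.0,), (3.0,)],
    "two": [(3.0,), (1.0,), (3.0,)],
}
# The same pattern as "one", spoken at half the speed.
utterance = [(1.0,), (1.0,), (2.0,), (2.0,), (3.0,), (3.0,)]
print(recognize(templates, utterance))
```

Because the warping path may repeat frames, the slowed-down input still aligns with the "one" template at zero accumulated distance, which a linear mapping of the time-axis would not guarantee for less regular distortions.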

Basic isolated-word recognition in DTW systems

In speaker-independent speech recognition systems there is no training of the system to recognize a particular speaker, so the stored word patterns must be representative of the collection of speakers expected to use the system. The word templates are derived by first obtaining a large number of sample patterns from a cross-section of talkers of different sex, age-group and dialect, and then clustering these to form a representative pattern for each word. A representative pattern can be created by averaging all the patterns in a word cluster. A dynamic time-warping algorithm would normally be employed to compute a time-alignment function which takes into account the different time-scales. Another approach is to select a representative pattern from the middle of each cluster. Because of the great variability in speech, it is generally impossible to represent each word cluster with a single pattern. Each cluster is therefore sub-divided into sub-clusters and a number of tokens for each word, e.g. up to twelve, are stored in the system. All tokens of each word are matched against the input word.

Choosing which stored pattern most closely matches the input pattern (decision rule) is performed in the last stage in the speech recognition system in the figure above. An algorithm known as the nearest-neighbor (NN) rule is used. In speaker-independent systems a modified version known as the K-nearest-neighbor may be the choice.
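
The two decision rules can be contrasted with a short sketch. The K-nearest-neighbor variant shown here, which averages the K smallest token distances for each word, is one common formulation and is an assumption; the distance values and function names are hypothetical.

```python
def nn_rule(distances):
    """Nearest-neighbor rule: pick the word owning the single closest token.

    `distances` maps each vocabulary word to the list of distances between
    the input pattern and each stored token of that word.
    """
    return min(distances, key=lambda word: min(distances[word]))

def knn_rule(distances, k):
    """K-nearest-neighbor rule (assumed form): average the k smallest
    distances for each word and pick the word with the lowest average."""
    def average_of_k_best(word):
        best = sorted(distances[word])[:k]
        return sum(best) / len(best)
    return min(distances, key=average_of_k_best)

# Hypothetical DTW distances from one input word to three tokens per word.
distances = {
    "yes": [0.9, 1.0, 1.1],
    "no": [0.5, 2.0, 2.2],
}
print(nn_rule(distances))      # decided by one unusually close token
print(knn_rule(distances, 2))  # decided by the two best tokens of each word
```

In this example a single outlying token of "no" wins under the NN rule, while averaging over two tokens favors "yes", which illustrates why the modified rule can be more robust when several tokens per word are stored.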

DTW-based algorithms have been devised to recognize connected groups of words. One of the first was a two-level dynamic programming algorithm consisting of a word-level matching stage followed by a phrase-level matching stage. In the word-level stage, each stored word pattern is matched against all possible regions in the connected-word input pattern. An adjustment window is used to define a region in which each word in the connected-word pattern may start and end. This gives rise to a matrix of partial distances. In the phrase-level stage, dynamic programming is performed on the partial distances to obtain the sequence of words which gives rise to the minimum total distance.

The level-building algorithm is more efficient. In it, dynamic time-warping is applied at a number of levels, up to the maximum number of words anticipated in the connected-word string.

In the one-stage or Bridle algorithm, each word pattern is matched against a first portion of the input pattern using a dynamic time-warping algorithm. The word patterns with the best scores, together with their corresponding ending positions in the input pattern, are recorded. Then each word pattern is matched against the second portion of the input pattern, starting at the points where the previous word matches ended. This process is repeated until the end of the input pattern is reached, and generates what is called a word-decision tree which ‘grows’ as the input is processed. This type of approach is often referred to as a beam search. It avoids considering unlikely interpretations of the input pattern, but keeps the options open in case of ambiguities.

In some specialized recognition applications, such as airline booking, there is some knowledge of the order in which the vocabulary words will be spoken. This makes it possible to represent allowable strings of words using a directed graph or syntax tree structure.
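
A directed graph of allowable word strings can be represented very simply, as the following sketch suggests. The vocabulary, the "<start>" and "<end>" markers, and the function names are hypothetical; a recognizer would use allowed_next to restrict which word templates are matched at each point in the input.

```python
# Hypothetical syntax graph for a tiny booking vocabulary: each word maps
# to the words that are allowed to follow it.
SYNTAX = {
    "<start>": ["book", "cancel"],
    "book": ["flight"],
    "cancel": ["flight"],
    "flight": ["to"],
    "to": ["boston", "denver"],
    "boston": ["<end>"],
    "denver": ["<end>"],
}

def allowed_next(word):
    """Words the recognizer needs to consider after `word`."""
    return SYNTAX.get(word, [])

def is_allowed(words):
    """Check whether a complete word string follows the syntax graph."""
    current = "<start>"
    for word in words:
        if word not in allowed_next(current):
            return False
        current = word
    return "<end>" in allowed_next(current)

print(allowed_next("to"))
print(is_allowed(["book", "flight", "to", "denver"]))
print(is_allowed(["flight", "book", "denver"]))
```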

Speech recognition with hidden Markov modeling

Hidden Markov [A. A. Markov (1856-1922)] modeling (HMM) is a probabilistic pattern-matching technique that models a time-sequence of speech patterns as the output of a random process. A way to envision this is to assume that the occurrence of a given symbol depends on some number m of immediately preceding symbols. Thus the information source can be considered to produce an mth-order Markov chain and is called an mth-order Markov source. For an mth-order Markov source, the m symbols preceding a given symbol position are called the state sj of the source at that symbol position. This is best illustrated with the following example.


Example of a hidden Markov Model

Each of the six circles represents a state of the model at a discrete instant in time t, corresponding to the frame time. The model is in one of these states and outputs a certain speech pattern or observation. At time instant t + 1, the model moves to a new state, or stays in the same state, and emits another pattern. This process is repeated until the complete sequence of patterns has been produced. Whether it stays in the same state or moves to another state is determined by probabilities {aij} associated with each transition, where aij denotes the probability of moving from state i at time t to state j at time t + 1. Note that in any state the sum of the probabilities of staying in that state or moving to another state is 1.0. In any state, the pattern produced by the model is drawn from a finite set of M patterns, obtained using the technique of vector quantization (described above), and is also governed by a set of probabilities {bjk}, where bjk denotes the probability of producing pattern k when the model is in state j. The starting state of each model is also uncertain, and this is represented by a set of probabilities {pj}, where pj denotes the probability of the model being in state j at time t = 0. The sum of the p probabilities across all the states should equal 1, e.g. p = (0.5, 0.5, 0, 0, 0, 0). For the above example the starting state is either the first or the second, with equal probability.
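
To make the three sets of probabilities concrete, the sketch below defines a small model and lets it generate a sequence of patterns. All numerical values are hypothetical, a three-state model is used instead of six only for brevity, and the names pi, A, B and generate are assumptions (pi stands for the starting probabilities p, A for {aij}, B for {bjk}).

```python
import random

# Hypothetical left-to-right model with 3 states and M = 4 output patterns.
pi = [0.5, 0.5, 0.0]            # starting-state probabilities p_j
A = [[0.6, 0.4, 0.0],           # A[i][j] = a_ij, transition probabilities
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
B = [[0.7, 0.2, 0.1, 0.0],      # B[j][k] = b_jk, output probabilities
     [0.1, 0.6, 0.2, 0.1],
     [0.0, 0.1, 0.3, 0.6]]

def rows_sum_to_one(rows, tol=1e-9):
    """Every probability row must sum to 1, as noted in the text."""
    return all(abs(sum(row) - 1.0) < tol for row in rows)

def generate(pi, A, B, length, rng):
    """Let the model emit `length` patterns; return (states, patterns)."""
    state = rng.choices(range(len(pi)), weights=pi)[0]
    states, patterns = [], []
    for _ in range(length):
        states.append(state)
        patterns.append(rng.choices(range(len(B[state])), weights=B[state])[0])
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
    return states, patterns

rng = random.Random(0)  # fixed seed so that runs are repeatable
states, patterns = generate(pi, A, B, length=10, rng=rng)
print(states)    # the hidden state sequence
print(patterns)  # the observed pattern labels
```

Only the pattern labels would be visible to a recognizer; the state sequence that produced them is hidden, which is the origin of the name.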

Speech is a sequence of different sounds, produced by the speech articulators taking up a sequence of different positions. With the articulatory positions corresponding to static sounds as states, speech may be thought of as the result of a sequence of articulatory states of different and varying duration. Hence the transitions between states can be represented by probabilities, {aij}, and the overall Markov chain represents the temporal structure of the word. The acoustic patterns or observations produced in each state correspond to the sound being articulated at that time. Because of variations in the shape of the vocal apparatus, pronunciation etc., the production of these patterns may also be represented by probabilistic functions {bjk}.

During training, the two probability functions {aij} and {bjk} are iteratively adjusted to maximize the likelihood that the training sequence of patterns could be produced by that model. Large amounts of training data are needed to produce good word models. Several repetitions of each word spoken by several speakers are required for speaker-independent systems. The speech recognition phase involves computing the likelihood of generating the unknown input pattern with each word model and selecting the word model that gives the greatest likelihood as the recognized word. This is known as maximum likelihood classification. The amount of computation involved in recognition is substantially less than that in training.

The recognition accuracy of hidden Markov modeling is better than that of an equivalent system based on dynamic time-warping. Its data storage and computation requirements are approximately an order of magnitude less than those of DTW. It is easier to capture and model speaker variability with HMM, although it requires substantial training computation. HMM can be applied to sub-word units, such as syllables, demi-syllables, phones, diphones and phonemes, and has the potential for implementing large-vocabulary, speaker-independent systems.

A complete isolated-word recognition system based on hidden Markov modeling is illustrated in what follows.

HMM-based isolated speech-word recognition system

The word to be recognized is suitably isolated (end points located), split into a time sequence of T frames and analyzed using some speech analysis procedure, such as filter bank, fast Fourier transform (FFT), linear predictive analysis (LPA) etc. This produces a sequence of observations Ot, t = 1, 2, . . . ,T, which are vector-quantized using a code book containing a representative set of M speech patterns Pk, k = 1, 2, . . . , M. Then the likelihood of producing the unknown input word pattern with each of the W word models is computed. The input word is recognized as that model which produces the greatest likelihood. One might consider all possible state sequences that could have produced the observation sequence and then determine that sequence which gives the highest probability. However, this is unrealistic because of the very large number of sequences involved. Fortunately there are two recursive procedures which reduce the amount of computation to manageable proportions. They are the Baum-Welch algorithm and the Viterbi algorithm.
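
The recursive likelihood computation can be sketched as follows. The forward recursion shown is the likelihood step of the Baum-Welch (forward-backward) procedure, and the Viterbi recursion finds the single most probable state sequence. The two word models and their parameter values are hypothetical, and the names pi, A, B follow the usual convention for the starting, transition and output probabilities.

```python
def forward_likelihood(pi, A, B, obs):
    """Probability that the model produces `obs`, summed over all state
    sequences, computed with the forward recursion."""
    n = len(pi)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

def viterbi(pi, A, B, obs):
    """Probability and states of the single most likely state sequence."""
    n = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(n)]
    back_pointers = []
    for o in obs[1:]:
        prev = delta
        ptr = [max(range(n), key=lambda i: prev[i] * A[i][j]) for j in range(n)]
        delta = [prev[ptr[j]] * A[ptr[j]][j] * B[j][o] for j in range(n)]
        back_pointers.append(ptr)
    last = max(range(n), key=lambda j: delta[j])
    path = [last]
    for ptr in reversed(back_pointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return delta[last], path

# Two hypothetical 2-state word models sharing a code book of M = 2 patterns.
A = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "yes": ([1.0, 0.0], A, [[0.9, 0.1], [0.2, 0.8]]),
    "no":  ([1.0, 0.0], A, [[0.1, 0.9], [0.8, 0.2]]),
}
obs = [0, 1]  # vector-quantized observation sequence O_1, O_2

likelihoods = {w: forward_likelihood(*m, obs) for w, m in models.items()}
recognized = max(likelihoods, key=likelihoods.get)
best_prob, best_path = viterbi(*models["yes"], obs)
print(likelihoods, recognized, best_path)
```

Both recursions need on the order of N x N x T operations for N states and T frames, rather than an enumeration of every possible state sequence.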

HMM-based systems require computationally intensive training, but relatively simple recognition procedures in comparison to DTW. HMM also requires substantially less memory storage than DTW.

The performance of HMM-based systems has been improved in a number of ways. A mixture of continuous probability density functions (PDFs), each characterized by its mean and variance, can be used to replace the discrete output symbol PDF {bjk}. Such a system is known as a semi-continuous hidden Markov model (SCHMM). This gives the added advantage of fewer parameter values to be estimated from the training data. A continuous density method (CDHMM) tries to describe the spectral density of each state for each sub-word unit in terms of a mixture of distributions. This usually separates different sub-word units, especially if the spectral characteristics of the units are different. It is interesting to note that standard HMMs cannot model word duration variations. They have trouble with such words as ‘bad’ and ‘bat’. The incorporation of duration into HMMs to produce what are known as hidden semi-Markov models (HSMMs) provides a slight improvement. Increased recognition performance with HMM has also been obtained through the use of multiple code books. Instead of producing a single output feature symbol at each time-frame, the model might output multiple symbols, including dynamic features. HMM has also been extended to model sub-word units. In particular, the use of triphone units to model both intra-word and inter-word co-articulation effects indicates that, by using more detailed sub-word models which utilize existing phonological knowledge, it may be possible to build large-vocabulary continuous speech recognition systems.

Neural networks

Neural networks are pattern-matching devices with processing architectures based on the neural structure of the human brain. They consist of simple interconnected processing units (neurons). The strengths of the interconnections between units (weights) are variable. Of the many possible architectures, one of the more popular is the multi-layer perceptron (MLP), shown in the next figure.

Multi-layer Perceptron (MLP)

The processing units are arranged in layers: an input layer, one or more hidden layers, and an output layer. Weighted interconnections connect each unit in a given layer to every unit in the adjacent layer. The network is said to be feedforward, in that there are no interconnections between units within a layer and no connections from later layers back towards the input.

The output of each processing unit, Yj, is some non-linear function of the weighted sum of the outputs from the previous layer, i.e.

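Yj = ƒ( Wj0X0 + Wj1X1 + . . . + Wj,N-1XN-1 + aj ),

where Xi, i = 0, 1, 2, . . . , N-1, are the N outputs of the previous layer and Wji is the weight on the connection from unit i to unit j,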

where aj is a bias value added to the sum of the weighted inputs and ƒ represents a non-linear function. By adjusting the weights Wji, the network can be made to represent complex non-linear mappings between a pattern vector presented to the input units and classification patterns appearing on the output units. In pattern-matching applications, the network is trained by presenting a pattern vector at the input layer and computing the outputs. The output is then compared with the set of desired output unit values that identify the input pattern. Identification is normally carried out on the basis of a particular set of output units being above a certain value. The error between the actual output and the desired output is computed and back-propagated through the network to each unit. The input weights of each unit are then adjusted to reduce this error. This process is repeated until the actual output matches the desired output to within some pre-defined error limit. Many pairs of training input/output patterns are presented to the network and the process is repeated for each pair. Training a neural network requires large amounts of training data and very long training times, which can sometimes be several hours. Pattern recognition involves presenting the unknown pattern to the input nodes of the trained network and computing the values of the output nodes which identify the pattern.
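
The feedforward computation of a multi-layer perceptron can be sketched as below. The sigmoid is assumed for the non-linear function ƒ, and the weights and biases are hand-picked hypothetical values that make a 2-2-1 network compute the exclusive-OR of its two inputs; they were not obtained by training, and back-propagation itself is not shown.

```python
import math

def sigmoid(x):
    """An assumed choice for the non-linear function f."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights, biases):
    """Y_j = f(sum_i W_ji * X_i + a_j) for every unit j in one layer."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + a)
            for row, a in zip(weights, biases)]

def mlp_forward(inputs, layers):
    """Feed a pattern vector forward through each layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_output(activations, weights, biases)
    return activations

# Hypothetical hand-set parameters: hidden unit 0 acts like OR, hidden
# unit 1 like NAND, and the output unit like AND, which together give XOR.
layers = [
    ([[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0]),  # hidden layer
    ([[20.0, 20.0]], [-30.0]),                        # output layer
]

for pattern in ([0, 0], [0, 1], [1, 0], [1, 1]):
    output = mlp_forward(pattern, layers)[0]
    # The pattern is identified by the output unit being above a set value.
    print(pattern, round(output, 3), output > 0.5)
```

A single layer of such units cannot represent this mapping, which is one reason the hidden layer matters.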

"Stay, you imperfect speakers, tell me more."

Shakespeare's Macbeth

References


Atal, Bishnu S. and Rabiner, Lawrence R. 'Speech Research Directions', AT&T Technical Journal

Bridle, J. S. and Dodd, L. (1987), 'Formal Grammars and Markov Models', Royal Signals and Radar Establishment, Memo. 4051

Crochiere, Ronald E. and Flanagan, James L. (1986), 'Speech Processing and Evolving Technology'. AT&T Technical Journal Vol. 65, Issue 5.

Fink, Donald G. and Christiansen, Editors (1989), 'Information, Communication, Noise, and Interference', Section 4-6, and 'Mathematics: Formulas, Definitions, and Theorems used in Electronics Engineering', Section 2-7, Electronics Engineers' Handbook, McGraw-Hill Book Company

Gorin, Allen L. and Roe, David B. (1988), 'Parallel level-building on a tree machine'. ICASSP

Lee, Chin-Hui, Rabiner, Lawrence R. and Pieraccini, Roberto (1991), 'Speaker Independent Continuous Speech Recognition Using Continuous Density Hidden Markov Models'. NATO ASI Series, Vol. F75.

Owens, F. J. (1993), 'Signal Processing of Speech', McGraw-Hill, Inc.

Martin, E. A. (1987), 'A Two-stage Isolated-word Recognition System using Discriminant Analysis', MIT Lincoln Laboratory, Technical Report 773

Zünkler, Klaus (1991), 'An ISDN speech server based on speaker independent continuous Hidden Markov Models', NATO ASI Series, Vol. F75

Of Historical Significance

Rice, S. O. (1944-45), 'Mathematical Analysis of Random Noise.' Bell System Technical Journal, July 1944, Vol. 23, pp. 282-332, January 1945, Vol. 24, pp. 46-156.

Shannon, C. E. (1948), 'A Mathematical Theory of Communication', Bell System Technical Journal, July 1948, Vol. 27, pp. 379-423, 623-656

Wiener, N. (1949), 'Extrapolation, Interpolation and Smoothing of Stationary Time Series with Engineering Applications', MIT Press, Cambridge, Mass.

Copyright ©2000-2007 Fifth Generation Computer Corporation.
All rights reserved.