FifthGen™ SRS Speech Recognition Software and Development Tools

The FifthGen Speech Recognition Software consists of a set of libraries that include core speech recognition functions as well as auxiliary ones, such as low-level audio capture. It is based on software licensed from Carnegie Mellon University, one of the leading research institutions in the United States. The software includes a customized Speech Data Model that is created for each customer's system during the installation phase. This Data Model has three components:

  1. A Phonetic Dictionary;
  2. One or more Language Models; and,
  3. An Acoustic Model.

Building a Specific Acoustic Model with related Language Models (LMs)

You may wish to compare the acoustic model, which is a computer program, to a robot that needs to be trained. In order to train the model, we need to collect thousands of speech samples of actual customers responding to the question, "What City, please?" FGC analysts then review these samples and accurately record what each person said, using a program designed by FGC engineers for this purpose. Training the model is a complex process that requires a set of computer programs and special skills on the part of the speech systems analyst.

In order to deliver the best performance for each installation, a new customized model must be created using speech samples that are as similar as possible to the utterances the final installed system will have to recognize. Tone, emphasis, range of accents, gender ratios, age ratios, and other factors all contribute to the acoustic model and thus to final performance. This process relies on specialized computer programs and is referred to as "training" the system, since the system is "listening" to hundreds of thousands of speech samples in order to improve its accuracy in recognizing words as they are spoken in the target region.

Creating the Customized Data Model for a Directory Assistance Application

In the initial planning stages for installing a Directory Assistance System, a list of the localities to be included in the locality database, along with other parameters, must be specified. This list is then processed by a dedicated computer program to create the phonetic dictionary, which lists the entries of the locality database, the state database, and the other listings and phrases to be recognized by the system, together with their pronunciations. The language model describes the different words and phrases to be understood and can be customized for each geographic region. One of the distinguishing characteristics of the FifthGen Speech Recognition System is its ability to generate multiple language models within the same system, so that the system can recognize different languages or dialects when called upon.
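
As a rough illustration of the dictionary-building step described above, the sketch below turns a short list of localities into a small phonetic dictionary. The pronunciations, the ARPAbet-style phone symbols, the "ST" abbreviation rule, and the names PRONUNCIATIONS and build_phonetic_dictionary are all assumptions made for this example; they do not represent the actual FGC program or its data format.

```python
# Hypothetical hand-written pronunciations (ARPAbet-style phones), illustration only.
PRONUNCIATIONS = {
    "DALLAS": ["D", "AE", "L", "AH", "S"],
    "SAINT": ["S", "EY", "N", "T"],
    "PAUL": ["P", "AO", "L"],
    "CLOUD": ["K", "L", "AW", "D"],
}


def build_phonetic_dictionary(localities):
    """Map each locality name to a phone sequence, word by word.

    Returns (dictionary, missing). `missing` lists the localities that
    contain a word with no known pronunciation and need analyst review.
    """
    dictionary = {}
    missing = []
    for name in localities:
        words = name.upper().replace(".", "").split()
        # Expand the abbreviation "ST" to "SAINT" (an illustrative rule).
        words = ["SAINT" if word == "ST" else word for word in words]
        if all(word in PRONUNCIATIONS for word in words):
            phones = []
            for word in words:
                phones.extend(PRONUNCIATIONS[word])
            dictionary[name] = phones
        else:
            missing.append(name)
    return dictionary, missing


dictionary, missing = build_phonetic_dictionary(
    ["Dallas", "St. Paul", "St. Cloud", "Duluth"]
)
for name, phones in dictionary.items():
    print(name, " ".join(phones))
print("Needs review:", missing)
```

In this sketch, a locality whose words are not all covered is set aside for review rather than given a guessed pronunciation, which mirrors the analyst-driven maintenance described elsewhere in this document.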

The major goals of building and refining a speech recognition acoustic model to be used in a "Directory Assistance" application for telephone networks are:

  1. To improve the accuracy of the recognizer;
  2. To perform the recognition task as quickly as possible;
  3. To refine the method for filtering out the so-called "noise words" (such as in, um, uh, for, hi, it's, or yes) that may come before the speaker utters a locality name; and,
  4. To reduce the number of false positive recognitions to 5% or less.
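
The noise-word goal in the list above can be pictured with a simplified text-level sketch. It assumes the recognizer has already produced a word sequence and removes only the noise words that precede the locality name. The function name strip_leading_noise is hypothetical, and a production recognizer may well handle such words inside the language model rather than as a post-processing step.

```python
# Noise words named in the goals list; the set itself is taken from the text.
NOISE_WORDS = {"in", "um", "uh", "for", "hi", "it's", "yes"}


def strip_leading_noise(utterance):
    """Drop noise words that come before the locality name."""
    words = utterance.lower().replace(",", " ").split()
    index = 0
    # Advance past every leading noise word, then keep the rest unchanged.
    while index < len(words) and words[index] in NOISE_WORDS:
        index += 1
    return " ".join(words[index:])


cleaned = strip_leading_noise("Um, yes, in Dallas")
print(cleaned)
```

Only leading words are removed, so a noise word that appears after the start of the locality name is left in place.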

Maintaining the Pronunciation Dictionary

The Pronunciation Dictionary contains phonetic descriptions for each item in the locality database.  Any changes to the locality database necessitate changes to the Pronunciation Dictionary.  Maintenance of the Pronunciation Dictionary, which includes additions, deletions, and changes, is performed on a regular basis with the objective of improving overall recognition performance.  For more detailed information about maintaining the Pronunciation Dictionary, please refer to FGC Manual “SRS-PM-002.”

Confidence Estimation (Controlling False Positive Recognitions)

The FifthGen SRS Recognizer supports a feature named Confidence Estimation, which provides an assessment of the probable correctness of a recognition result in the form of a confidence value. For example, a value of 0.9 would indicate a 90% probability that the recognition result is accurate. These probability estimates are based on the accumulated experience of the system in correctly recognizing customer responses to the prompt, "What City, please?" Very simply, if the system has been accurate 95% of the time in recognizing "St. Paul" in large-scale testing, then it will probably be accurate 95% of the time in the future. This also means that the system will probably return a false positive recognition in 5% of the responses (for example, it might return "St. Cloud" instead of "St. Paul").

The confidence estimates allow us to refine the overall model and to set system-wide parameters for false positive recognition of city and state names. For example, based on actual experience, it is possible to set a target of approximately 5% for false positive recognitions.
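
One way to picture how confidence values relate to a false positive target is the sketch below. It assumes a set of test results, each pairing a confidence value with whether the recognition was correct, and it looks for the lowest acceptance threshold at which no more than 5% of accepted results are wrong. The sample data, the definition of the rate, and the names false_positive_rate and pick_threshold are illustrative assumptions, not the method used inside the FifthGen SRS Recognizer.

```python
def false_positive_rate(results, threshold):
    """Fraction of accepted results (confidence >= threshold) that are wrong."""
    accepted = [correct for confidence, correct in results if confidence >= threshold]
    if not accepted:
        return 0.0
    return accepted.count(False) / len(accepted)


def pick_threshold(results, target=0.05):
    """Return the lowest observed confidence that meets the target rate, or None."""
    candidates = sorted({confidence for confidence, _ in results})
    for threshold in candidates:
        if false_positive_rate(results, threshold) <= target:
            return threshold
    return None


# Made-up test results: (confidence value, was the recognition correct?)
results = [
    (0.95, True),
    (0.90, True),
    (0.85, True),
    (0.60, False),
    (0.55, True),
    (0.30, False),
]
threshold = pick_threshold(results)
print("Accept results with confidence >=", threshold)
```

Raising the threshold lowers the share of false positives among accepted results, at the cost of rejecting more responses that would have been recognized correctly.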

The FifthGen Speech Analysis Workstation Software

In order to customize the Acoustic Model and train it to recognize responses in different languages or in different geographic regions, it may be necessary to analyze over 500,000 speech samples during the preparation and installation phase of a new project. Using the SRS Speech Analysis Workstation Software, FGC-trained speech analysts listen to the recorded samples and "tag" each sample with various indicators and with a digital representation of the words that were spoken, e.g., "Dallas."  For more detailed information about using the SRS Speech Analysis Workstation Software, please refer to the FGC Manual “SRS-TM-001.”
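
To make the idea of a tagged sample more concrete, the sketch below shows one possible record for a sample once an analyst has tagged it, along with a simple count of how often each transcript occurs. The field names, the indicator values, the file names, and the class name TaggedSample are hypothetical; the actual format used by the SRS Speech Analysis Workstation Software is documented in the manual cited above.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TaggedSample:
    """One recorded utterance after an analyst has tagged it (hypothetical layout)."""
    audio_file: str                                  # path to the recorded sample
    transcript: str                                  # the words that were spoken
    indicators: list = field(default_factory=list)   # e.g. speaker or noise tags


def transcript_counts(samples):
    """Count how many tagged samples exist for each transcript."""
    return Counter(sample.transcript for sample in samples)


samples = [
    TaggedSample("sample_0001.wav", "Dallas", ["female"]),
    TaggedSample("sample_0002.wav", "Dallas", ["male", "background_noise"]),
    TaggedSample("sample_0003.wav", "St. Paul"),
]
counts = transcript_counts(samples)
print(counts.most_common())
```

A count like this gives a quick view of which localities are well represented in the training samples and which may need more recordings.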
