IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 8, NOVEMBER 2002
TABLE I
CROSS-DOMAIN SPEAKER-INDEPENDENT SPEECH RECOGNITION PERFORMANCE WITH THE UNIFIED LANGUAGE MODEL AND ITS CORRESPONDING DECODER
Inside each CFG, however, we can still derive the word probabilities from the existing n-gram (n-gram probability inheritance) so that they are appropriately normalized. This unified approach can be regarded as a generalized n-gram in which the vocabulary consists of words and structured classes. A structured class can be simple, such as DATE, TIME, and NAME, if there is no need to capture deep structural information, or it can be made more elaborate in order to contain deep structural information. The key advantage of the unified language model is that we can author limited CFGs for each new domain and embed them into the domain-independent n-grams. In short, CFGs capture domain-specific structural information that facilitates language model portability, while the use of n-grams makes the speech decoding system robust against catastrophic errors.
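The generalized n-gram described above can be illustrated with a toy scoring function. Everything below — the bigram table, the class CFGs, and all probabilities — is invented for illustration and is not the authors' implementation; the point is only that class tokens participate in the n-gram like ordinary words, while their internal expansions contribute normalized probabilities of their own.

```python
# Toy unified language model: a bigram over words plus structured class
# tokens, where each class token expands into a small CFG whose internal
# probabilities are normalized to sum to one (all values are assumptions).

# Bigram probabilities over a vocabulary of words and class tokens.
BIGRAM = {
    ("meet", "<DATE>"): 0.30,
    ("<DATE>", "at"): 0.40,
    ("at", "<TIME>"): 0.50,
}

# Toy CFGs: each class token maps to alternative word sequences.
CFG = {
    "<DATE>": {("next", "friday"): 0.6, ("tomorrow",): 0.4},
    "<TIME>": {("two", "pm"): 0.7, ("noon",): 0.3},
}

def sequence_prob(tokens):
    """Score a token sequence; a class token contributes its internal
    CFG probability in addition to the n-gram transition probability."""
    prob = 1.0
    prev = None
    for tok in tokens:
        words = None
        if isinstance(tok, tuple):       # (class token, word sequence) pair
            tok, words = tok
        if prev is not None:
            prob *= BIGRAM.get((prev, tok), 1e-6)  # crude backoff floor
        if words is not None:
            prob *= CFG[tok][words]      # probability inside the CFG
        prev = tok
    return prob

p = sequence_prob(["meet", ("<DATE>", ("next", "friday")), "at",
                   ("<TIME>", ("two", "pm"))])
```

Because the CFG-internal distribution is normalized per class, the sentence probability factors cleanly into n-gram transitions between tokens and expansions within classes.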
The spoken language understanding (SLU) engine used in MiPad is based on a robust chart parser and a plan-based dialog manager. Each semantic object defined and used for SLU is associated either with a real-world entity or with an action that the application takes on such an entity. Each semantic object has slots that are linked with their corresponding CFGs. In contrast to the sophisticated prompting responses of a voice-only conversational interface, the response here is a direct graphic rendering of the semantic object on MiPad's display. After a semantic object is updated, the dialog manager fulfills the plan by executing application logic and error-repair strategies.
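The slot-bearing semantic object described above can be sketched as a small data structure. The class names, slot names, and CFG labels here are hypothetical illustrations, not MiPad's actual API; the sketch shows only the relationship the text describes — an object representing an entity or action, with slots each tied to a CFG.

```python
# Hypothetical sketch of a semantic object with CFG-linked slots
# (names and structure are assumptions, not MiPad's implementation).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Slot:
    name: str
    cfg: str                      # name of the CFG that parses this slot
    value: Optional[str] = None   # filled in as the dialog progresses

@dataclass
class SemanticObject:
    """Represents a real-world entity, or an action taken on one."""
    name: str
    slots: dict = field(default_factory=dict)

    def update(self, slot_name, value):
        self.slots[slot_name].value = value

    def is_complete(self):
        # The dialog manager can fulfill the plan once all slots are filled.
        return all(s.value is not None for s in self.slots.values())

meeting = SemanticObject("ScheduleMeeting", {
    "date": Slot("date", cfg="<DATE>"),
    "time": Slot("time", cfg="<TIME>"),
})
meeting.update("date", "next friday")
meeting.update("time", "two pm")
```

On MiPad, each slot update would be rendered directly as a change to the corresponding field on the display, rather than spoken back to the user.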
One of the critical tasks in SLU is semantic grammar authoring. It is necessary to collect a large amount of real data for the semantic grammar to yield decent coverage. For the spontaneous PIM application, the MiPad SLU engine's slot parsing error rate in the general Tap and Talk field is above 40%. About half of these errors are due to free-form text related to email or meeting subjects.
Most decoders can support only either CFGs or word n-grams; these two ways of representing sentence probabilities have been mutually exclusive. We modified our decoder so that we can embed CFGs in the n-gram search framework to take advantage of the unified language model. An evaluation of the unified language model is shown in Table I. The speech recognition error rate with the unified language model is significantly lower than that with the domain-independent trigram; that is, incorporating the CFGs into the language model drastically improves cross-domain portability. The test data shown in Table I are based on MiPad's PIM conversational speech. The domain-independent trigram language model is based on the Microsoft Dictation trigram models used in Microsoft Speech SDK 4.0. In Table I, we also observe that using the unified language model directly in the decoding stage produces about 10% fewer recognition errors than rescoring with the identical language model after decoding. This demonstrates the importance of using the unified model in the early stage of speech decoding.
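The advantage of applying the language model during search rather than in a later rescoring pass can be illustrated with a toy example. All hypotheses and scores below are invented: the point is that rescoring can only reorder hypotheses that survived the beam, while a model used during decoding also shapes which hypotheses survive at all.

```python
# Toy illustration (invented hypotheses and scores) of why in-search use of
# a language model can beat post-decoding rescoring with the identical model.

def rescore(nbest, lm_score):
    """N-best rescoring: pick the best hypothesis from a fixed list."""
    return max(nbest, key=lm_score)

# Suppose only these acoustically plausible hypotheses survived the beam:
nbest = ["meet you at to pm", "meat you at two pm"]
# The truly correct hypothesis was pruned before rescoring could see it:
correct = "meet you at two pm"

def lm_score(hyp):
    # The unified LM strongly prefers the correct sentence (assumed scores),
    # but rescoring can only choose among the survivors.
    return {"meet you at two pm": -5.0,
            "meet you at to pm": -9.0,
            "meat you at two pm": -8.0}[hyp]

best_after_rescoring = rescore(nbest, lm_score)
# Rescoring cannot recover a hypothesis that was never in the list.
assert best_after_rescoring != correct
```

Used during decoding instead, the same model would have scored the correct path highly enough to keep it in the beam, which is consistent with the roughly 10% relative error reduction the text reports for in-search use.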
After collecting additional MiPad data, we are able to reduce the SLU parsing error by more than 25%, which may still be insufficient for practical use. Fortunately, with the context constraints imposed by the Tap and Talk interface, where slot-specific language and semantic models can be leveraged, most of today's SLU technology limitations can be overcome.
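A minimal sketch of how the Tap and Talk context constraint might be exploited; the field names and the mapping from fields to models are hypothetical, not MiPad's actual design. The sketch also includes a rough check of the reported figures, under the assumption that the stated 25% reduction is relative.

```python
# Hypothetical sketch: tapping a field selects a slot-specific language and
# semantic model, constraining recognition to what that slot can contain.
SLOT_MODELS = {
    "date_field": "<DATE>",   # only date expressions expected here
    "time_field": "<TIME>",
    "attendees":  "<NAME>",
}

def model_for_tap(field_tapped):
    # Fall back to the general unified LM for the free-form field.
    return SLOT_MODELS.get(field_tapped, "unified")

# Rough arithmetic check, assuming the reported 25% reduction is relative:
# an error rate just above 40% would drop to roughly 30%.
residual_error = 0.40 * (1 - 0.25)
```

Constraining each tapped slot to its own small model is what makes the residual general-field error tolerable in practice, since most interactions go through slot-specific fields.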
V. MiPad USER INTERFACE DESIGN AND EVALUATION
As mentioned previously, MiPad does not employ speech synthesis as an output method. This design decision is motivated mainly by two reasons. First, despite significant progress in synthesis technologies, especially in the area of concatenated waveforms, the quality of synthesized speech remains unsatisfactory for large-scale deployment. This is evident in that the majority of commercial telephony speech applications still rely heavily on pre-recorded speech, with synthesized speech playing only a minor role. The most critical drawback of speech output, however, is perhaps not the quality of synthesized speech, which hopefully can be further