DENG et al.: DISTRIBUTED SPEECH PROCESSING IN MiPad’S MULTIMODAL USER INTERFACE
the SPLICE technology, and some evaluation results. We then outlined some recent work on speech feature compression and error protection necessary to enable distributed speech recognition in MiPad. Various other MiPad system components, including the user interface, HMM-based speech modeling, the unified language model, and spoken language understanding, are also discussed. The remaining MiPad system components, i.e., dialog management as well as its interaction with the spoken language understanding component, are not included in this paper; readers are referred to  and  for a detailed discussion of this topic.
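As background on the SPLICE technique referenced above, the core enhancement step can be sketched as a posterior-weighted bias correction: a Gaussian mixture model over the noisy cepstra selects, for each frame, a blend of per-component correction vectors trained from stereo (clean, noisy) data. The sketch below is illustrative only; all function and variable names are ours, not from the MiPad implementation.

```python
import numpy as np

def splice_enhance(y, w, mu, var, r):
    """Illustrative SPLICE correction for one noisy cepstral frame.

    y   : (D,)    noisy cepstral vector
    w   : (S,)    mixture weights of the noisy-speech GMM
    mu  : (S, D)  component means
    var : (S, D)  diagonal component variances
    r   : (S, D)  per-component correction vectors (from stereo data)
    """
    # log N(y; mu_s, var_s) per component, up to an additive constant
    log_lik = -0.5 * (np.sum(np.log(var), axis=1)
                      + np.sum((y - mu) ** 2 / var, axis=1))
    log_post = np.log(w) + log_lik
    log_post -= log_post.max()          # numerical stability
    post = np.exp(log_post)
    post /= post.sum()                  # p(s | y)
    # x_hat = y + sum_s p(s | y) r_s : posterior-weighted bias correction
    return y + post @ r

# Toy usage with random parameters
rng = np.random.default_rng(0)
S, D = 4, 13
y = rng.normal(size=D)
w = np.full(S, 1.0 / S)
mu = rng.normal(size=(S, D))
var = np.ones((S, D))
r = rng.normal(scale=0.1, size=(S, D))
x_hat = splice_enhance(y, w, mu, var, r)
assert x_hat.shape == (D,)
```

Because the correction vectors are learned directly from stereo data rather than from an explicit distortion model, this is the nonparametric half of the parametric/nonparametric combination discussed in the future-work section below.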
2) Is It Easier to Get The Job Done?: Fifteen out of the 16 participants in the evaluation stated that they preferred using the Tap & Talk interface for creating new appointments, and all 16 said they preferred it for writing longer emails. The preference data are consistent with the task completion times. Error correction for the Tap & Talk interface remains one of the most unsatisfactory features. On a seven-point Likert scale, with one being “disagree” and seven being “agree,” participants responded with a 4.75 that it was easy to recover from mistakes.
Fig. 8 summarizes the quantitative user study results on task completion times for email transcription and for making appointments, comparing the pen-only interface with the Tap & Talk interface. The standard deviation is shown above the bar of each performed task.
Future development of spoken language systems in a mobile environment beyond MiPad will require us and the rest of the research community to face much greater challenges than we have encountered during the development of MiPad. One promising future direction for noise robustness, which we will pursue, is the intelligent combination of nonparametric approaches (such as SPLICE) and parametric approaches that take advantage of accurate knowledge of the physical nature of speech distortion. For example, knowledge of the phase relationship between the clean speech and the corrupting noise has been shown to provide better prior information in robust statistical feature extraction than environment models that do not take the phase information into account. A combination of accurate acoustic environment models and knowledge about the speech distortion learned directly from stereo data will enable the recognizer’s front end to combat a wider range of types and levels of speech distortion than our current algorithms can handle.
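To make the phase-relationship point concrete, the commonly used phase-sensitive acoustic distortion model can be written as follows; this is background for the discussion above, not a formula reproduced from this paper.

```latex
% In the power-spectral domain, the noisy-speech magnitude obeys
%   |Y|^2 = |X|^2 + |N|^2 + 2\alpha\,|X||N|,
% where \alpha = \cos\theta is the cosine of the phase angle between
% the clean speech X and the noise N. With y = \log|Y|^2, x = \log|X|^2,
% n = \log|N|^2, this becomes
y = x + \log\!\left( 1 + e^{\,n - x} + 2\alpha\, e^{(n - x)/2} \right)
```

Setting \alpha = 0 recovers the familiar phase-insensitive model y = x + log(1 + e^{n-x}); retaining \alpha yields the sharper prior on the clean speech that the discussion above refers to.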
For future speech recognition technology to be usable in a mobile environment, it is necessary to break away from constrained-vocabulary tasks as well as the relatively constrained speaking style. For example, in order to enable users to freely dictate e-mails, especially to friends and relatives, it may be difficult to constrain the vocabulary size or to enforce a strict dictation-like speaking style. More powerful speech recognition technology may be needed to achieve final success in such applications.
This paper describes work in progress in the development of a consistent human–computer interaction model and corresponding component technologies for multimodal applications. Our current applications comprise mainly PIM functions. Despite this incomplete implementation, we have observed in our preliminary user study that speech and pen have the potential to significantly improve the user experience. Thanks to the multimodal interaction, MiPad also offers a far more compelling user experience than standard voice-only telephony interaction.
Though Moore’s law also tells us that all the processing may be done on the device itself in the future, the success of the current MiPad depends on an always-on wireless connection. With upcoming 3G wireless deployments in sight, the critical challenge for MiPad remains the accuracy and efficiency of our spoken language systems, since MiPad is likely to be used in noisy environments without a close-talking microphone, and the server also needs to support a large number of MiPad clients.
To meet this challenge, much of our recent work has focused on the noise-robustness and transmission efficiency aspects of the MiPad system. In this paper, we first described our new front-end speech processing algorithm development, based on
In the speech understanding and dialog management areas, we expect multimodal integration in mobile environments to play a more dominant role than in the current MiPad. For example, the understanding component must be able to infer users’ intentions by integrating signals from a variety of input media. Cross-modality reference resolution becomes a key issue here. However, the increase in input modalities, together with a larger speech lexicon, will require understanding algorithms that deal more effectively with even higher perplexity. We anticipate that better dialog context management, supplemented with external models of user preference, will prove beneficial in successfully handling such a high-perplexity problem.
[1] A. Acero and R. Stern, “Robust speech recognition by normalization of the acoustic space,” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, Toronto, ON, Canada, 1991.
[2] M. Afify et al., “Evaluating the Aurora connected digit recognition task: A Bell Labs approach,” in Proc. Eurospeech Conf., Aalborg, Denmark, Sept. 2001.
[3] L. Comerford, D. Frank, P. Gopalakrishnan, R. Gopinath, and J. Sedivy, “The IBM personal speech assistant,” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, Salt Lake City, UT, May 2001.