IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 8, NOVEMBER 2002
Distributed Speech Processing in MiPad’s Multimodal User Interface
Li Deng, Senior Member, IEEE, Kuansan Wang, Alex Acero, Senior Member, IEEE, Hsiao-Wuen Hon, Senior Member, IEEE, Jasha Droppo, Member, IEEE, Constantinos Boulis, Ye-Yi Wang, Member, IEEE, Derek Jacoby, Milind Mahajan, Ciprian Chelba, and Xuedong D. Huang, Fellow, IEEE
Abstract—This paper describes the main components of MiPad (Multimodal Interactive PAD) and especially its distributed speech processing aspects. MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken language interface and wireless-data technologies. It fully integrates continuous speech recognition and spoken language understanding, and provides a novel solution for data entry in PDAs or smart phones, often done by pecking with tiny styluses or typing on minuscule keyboards. Our user study indicates that the throughput of MiPad is significantly superior to that of the existing pen-based PDA interface.
Acoustic modeling and noise robustness in distributed speech recognition are key components in MiPad’s design and implementation. In a typical scenario, the user speaks to the device at a distance so that he or she can see the screen. The built-in microphone thus picks up a lot of background noise, which requires MiPad to be noise robust. For complex tasks, such as dictating e-mails, resource limitations demand the use of a client–server (peer-to-peer) architecture, where the PDA performs primitive feature extraction, feature quantization, and error protection, while the features transmitted to the server undergo further speech feature enhancement, speech decoding, and understanding before a dialog is carried out and actions are rendered. Noise robustness can be achieved at the client, at the server, or both. Various speech processing aspects of this type of distributed computation as related to MiPad’s potential deployment are presented in this paper. Recent user interface study results are also described. Finally, we point out future research directions as related to several key MiPad functionalities.
Index Terms—Client–server computing, distributed speech recognition, error protection, mobile computing, noise robustness, speech-enabled applications, speech feature compression, speech processing systems.
THE GRAPHICAL user interface (GUI) has significantly improved the computer-human interface by using intuitive real-world metaphors. However, it is still far from achieving the ultimate goal of allowing users to interact with computers without much training. In addition, the GUI relies heavily on a graphical display, keyboard, and pointing devices that are not always available. Mobile computers have constraints on physical size and battery power, or present limitations due to
Manuscript received September 25, 2001; revised August 2, 2002. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Harry Printz.
L. Deng, K. Wang, A. Acero, H.-W. Hon, J. Droppo, Y.-Y. Wang, D. Jacoby, M. Mahajan, C. Chelba, and X. D. Huang are with Microsoft Research, Redmond, WA 98052 USA (e-mail: firstname.lastname@example.org).
C. Boulis is with the University of Washington, Seattle, WA 98195 USA.
Digital Object Identifier 10.1109/TSA.2002.804538
hands-busy, eyes-busy scenarios that make a traditional GUI a challenge. Spoken language enabled multimodal interfaces are widely believed to be capable of dramatically enhancing the usability of computers because GUI and speech have complementary strengths. While spoken language has the potential to provide a natural interaction model, the difficulty of resolving the ambiguity of spoken language and the high computational requirements of speech technology have so far prevented it from becoming mainstream in a computer’s user interface. MiPad, Multimodal Interactive PAD, is a prototype of a wireless Personal Digital Assistant (PDA) that enables users to accomplish many common tasks using a multimodal spoken language interface (speech + pen + display). A key research goal for MiPad is to seek out appropriate venues for applying spoken language technologies to address the user interface challenges mentioned above. One of MiPad’s hardware design concepts is shown in Fig. 1.
MiPad intends to alleviate the prevailing problem of pecking with tiny styluses or typing on minuscule keyboards in today’s PDAs by adding speech capability through a built-in microphone. Resembling a PDA more than a telephone, MiPad intentionally avoids speech-only interactions. MiPad is designed to support a variety of tasks such as e-mail, voice-mail, calendar, contact list, notes, web browsing, mobile phone, and document reading and annotation. This collection of functions unifies the various mobile devices into a single, comprehensive communication and productivity tool. The idea is therefore similar to other speech-enabled mobile device efforts reported in , , , . While the entire functionality of MiPad can be accessed by pen alone, we found that a better user experience can be achieved by combining pen and speech inputs. The user can dictate to an input field by holding the pen down on it. Other pointing devices, such as a roller on the side of the device for navigating among the input fields, can also be employed to enable one-handed operation. The speech input method, called Tap & Talk, not only indicates where the recognized text should go but also serves as a push-to-talk button. Tap & Talk narrows down the number of possible utterances for the spoken language processing module. For example, selecting the “To:” field on an e-mail application display indicates that the user is about to enter a name. This dramatically reduces the complexity of spoken language processing and cuts down the speech recognition and understanding errors to the extent that MiPad can be made practically usable despite the current limitations of robust speech recognition and natural language processing technology.
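The way Tap & Talk narrows the space of possible utterances can be illustrated with a minimal Python sketch. It is not MiPad's implementation: the field names, the `FIELD_VOCABULARY` table, the `Hypothesis` type, and the `tap_and_talk` function are all hypothetical, and a real recognizer would more likely constrain its search with a field-specific grammar or language model than filter a finished n-best list as done here.

```python
from dataclasses import dataclass

# Hypothetical per-field vocabularies (an assumption for illustration).
# None marks a field that accepts unconstrained dictation.
FIELD_VOCABULARY = {
    "to": {"alex acero", "kuansan wang", "li deng"},
    "subject": None,
}


@dataclass
class Hypothesis:
    text: str
    score: float  # higher means more likely


def tap_and_talk(field, nbest):
    """Return the best-scoring hypothesis allowed in the tapped field.

    Returns None when no hypothesis fits the field's vocabulary.
    """
    if field not in FIELD_VOCABULARY:
        raise KeyError(f"unknown field: {field}")
    allowed = FIELD_VOCABULARY[field]
    # Keep every hypothesis for open fields, only in-vocabulary ones otherwise.
    candidates = [
        h for h in nbest
        if allowed is None or h.text.lower() in allowed
    ]
    return max(candidates, key=lambda h: h.score, default=None)


nbest = [
    Hypothesis("alex a zero", 0.61),
    Hypothesis("Alex Acero", 0.58),
    Hypothesis("relax a cero", 0.40),
]
best_for_to = tap_and_talk("to", nbest)
best_for_subject = tap_and_talk("subject", nbest)
```

In this toy n-best list, tapping the “To:” field discards the acoustically preferred but out-of-vocabulary string and selects the contact name, whereas the unconstrained field keeps the top-scoring hypothesis. This mirrors the point made above: knowing the target field removes most competing hypotheses before any deeper understanding is attempted.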
1063-6676/02$17.00 © 2002 IEEE