DENG et al.: DISTRIBUTED SPEECH PROCESSING IN MiPad’S MULTIMODAL USER INTERFACE
improved, but with the nonpersistent or volatile nature of speech presentation. The human user must process the speech message and memorize its contents in real time. There is no known user interface design that can elegantly assist the user in cases where the speech waveform cannot be easily heard and understood, or where there is simply too much information to absorb. In contrast, a graphical display can render a large amount of information persistently for consumption at leisure, avoiding the aforementioned problems.
TABLE II COMPLEMENTARY STRENGTHS OF PEN AND SPEECH AS INPUT MODALITIES
MiPad takes advantage of the graphical display in its UI design. The graphical display dramatically simplifies dialog management. For instance, MiPad is able to considerably streamline its confirmation and error repair strategy because all inferred user intentions are confirmed implicitly on the screen. Whenever an error occurs, the user can correct it through whichever modality, GUI or speech, is appropriate and appears more natural to the user. Thanks to display persistency, users are not obligated to correct errors immediately after they occur. The display also allows MiPad to confirm and ask the user many questions in a single turn. Perhaps the most interesting usage of the display, however, is the Tap & Talk interface.
each slot-dependent language and semantic model. In addition, Tap & Talk functions as a user-initiative dialog-state specification. The dialog focus that leads to the language model is entirely determined by the field tapped by the user. As a result, even though a user can navigate freely using the stylus in a pure GUI mode, there is no need for MiPad to include any special mechanism to handle spoken dialog focus and digression.
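The way a tapped field fixes the dialog focus can be illustrated with a minimal Python sketch. The field names, vocabularies, and the helpers `select_grammar` and `uniform_perplexity` are hypothetical and are not taken from MiPad; a uniform distribution over each vocabulary stands in for a real slot-dependent language model, so its perplexity is simply the vocabulary size.

```python
import math

# Hypothetical slot-dependent vocabularies. A real system would attach a
# full language model and semantic model to each field.
SLOT_VOCAB = {
    "attendees": ["alice", "bob", "carol", "dave"],
    "date": ["today", "tomorrow", "monday", "tuesday",
             "wednesday", "thursday", "friday"],
}

# Assumed vocabulary for the general-purpose field: every slot word plus
# a few command words.
GENERAL_VOCAB = sorted(
    {w for words in SLOT_VOCAB.values() for w in words}
    | {"schedule", "meeting", "email", "with", "on"}
)


def select_grammar(tapped_field):
    """Return the vocabulary implied by the tapped field.

    The tap alone determines the dialog focus; unknown fields fall back
    to the general vocabulary.
    """
    return SLOT_VOCAB.get(tapped_field, GENERAL_VOCAB)


def uniform_perplexity(vocab):
    """Perplexity of a uniform unigram model over vocab (equals len(vocab))."""
    p = 1.0 / len(vocab)
    entropy_bits = -sum(p * math.log2(p) for _ in vocab)
    return 2 ** entropy_bits


if __name__ == "__main__":
    for field in ("attendees", "date", "command"):
        vocab = select_grammar(field)
        print(field, len(vocab), round(uniform_perplexity(vocab), 2))
```

In this toy setting, tapping a specific field shrinks the search space from the general vocabulary to a handful of candidates, which is the effect the text attributes to slot-dependent models.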
B. Visual Feedback for Speech Inputs
A. Tap & Talk Interface
Because of MiPad’s small form-factor, the present pen-based methods for getting text into a PDA (Graffiti, Jot, soft keyboard) are potential barriers to broad market acceptance. Speech is generally not as precise as a mouse or a pen for position-related operations. Speech interaction can also be adversely affected by unexpected ambient noise, despite the use of denoising algorithms in MiPad. Moreover, speech interaction can be ambiguous without appropriate context information. Despite these disadvantages, speech communication is not only natural but also provides a powerful complementary modality to enhance the pen-based interface, provided that the strengths of speech can be appropriately leveraged and the technology limitations overcome. Table II lists several cases which show that pen and speech can be complementary and used effectively for handheld devices. The advantage of pen is typically the weakness of speech and vice versa.
Through usability studies, we also observed that users tend to use speech to enter data and pen for corrections and pointing. Three examples in Table III illustrate that MiPad’s Tap & Talk interface can offer a number of benefits. MiPad has a field that is always present on the screen, as illustrated in MiPad’s start page in Fig. 7(a) (the bottom gray window is always on the screen). Tap & Talk is a key feature of MiPad’s user interface design. The user can give commands by tapping the Tap & Talk field and talking to it. Tap & Talk avoids the speech detection problem that is critical in the noisy environments encountered in MiPad’s deployments. The appointment form shown on MiPad’s display is similar to the underlying semantic objects. By tapping on the attendees field in the calendar card shown in Fig. 7(b), for example, the semantic information related to potential attendees is used to constrain both CSR and SLU, leading to a significantly reduced error rate and dramatically improved throughput. This is because the perplexity is much smaller for
Processing latency is a well-recognized issue in user interface design. This is even more so for MiPad, in which distributed speech recognition is employed. In addition to the recognition process itself, the wireless network introduces additional latency that sometimes is not easily controllable. Conventional wisdom for UI design dictates that filling the time with visual feedback not only significantly improves usability, but also prevents users from interfering with an ongoing process that cannot be easily recovered. For these reasons, MiPad adopts visual feedback for speech inputs. In addition, we have designed the visual feedback to help the user avoid a common cause of recognition errors: waveform clipping. As the user speaks, MiPad displays a running graphical volume meter, reflecting the loudness of the recorded speech, right beneath the input field being dictated to. When the utterance is beyond the normal dynamic range, red bars are shown to instruct the user to lower the voice volume. When MiPad detects the end of a user utterance and sends the speech features to the host computer for processing, a progress bar is overlaid on top of the volume meter. Although the underlying speech application program interface (SAPI) can raise an event whenever the recognizer exits a word node on the grammar, we found that channeling this event back to MiPad consumes too much network traffic, which seems to outweigh the benefits of a detailed and precise progress report. As a result, the current implementation employs a best-effort estimate of the recognition and understanding progress, not unlike the progress bar commonly seen in a Web browser. The progress estimate is computed solely on the client side with no network traffic involved. Before the outcome is served back to MiPad, the user can click a cancel button next to the status bar to stop the processing at the host computer.
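The volume meter and clipping warning described above can be sketched as follows. This is a minimal illustration that assumes 16-bit PCM frames; the 60 dB display range, the clipping threshold, the bar count, and the function name `meter_bars` are assumptions made for the example, not details reported for MiPad.

```python
import math

FULL_SCALE = 32767    # largest positive value of a 16-bit PCM sample
CLIP_LEVEL = 0.98     # assumed fraction of full scale treated as clipping
NUM_BARS = 10         # assumed number of bars in the meter
DISPLAY_RANGE_DB = 60.0  # assumed dynamic range shown by the meter


def meter_bars(frame):
    """Map one frame of 16-bit samples to (bars_lit, show_red).

    bars_lit follows the RMS loudness in dB relative to full scale;
    show_red is set when the peak sample reaches the clipping level.
    """
    if not frame:
        return 0, False
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    peak = max(abs(s) for s in frame)
    # Clamp RMS to at least 1 so that silence does not produce log10(0).
    db = 20 * math.log10(max(rms, 1.0) / FULL_SCALE)
    level = min(max((db + DISPLAY_RANGE_DB) / DISPLAY_RANGE_DB, 0.0), 1.0)
    return round(level * NUM_BARS), peak >= CLIP_LEVEL * FULL_SCALE


if __name__ == "__main__":
    print(meter_bars([0] * 160))            # silence
    print(meter_bars([3000, -3000] * 80))   # moderate speech level
    print(meter_bars([32767, -32768] * 80)) # clipped input
```

Separating the loudness display (RMS) from the warning (peak) is a design choice of this sketch: a few saturated samples are enough to distort recognition features even when the average level looks moderate.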
If the status bar vanishes without changing the display, this indicates that the utterance has been rejected either by the recognizer or by the understanding system. MiPad’s error repair strategy is entirely user initiative: the user can decide to try again or do something else.
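A client-side progress estimate of the kind described above, together with the interpretation of the status bar disappearing, might look like the sketch below. The real-time factor, the network delay allowance, the 99% cap, and all class and function names are hypothetical; the text states only that the estimate is computed on the client with no network traffic.

```python
from dataclasses import dataclass


@dataclass
class ProgressEstimator:
    """Best-effort progress estimate computed without contacting the host."""

    utterance_sec: float             # duration of the recorded utterance
    realtime_factor: float = 1.5     # assumed host processing time per second of speech
    network_delay_sec: float = 0.5   # assumed allowance for the wireless round trip

    def expected_total(self) -> float:
        """Expected time from end of utterance until the result arrives."""
        return self.utterance_sec * self.realtime_factor + self.network_delay_sec

    def fraction(self, elapsed_sec: float) -> float:
        """Progress in [0, 0.99]; never reports completion before the host replies."""
        return min(max(elapsed_sec, 0.0) / self.expected_total(), 0.99)


def status_after_bar_vanishes(display_changed: bool, cancelled: bool = False) -> str:
    """Interpret the moment the status bar disappears."""
    if cancelled:
        return "cancelled"
    return "accepted" if display_changed else "rejected"


if __name__ == "__main__":
    estimator = ProgressEstimator(utterance_sec=3.0)
    for t in (0.0, 2.5, 10.0):
        print(t, estimator.fraction(t))
    print(status_after_bar_vanishes(display_changed=False))
```

Capping the estimate below 100% keeps the bar from claiming completion while the host is still working, which matters because the estimate is deliberately decoupled from the recognizer's true state.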