DENG et al.: DISTRIBUTED SPEECH PROCESSING IN MiPad’S MULTIMODAL USER INTERFACE
MiPad’s client–server (peer-to-peer) architecture. The client is based
client–server communication is currently implemented on a wireless LAN.
cent work on front-end speech processing, including noise robustness and source/channel coding that underlie MiPad’s distributed speech recognition capabilities. The acoustic and language models used for the decoding phase of MiPad continuous speech recognition, together with the spoken language understanding component, are presented in Section IV. Finally, MiPad’s user interface and user study results are described in Section V, and a summary is provided in Section VI.
ning the latest version of the algorithm. With this consideration in mind, while designing noise-robust algorithms for MiPad, we strive to make the algorithms front-end agnostic. That is, the algorithms should make no assumptions on the structure and processing of the front end and merely try to undo whatever acoustic corruption has been shown during training. This consideration also favors approaches in the feature rather than the model domain.
II. ROBUSTNESS TO ACOUSTIC ENVIRONMENTS
Immunity to noise and channel distortion is one of the most important design considerations for MiPad. For this device to be acceptable to the general public, it is desirable to remove the need for a close-talking microphone. However, with the convenience of using the built-in microphone, noise robustness becomes a key challenge to maintaining desirable speech recognition and understanding performance. Our recent work on acoustic modeling for MiPad has focused on overcoming this noise-robustness challenge. In this section, we present our most recent results in the framework of distributed speech recognition (DSR) that the MiPad design has adopted.
Here, we describe one particular algorithm that has so far given the best performance on the Aurora2 task and other Microsoft internal tasks. We call the algorithm SPLICE, short for Stereo-based Piecewise Linear Compensation for Environments. In a DSR system, SPLICE may be applied within the front end on the client device, on the server, or on both in collaboration. A server-side implementation certainly has some advantages: computational complexity becomes less of an issue, and continuing improvements can benefit even devices already deployed in the field. Another useful property of SPLICE in the server implementation is that new noise conditions can be added as they are identified by a server. This can make SPLICE quickly adaptable to any new acoustic environment with minimum additional resources.
A. Distributed Speech Recognition Considerations for Algorithm Design
There has been a great deal of interest recently in standardizing DSR applications for a plain phone, PDA, or a smart phone where speech recognition is carried out at a remote server. To overcome bandwidth and infrastructure cost limitations, one possibility is to use a standard codec on the device to transmit the speech to the server, where it is subsequently decompressed and recognized. However, since speech recognizers such as the one in MiPad only need some features of the speech signal (e.g., Mel-cepstrum), bandwidth can be further saved by transmitting only those features. ETSI has been accepting proposals for Aurora, an effort to standardize a DSR front end that addresses the issues surrounding robustness to noise and channel distortions at a low bit rate. Our recent work on noise robustness for MiPad has been concentrated on the Aurora tasks.
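The bandwidth argument above can be illustrated with a back-of-the-envelope calculation. The following is only a sketch: the sampling rate, frame rate, number of cepstral coefficients, and bits per coefficient are illustrative assumptions, not figures taken from MiPad or from the Aurora proposals, and the function names are hypothetical.

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample):
    """Bit rate (bit/s) of uncompressed single-channel PCM speech."""
    return sample_rate_hz * bits_per_sample


def feature_bitrate(frames_per_second, coeffs_per_frame, bits_per_coeff):
    """Bit rate (bit/s) of a stream of scalar-quantized feature vectors."""
    return frames_per_second * coeffs_per_frame * bits_per_coeff


# Assumed values, chosen only for illustration:
# 16 kHz, 16-bit waveform versus 13 cepstral coefficients
# per 10 ms frame, each quantized to 8 bits.
raw = pcm_bitrate(sample_rate_hz=16_000, bits_per_sample=16)
feat = feature_bitrate(frames_per_second=100, coeffs_per_frame=13, bits_per_coeff=8)

print(f"raw PCM:  {raw} bit/s")
print(f"features: {feat} bit/s")
print(f"ratio:    {raw / feat:.1f}x")
```

Under these assumptions the feature stream is more than an order of magnitude smaller than the raw waveform. A speech codec would narrow the gap, and a dedicated feature quantizer would widen it again, so the ratio should be read only as a rough indication of why transmitting features is attractive.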
In DSR applications, it is easier to update software on the server because one cannot assume that the client is always run-
B. Basic Version of SPLICE
SPLICE is a frame-based bias removal algorithm for cepstrum enhancement under additive noise, channel distortion, or a combination of the two. In earlier work, we reported the approximate MAP formulation of the algorithm; more recently, we described the MMSE formulation of the algorithm with a much wider range of naturally occurring noises, including both artificially mixed speech and noise, and naturally recorded noisy speech.
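To make the idea of frame-based bias removal concrete, here is a minimal sketch of one common way to realize it. It assumes that the noisy cepstral space is partitioned softly by a diagonal-covariance Gaussian mixture, that each mixture component carries one constant bias vector learned from stereo (clean, noisy) frame pairs, and that each noisy frame is corrected by the posterior-weighted sum of those biases. These modeling choices and all function names are assumptions made for illustration; they are a simplified reading, not necessarily the exact formulation used in MiPad.

```python
import numpy as np


def gmm_posteriors(y, weights, means, variances):
    """Posterior p(s | y) per frame under a diagonal-covariance GMM.

    y: (T, D) noisy cepstra; weights: (S,); means, variances: (S, D).
    Returns an array of shape (T, S) whose rows sum to 1.
    """
    diff = y[:, None, :] - means[None, :, :]                      # (T, S, D)
    log_lik = -0.5 * np.sum(
        diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=2
    )                                                             # (T, S)
    log_post = np.log(weights) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)               # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


def train_biases(clean, noisy, post):
    """Per-component bias: posterior-weighted mean of (clean - noisy)."""
    num = post.T @ (clean - noisy)                                # (S, D)
    den = post.sum(axis=0)[:, None]                               # (S, 1)
    return num / np.maximum(den, 1e-10)


def enhance(noisy, post, biases):
    """Corrected frame: noisy frame plus posterior-weighted bias."""
    return noisy + post @ biases
```

At run time only `gmm_posteriors` and `enhance` are needed, and both operate one frame at a time, which matches the frame-based description above. Because the correction is learned directly from paired data, the sketch makes no assumption about how the cepstra were computed.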
SPLICE assumes no explicit noise model, and the noise characteristics are embedded in the piecewise linear mapping between the “stereo” clean and distorted speech cepstral vectors. The piecewise linearity is intended to approximate the true nonlinear relationship between the two. The nonlinearity between the clean and distorted (including additive noise) cepstral vectors arises from the use of the logarithm in computing the cepstra. Stereo data refers to simultaneously recorded waveforms of both clean and noisy speech. SPLICE is potentially able to handle a wide range of distortions, including nonstationary