DENG et al.: DISTRIBUTED SPEECH PROCESSING IN MiPad’S MULTIMODAL USER INTERFACE
[Figure: Noise-normalized SPLICE denoising using the iterative stochastic algorithm for tracking nonstationary noise, applied to an utterance of the Aurora2 data with an SNR of 5 dB. From top to bottom, the panels show noisy speech, clean speech, and denoised speech, all in the same spectrogram format.]
approaches taken can be classified into either the model-domain or the feature-domain category, each with its own strengths and weaknesses. The approach reported in this paper is unique in that it takes full advantage of the rich information embedded in the stereo training data, which most directly characterizes the relationship between the clean and noisy speech feature vectors. The approach is also unique in that a powerful noise-tracking algorithm is exploited to compensate effectively for possible mismatch between the operating condition and the condition under which the SPLICE parameters are trained. The feature-domain approach we have taken is based on the special DSR considerations of the MiPad architecture.
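To make the stereo-data idea concrete, the following is a minimal sketch of SPLICE-style denoising. It is an illustrative simplification, not the paper's implementation: synthetic stereo (clean/noisy) pairs stand in for real recordings, and hard VQ assignment replaces the GMM posteriors of the full SPLICE formulation. Per-region correction vectors are learned as the mean clean-minus-noisy difference over the stereo pairs in each region, then added to noisy features at runtime.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "stereo" training data: clean cepstra and the matching noisy
# cepstra observed under identical conditions (illustrative stand-in).
n, d, k = 2000, 13, 8
x_clean = rng.normal(size=(n, d))
x_noisy = x_clean + 0.5 + 0.3 * rng.normal(size=(n, d))  # bias + noise

def kmeans(data, k, iters=20):
    """Plain k-means; hard assignment stands in for GMM posteriors."""
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((data[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = data[assign == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, assign

# Step 1: partition the noisy feature space.
centers, assign = kmeans(x_noisy, k)

# Step 2: per-region correction vectors, the mean (clean - noisy)
# difference over the stereo pairs falling in each region.
corrections = np.zeros((k, d))
for j in range(k):
    pairs = assign == j
    if pairs.any():
        corrections[j] = (x_clean[pairs] - x_noisy[pairs]).mean(axis=0)

def splice_denoise(y):
    """Runtime: add the correction vector of the nearest region."""
    s = np.argmin(((y[:, None] - centers) ** 2).sum(-1), axis=1)
    return y + corrections[s]

denoised = splice_denoise(x_noisy)
err_before = np.abs(x_noisy - x_clean).mean()
err_after = np.abs(denoised - x_clean).mean()
```

On this toy data the learned corrections remove the additive bias, so the denoised features lie closer to the clean ones than the noisy input does.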
III. FEATURE COMPRESSION AND ERROR PROTECTION
In addition to noise robustness, we recently also started work on the feature compression (source coding) and error protection (channel coding) required by MiPad's client–server architecture. This work is intended to address the three key requirements for successful deployment of distributed speech recognition associated with the client–server approach: 1) compression of cepstral features (via quantization) must not degrade speech recognition performance; 2) the algorithm for source and channel coding must be robust to packet losses, bursty or otherwise; and 3) the total time delay due to coding, which results from the combined quantization delay, error-correction coding delay, and transmission delay, must be kept within an acceptable level. In this section, we outline the basic approach and preliminary results of this work.
A. Feature Compression
A new source coding algorithm has been developed that consists of two sequential stages. After the standard Mel-cepstra are extracted, each speech frame is first classified into a phonetic category (e.g., phoneme) and then vector quantized (VQ) using the split-VQ approach. The motivation behind this new source coder is that the speech signal is composed of piecewise-stationary segments and can therefore be coded most efficiently using one of many small codebooks, each tuned to a particular type of segment. Also, the purpose of the source coding considered here is to reduce the effect of coding on the speech recognizer's word error rate on the server side of MiPad, which is very different from the usual goal of source coding, namely maintaining the perceptual quality of speech. The use of phone-dependent codebooks is therefore deemed most appropriate, since phone distinction can be enhanced by using separate codebooks for distinct phones. Phone distinction often leads to word distinction, which is the goal of speech recognition and also the ultimate goal of feature compression in MiPad.
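The two-stage coder can be sketched as follows. This is a hypothetical toy version under stated assumptions: the phone inventory, codebook size, and random codebooks are invented for illustration (a real coder would train each codebook with k-means on frames of that phone), and only the split-VQ lookup itself follows the scheme described above, with the 13-dimensional cepstral vector split into the subvectors C0, C1–6, and C7–12.

```python
import numpy as np

rng = np.random.default_rng(1)

# Subvector index ranges for C0, C1-6, and C7-12 of a 13-dim Mel-cepstrum.
SPLITS = [(0, 1), (1, 7), (7, 13)]
PHONES = ["aa", "iy", "s"]   # toy phone inventory (assumed)
CB_SIZE = 16                 # entries per codebook (assumed), i.e. 4 bits

# Random stand-in codebooks, one per (phone, subvector) pair.
codebooks = {
    (p, i): rng.normal(size=(CB_SIZE, hi - lo))
    for p in PHONES
    for i, (lo, hi) in enumerate(SPLITS)
}

def encode(frame, phone):
    """Split-VQ encode one frame using the phone's own codebooks."""
    idx = []
    for i, (lo, hi) in enumerate(SPLITS):
        cb = codebooks[(phone, i)]
        # Nearest codebook entry for this subvector.
        idx.append(int(np.argmin(((cb - frame[lo:hi]) ** 2).sum(axis=1))))
    return idx  # one index per subvector

def decode(idx, phone):
    """Reassemble the quantized frame from the three codebook entries."""
    parts = [codebooks[(phone, i)][j] for i, j in enumerate(idx)]
    return np.concatenate(parts)

frame = rng.normal(size=13)
codes = encode(frame, "iy")
recon = decode(codes, "iy")
```

The transmitted payload per frame is then the phone label plus three small codebook indices, which is what makes the phone-dependent split-VQ design attractive for low-bit-rate DSR.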
One specific issue to be addressed in the coder design is bit allocation, or the number of bits that must be assigned to the subvector codebooks. In our coder, C0, C1–6, and C7–12 are
separate subvectors that are quantized independently. Starting from 0 bits for each subvector codebook of each phone